且构网

UTF-8 HTML and CSS files with BOM (and how to remove the BOM with Python)

Updated: 2023-11-27 18:02:19

First, some background: I'm developing a web application using Python. All of my (text) files are currently stored in UTF-8 with the BOM. This includes all my HTML templates and CSS files. These resources are stored as binary data (BOM and all) in my DB.

When I retrieve the templates from the DB, I decode them using template.decode('utf-8'). When the HTML arrives in the browser, the BOM is present at the beginning of the HTTP response body. This generates a very interesting error in Chrome:

Extra <html> encountered. Migrating attributes back to the original <html> element and ignoring the tag.

Chrome seems to generate an <html> tag automatically when it sees the BOM and mistakes it for content, making the real <html> tag an error.

So, using Python, what is the best way to remove the BOM from my UTF-8 encoded templates (if it exists -- I can't guarantee this in the future)?

For other text-based files like CSS, will major browsers correctly interpret (or ignore) the BOM? They are being sent as plain binary data without .decode('utf-8').

Note: I am using Python 2.5.

Thanks!

Since you state:

All of my (text) files are currently stored in UTF-8 with the BOM

then use the 'utf-8-sig' codec to decode them:

>>> s = u'Hello, world!'.encode('utf-8-sig')
>>> s
'\xef\xbb\xbfHello, world!'
>>> s.decode('utf-8-sig')
u'Hello, world!'

The codec strips a leading BOM when one is present, and it decodes correctly when there is no BOM at all.
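As a rough sketch of how this might be wrapped up for the templates and for assets that are served as raw bytes (such as the CSS files), something like the following could work. The helper names `decode_template` and `strip_bom` are made up for illustration and are not part of any library; only `codecs.BOM_UTF8` and the 'utf-8-sig' codec come from the standard library.

```python
import codecs


def decode_template(raw):
    """Decode UTF-8 bytes to text, dropping a leading BOM if there is one.

    Hypothetical helper: 'utf-8-sig' skips the BOM when it is present
    and otherwise behaves like plain 'utf-8'.
    """
    return raw.decode('utf-8-sig')


def strip_bom(raw):
    """Return the same bytes without a leading UTF-8 BOM.

    Hypothetical helper for files that are sent as binary data
    and never decoded.
    """
    if raw.startswith(codecs.BOM_UTF8):
        return raw[len(codecs.BOM_UTF8):]
    return raw


# Sample data standing in for what comes out of the DB.
with_bom = codecs.BOM_UTF8 + u'<html></html>'.encode('utf-8')
without_bom = u'<html></html>'.encode('utf-8')

html = decode_template(with_bom)   # text, BOM removed
css_bytes = strip_bom(with_bom)    # bytes, BOM removed
```

Both helpers leave input without a BOM unchanged, so they should be safe to apply even if some files are later saved without one.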