Writing UTF-8 strings in my Python file

Updated: 2023-11-27 15:22:10

Let's examine that error message very closely:

"UnicodeDecodeError: 'utf8' codec can't decode bytes in position 8-13: unsupported Unicode code range"

Note carefully that it says "bytes in position 8-13": that's a 6-byte UTF-8 sequence. That might have been valid in the dark ages, but since Unicode was frozen at 21 bits, the maximum is FOUR bytes. UTF-8 validation and error reporting were tightened up recently; as a matter of interest, exactly what version of Python are you running?
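To make the "6-byte sequence" reading concrete, here is a minimal sketch of how a UTF-8 lead byte announces the length of its sequence under the original definition (RFC 2279), which allowed up to six bytes. The helper name announced_length is hypothetical and not part of Python; it simply counts the leading 1 bits of the byte.

```python
def announced_length(lead_byte):
    """Return the sequence length a lead byte announces under the
    original 6-byte UTF-8 scheme (hypothetical helper, for illustration)."""
    ones = 0
    mask = 0x80
    # Count consecutive 1 bits starting from the most significant bit.
    while mask and lead_byte & mask:
        ones += 1
        mask >>= 1
    if ones == 0:
        return 1  # 0xxxxxxx: plain ASCII, a single byte
    if ones == 1 or ones > 6:
        # 10xxxxxx is a continuation byte; 0xfe and 0xff were never lead bytes.
        raise ValueError("0x%02x is not a lead byte" % lead_byte)
    return ones


# The highest code point Unicode allows today needs only four bytes.
highest = u"\U0010FFFF"
print(len(highest.encode("utf-8")))

for lead in (0x41, 0xc3, 0xf4, 0xfc, 0xfd):
    print(hex(lead), announced_length(lead))
```

Only 0xfc and 0xfd announce six bytes, which is why an error spanning positions 8-13 points at one of those two bytes sitting at position 8.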

With 2.7.1 and 2.6.6 at least, that error becomes the more useful "... can't decode byte XXXX in position 8: invalid start byte", where XXXX can only be 0xfc or 0xfd if the old message suggested a 6-byte sequence. In ISO-8859-1 or cp1252, 0xfc represents U+00FC LATIN SMALL LETTER U WITH DIAERESIS (aka u-umlaut, a likely suspect); 0xfd represents U+00FD LATIN SMALL LETTER Y WITH ACUTE (less likely).
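The claim about those two bytes can be checked with a short sketch. It is written so that it should run on Python 3 as well as late Python 2 releases; the helper names is_valid_utf8 and describe_byte are hypothetical and exist only for this illustration.

```python
import unicodedata


def is_valid_utf8(raw):
    """Return True if the byte string decodes cleanly as UTF-8."""
    try:
        raw.decode("utf-8")
    except UnicodeDecodeError:
        return False
    return True


def describe_byte(raw, encoding):
    """Decode one byte with an 8-bit encoding and return the Unicode name."""
    return unicodedata.name(raw.decode(encoding))


for raw in (b"\xfc", b"\xfd"):
    # Neither byte can start a UTF-8 sequence, but both are ordinary
    # letters in ISO-8859-1 (latin-1) and cp1252.
    print(is_valid_utf8(raw))
    print(describe_byte(raw, "latin-1"))
    print(describe_byte(raw, "cp1252"))
```

If the input data turns out to be ISO-8859-1 or cp1252, a lone 0xfc is simply "ü", which would explain why the UTF-8 decoder rejects it.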

The problem is NOT with the if line.startswith(u"Fußnote"): statement in your source file. You would have got a message at COMPILE time if it wasn't proper UTF-8, and the message would have started with "SyntaxError", not "UnicodeDecodeError". In any case the UTF-8 encoding of that string is only 8 bytes long, not 14.
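The byte count is easy to verify. The sketch below writes the ß as the escape \xdf so that it does not depend on the encoding of the file it is pasted into; the variable names are only illustrative.

```python
# u"Fu\xdfnote" is the same text as u"Fußnote", with the sharp s escaped.
prefix = u"Fu\xdfnote"
encoded = prefix.encode("utf-8")

print(len(prefix))    # number of characters
print(len(encoded))   # number of bytes once encoded as UTF-8
print(repr(encoded))
```

Seven characters become eight bytes because only the ß needs two bytes (0xc3 0x9f). A literal that short cannot produce an error at positions 8-13, so the offending bytes must come from somewhere else.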

The problem is (as @Mark Tolonen has pointed out) in whatever "line" is referring to. It can only be a str object.

To get further you need to answer Mark's questions: (1) what print repr(line) shows, and (2) whether site.py was changed.

At this stage it's a good idea to clear the air about mixing str and unicode objects (in many operations, not just a.startswith(b)).

Unless the operation is defined to produce a str object, Python will NOT coerce the unicode object to str. a.startswith(b) is not such an operation, so Python will instead attempt to decode the str object using the default (usually 'ascii') encoding.

Examples:

>>> "\xff".startswith(u"\xab")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xff in position 0: ordinal not in range(128)

>>> u"\xff".startswith("\xab")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeDecodeError: 'ascii' codec can't decode byte 0xab in position 0: ordinal not in range(128)

Furthermore, it is NOT correct to say "Mix and you get UnicodeDecodeError". It is quite possible that the str object is validly encoded in the default encoding (usually 'ascii') -- no exception is raised.

Examples:

>>> "abc".startswith(u"\xff")
False
>>> u"\xff".startswith("abc")
False
>>>
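
One way to avoid depending on the default encoding is to decode byte strings explicitly before comparing. The following is a minimal sketch, not something taken from the question: the helper starts_with_text is hypothetical, and the encoding passed to it is an assumption the caller has to get right (print repr(line) is what reveals it).

```python
def starts_with_text(line, prefix, encoding):
    """Compare text with text: decode `line` first if it is a byte string.

    `encoding` must be the real encoding of the data; guessing wrong
    raises UnicodeDecodeError or silently yields the wrong characters.
    """
    if isinstance(line, bytes):  # on Python 2, bytes is an alias for str
        line = line.decode(encoding)
    return line.startswith(prefix)


prefix = u"Fu\xdfnote"  # u"Fußnote" with the sharp s escaped

print(starts_with_text(b"Fu\xdfnote 1", prefix, "latin-1"))
print(starts_with_text(b"Fu\xc3\x9fnote 1", prefix, "utf-8"))
print(starts_with_text(u"abc", prefix, "utf-8"))
```

The point of the design is that the decoding step becomes visible and uses an encoding chosen by the caller, instead of happening implicitly with 'ascii' inside startswith.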