Updated: 2023-09-11 22:24:04
The u'' syntax only works for string literals, i.e. values defined directly in source code. Using the syntax creates a unicode object, but it is not the only way to create one.
You cannot make a unicode value from an existing byte string by adding u in front of it. But if you call str.decode() with the right encoding, you get a unicode value. Vice versa, you can encode unicode objects to byte strings with unicode.encode().
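As a rough illustration of that round trip, here is a minimal sketch. It assumes Python 3, where bytes plays the role of Python 2's str and str plays the role of Python 2's unicode; the variable names are made up for the example.

```python
# Assumption: Python 3 naming. bytes ~ Python 2 str, str ~ Python 2 unicode.
raw = b'Entre\xc3\xa9'              # UTF-8 encoded byte string

text = raw.decode('utf-8')          # byte string -> Unicode text
round_trip = text.encode('utf-8')   # Unicode text -> byte string

print(len(raw), len(text))          # 7 bytes, but only 6 characters
print(round_trip == raw)            # True: encoding undoes the decoding
```

The length difference shows why the two types are not interchangeable: the two bytes \xc3\xa9 decode to the single character U+00E9.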
Note that when displaying a unicode object, Python represents it using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back into a Python interpreter and get an object with the same value.
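The "paste it back" property can be checked programmatically. The sketch below assumes Python 3 and uses ascii(), which is the closest Python 3 analogue of the escaped output Python 2's repr() produces for unicode objects (minus the u prefix); ast.literal_eval stands in for pasting the text into an interpreter.

```python
import ast

# Assumption: Python 3; ascii() mimics Python 2's escaped repr() output.
text = b'Entre\xc3\xa9'.decode('utf-8')

shown = ascii(text)                 # the debugging representation, as text
rebuilt = ast.literal_eval(shown)   # evaluate it as a string literal again

print(shown)                        # 'Entre\xe9'
print(rebuilt == text)              # True
```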
Your a value is defined using a byte string literal, so you only need to decode:
a = 'Entre\xc3\xa9'
b = a.decode('utf8')
Your first example created Mojibake: a Unicode string whose code points are in the Latin-1 range but actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake) and then decode from UTF-8.
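To make the Mojibake repair concrete, here is a small sketch, again assuming Python 3. It produces the garbled string by decoding UTF-8 bytes with the wrong codec, which is one common way such strings arise and not necessarily how the question's example was built.

```python
# Assumption: Python 3. Decoding UTF-8 bytes as Latin-1 maps every byte
# to one code point, so the two-byte sequence becomes two characters.
garbled = b'Entre\xc3\xa9'.decode('latin-1')

# Encoding to Latin-1 recovers the original bytes; decoding those as
# UTF-8 then yields the intended text.
fixed = garbled.encode('latin-1').decode('utf-8')

print(ascii(garbled))   # 'Entre\xc3\xa9'
print(ascii(fixed))     # 'Entre\xe9'
```

Latin-1 works for the undo step because it maps code points 0 to 255 one-to-one onto byte values, so no information is lost on the way back.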
You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:
The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky
Pragmatic Unicode by Ned Batchelder