且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

在Python中将UTF-8转换为字符串文字

更新时间:2023-09-11 22:24:04

u''语法仅适用于字符串文字,例如在源代码中定义值.使用语法可以创建unicode对象,但这不是创建此类对象的唯一方法.

The u'' syntax only works for string literals, e.g. defining values in source code. Using the syntax results in a unicode object being created, but that's not the only way to create such an object.

不能通过在字节串前面添加u来从字节串中获取unicode值.但是,如果您使用正确的编码调用了str.decode(),则会得到一个unicode值.反之亦然,您可以使用unicode.encode() 编码 unicode对象到字节字符串.

You cannot make a unicode value from a byte string by adding u in front of it. But if you called str.decode() with the right encoding, you get a unicode value. Vice-versa, you can encode unicode objects to byte strings with unicode.encode().

请注意,在显示unicode对象时,Python再次使用Unicode字符串文字语法(因此是u'...')来表示 ,以简化调试.您可以将表示形式重新粘贴到Python解释器中,并获得具有相同值的对象.

Note that when displaying a unicode object, Python represents it by using the Unicode string literal syntax again (so u'...'), to ease debugging. You can paste the representation back in to a Python interpreter and get an object with the same value.

您的a值是使用字节字符串文字定义的,因此您只需要解码:

Your a value is defined using a byte string literal, so you only need to decode:

a = 'Entre\xc3\xa9'
b = a.decode('utf8')

您的第一个示例创建了 Mojibake ,这是一个Unicode字符串,其中包含实际表示的Latin-1代码点UTF-8字节.这就是为什么您必须先对Latin-1进行编码(以撤消Mojibake),然后再从UTF-8进行解码的原因.

Your first example created a Mojibake, a Unicode string containing Latin-1 codepoints that actually represent UTF-8 bytes. This is why you had to encode to Latin-1 first (to undo the Mojibake), then decode from UTF-8.

您可能想在 Unicode HOWTO 中阅读Python和Unicode. .其他有趣的文章是:

You may want to read up on Python and Unicode in the Unicode HOWTO. Other articles of interest are:

  • The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) by Joel Spolsky

实用的Unicode ,作者Ned Batchelder

Pragmatic Unicode by Ned Batchelder