Unicode vs UTF-8 confusion in Python / Django?

Updated: 2023-12-03 22:27:58


what is a "Unicode string" in Python? Does that mean UCS-2?

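Unicode strings in Python 2, and in Python 3 before 3.3, are stored internally either as UCS-2 (a 16-bit representation, in practice UTF-16, since surrogate pairs can appear in it) or as UCS-4/UTF-32 (fixed-length 32-bit representation). It's a compile-time option; on Windows it's always UTF-16, whilst many Linux distributions set UTF-32 (‘wide mode’) for their versions of Python. From Python 3.3 onwards (PEP 393) the build option is gone: the interpreter picks a compact internal representation per string, and every string behaves as a sequence of full code points.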
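To check which kind of interpreter you are running, `sys.maxunicode` reports the largest code point a string element can hold. The following is a minimal sketch; `describe_build` is a made-up helper name for illustration, not part of Python.

```python
import sys

def describe_build(maxunicode=sys.maxunicode):
    """Classify an interpreter by its largest supported code point.

    describe_build is a hypothetical helper used only for this example.
    """
    if maxunicode == 0xFFFF:
        # Narrow build: string elements are 16-bit UTF-16 code units.
        return "narrow (UTF-16 code units)"
    if maxunicode == 0x10FFFF:
        # Wide build, or any Python 3.3+ interpreter.
        return "wide (full code point range)"
    return "unknown"

print(hex(sys.maxunicode), describe_build())
```

On Python 3.3 and later this should always report the full range, regardless of platform.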

You are generally not supposed to care: you will see Unicode code points as single elements in your strings and you won't know whether they're stored as two or four bytes. If you're on a UTF-16 build and you need to handle characters outside the Basic Multilingual Plane, each such character shows up as a surrogate pair (two elements), so indexing, slicing and len() will be Doing It Wrong. That's still very rare, and users who really need the extra characters should be compiling wide builds.
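The difference between code points and UTF-16 code units can be seen without a narrow build by encoding to UTF-16 and counting 16-bit units. This is a small sketch assuming Python 3.3 or later, where `len()` counts code points; `utf16_units` is a helper name invented for the example.

```python
def utf16_units(text):
    """Return how many 16-bit code units the text needs in UTF-16."""
    # utf-16-le writes no byte order mark, so bytes / 2 is the unit count.
    return len(text.encode("utf-16-le")) // 2

bmp = "\u00e9"         # U+00E9, inside the Basic Multilingual Plane
astral = "\U0001D11E"  # U+1D11E, outside the Basic Multilingual Plane

# One code point each, but the second needs a surrogate pair in UTF-16.
print(len(bmp), utf16_units(bmp))
print(len(astral), utf16_units(astral))
```

The second count is what `len()` would have reported for that character on an old narrow build.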


plain wrong, or is it?



Yes, it's quite wrong. To be fair I think that tutorial is rather old; it probably pre-dates wide Unicode strings, if not Unicode 3.1 (the version that introduced characters outside the Basic Multilingual Plane).

There is an additional source of confusion stemming from Windows's habit of using the term "Unicode" to mean, specifically, the UTF-16LE encoding that NT uses internally. People from Microsoftland may often copy this somewhat misleading habit.
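To make the distinction concrete: a Unicode string is a sequence of code points, while UTF-8 and UTF-16LE are two different ways of turning that same sequence into bytes. A minimal sketch, assuming Python 3 string semantics:

```python
text = "h\u00e9"  # 'h' followed by U+00E9 (e with acute accent)

# The same two code points, serialized with two different encodings.
as_utf8 = text.encode("utf-8")
as_utf16le = text.encode("utf-16-le")

print(as_utf8)     # 3 bytes: U+00E9 takes two bytes in UTF-8
print(as_utf16le)  # 4 bytes: every BMP character takes two bytes

# Decoding with the matching codec gives back the identical string.
assert as_utf8.decode("utf-8") == as_utf16le.decode("utf-16-le") == text
```

What Windows APIs call "Unicode" corresponds to the second byte sequence, not to the abstract string.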