且构网

Java - What are characters, code points, and surrogates? What is the difference between them?

Updated: 2022-02-18 03:58:17

To represent text in a computer, you have to solve two problems: first, you have to map symbols to numbers; then, you have to represent sequences of those numbers as bytes.

A code point is a number that identifies a symbol. Two well-known standards for assigning numbers to symbols are ASCII and Unicode. ASCII defines 128 symbols. Unicode defined 109384 symbols at the time of writing, which is far more than 2^16 (65,536).
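As a small illustration of code points in Java, the sketch below looks up the number assigned to two symbols. The class name `CodePointDemo` and the helper `firstCodePoint` are made up for this example; `String.codePointAt` and `Character.toChars` are standard library methods.

```java
class CodePointDemo {
    // Returns the Unicode code point of the first symbol in the text.
    // (Hypothetical helper, named for this example only.)
    static int firstCodePoint(String text) {
        return text.codePointAt(0);
    }

    public static void main(String[] args) {
        String latinA = "A";
        // Build a string holding the single symbol U+1F600 (an emoji).
        String grinningFace = new String(Character.toChars(0x1F600));

        System.out.printf("U+%04X%n", firstCodePoint(latinA));       // prints U+0041
        System.out.printf("U+%04X%n", firstCodePoint(grinningFace)); // prints U+1F600

        // U+1F600 lies above 0xFFFF, so it cannot be stored in 16 bits.
        System.out.println(firstCodePoint(grinningFace) > 0xFFFF);   // prints true
    }
}
```

The first symbol is also an ASCII symbol (number 65); the second exists only in Unicode and has a number larger than any 16-bit value.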

Furthermore, ASCII specifies that sequences of numbers are represented with one byte per number, while Unicode specifies several possible encodings, such as UTF-8, UTF-16, and UTF-32.
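To see how the choice of encoding changes the byte representation, the following sketch measures the encoded size of a few one-symbol strings. The class and method names are hypothetical. The UTF-32 size is computed rather than encoded, on the assumption that UTF-32 always uses four bytes per code point, because `StandardCharsets` does not include a UTF-32 constant.

```java
import java.nio.charset.StandardCharsets;

class EncodingSizeDemo {
    // Number of bytes needed to encode the text as UTF-8.
    static int utf8Length(String text) {
        return text.getBytes(StandardCharsets.UTF_8).length;
    }

    // Number of bytes needed to encode the text as UTF-16 (big-endian, no byte order mark).
    static int utf16Length(String text) {
        return text.getBytes(StandardCharsets.UTF_16BE).length;
    }

    // UTF-32 uses a fixed four bytes per code point, so the size is computed directly.
    static int utf32Length(String text) {
        return 4 * text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // "A" (U+0041), "é" (U+00E9), "€" (U+20AC), and the emoji U+1F600.
        String[] samples = { "A", "\u00E9", "\u20AC", "\uD83D\uDE00" };
        for (String s : samples) {
            System.out.printf("U+%04X: UTF-8=%d, UTF-16=%d, UTF-32=%d bytes%n",
                    s.codePointAt(0), utf8Length(s), utf16Length(s), utf32Length(s));
        }
    }
}
```

For these four samples, UTF-8 takes 1, 2, 3, and 4 bytes, UTF-16 takes 2, 2, 2, and 4 bytes, and UTF-32 takes 4 bytes each.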

When you use an encoding whose basic unit has fewer bits than are needed to represent all possible values (such as UTF-16, whose units are 16 bits wide), you need a workaround.

Thus, surrogates are 16-bit values that are used in pairs to represent symbols that do not fit into a single two-byte value.
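The sketch below shows how UTF-16 splits a code point above 0xFFFF into two surrogates: a high surrogate in the range 0xD800 to 0xDBFF followed by a low surrogate in the range 0xDC00 to 0xDFFF. The class `SurrogateDemo` and the method `toSurrogatePair` are hypothetical names for illustration. In real code, `Character.toChars` does the same job, and the example checks its result against it.

```java
import java.util.Arrays;

class SurrogateDemo {
    // Splits a supplementary code point (above 0xFFFF) into its UTF-16 surrogate pair.
    static char[] toSurrogatePair(int codePoint) {
        if (!Character.isSupplementaryCodePoint(codePoint)) {
            throw new IllegalArgumentException("Code point fits in one char: " + codePoint);
        }
        int offset = codePoint - 0x10000;                // 20 significant bits remain
        char high = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new char[] { high, low };
    }

    public static void main(String[] args) {
        char[] pair = toSurrogatePair(0x1F600);
        System.out.printf("%04X %04X%n", (int) pair[0], (int) pair[1]); // prints D83D DE00

        // The standard library produces the same pair and can reverse the split.
        System.out.println(Arrays.equals(pair, Character.toChars(0x1F600)));      // prints true
        System.out.printf("U+%04X%n", Character.toCodePoint(pair[0], pair[1]));   // prints U+1F600
    }
}
```

Because the surrogate ranges are reserved and never assigned to symbols of their own, a decoder can always tell a surrogate apart from an ordinary 16-bit value.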

Java uses UTF-16 internally to represent text.

In particular, a char (character) is an unsigned two-byte value that holds a single UTF-16 code unit.
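A practical consequence is that the number of `char` values in a `String` can differ from the number of symbols it contains. The following sketch, with the hypothetical class name `CharVsCodePointDemo`, compares the two counts using only standard `String` and `Character` methods.

```java
class CharVsCodePointDemo {
    // Number of 16-bit char values (UTF-16 code units) in the text.
    static int charCount(String text) {
        return text.length();
    }

    // Number of Unicode code points (symbols) in the text.
    static int codePointCount(String text) {
        return text.codePointCount(0, text.length());
    }

    public static void main(String[] args) {
        // char is unsigned: its largest value is 65535.
        System.out.println((int) Character.MAX_VALUE); // prints 65535

        // "A" followed by U+1F600, which is written here as a surrogate pair.
        String text = "A\uD83D\uDE00";
        System.out.println(charCount(text));      // prints 3
        System.out.println(codePointCount(text)); // prints 2

        // Iterate by code point rather than by char to see whole symbols.
        text.codePoints().forEach(cp -> System.out.printf("U+%04X%n", cp));
    }
}
```

Iterating with `charAt` would visit the two surrogates separately, so code that must handle symbols outside the 16-bit range is usually safer working with code points.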

If you want to learn more about Java and Unicode, I can recommend this newsletter: Part 1, Part 2