Updated: 2023-11-14 20:18:40
Wow. On the one hand, I'm thrilled to know that university courses are teaching the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)
The clearest description I've seen so far of the rules for encoding UCS codepoints to UTF-8 is from the utf-8(7) manpage on many Linux systems:
Encoding
The following byte sequences are used to represent a
character. The sequence to be used depends on the UCS code
number of the character:
0x00000000 - 0x0000007F:
    0xxxxxxx
0x00000080 - 0x000007FF:
    110xxxxx 10xxxxxx
0x00000800 - 0x0000FFFF:
    1110xxxx 10xxxxxx 10xxxxxx
0x00010000 - 0x001FFFFF:
    11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
[... removed obsolete five and six byte forms ...]
The xxx bit positions are filled with the bits of the
character code number in binary representation. Only the
shortest possible multibyte sequence which can represent the
code number of the character can be used.
The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
as 0xfffe and 0xffff (UCS noncharacters) should not appear in
conforming UTF-8 streams.
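If it helps to see the table as code, here is a minimal Python sketch of the "which row am I in?" step. The function name `utf8_length` is my own invention, not anything from the manpage, and the upper bound simply follows the quoted table (current Unicode stops earlier, at 0x10FFFF).

```python
def utf8_length(cp: int) -> int:
    """Return the number of bytes in the shortest UTF-8 form of code point cp.

    Hypothetical helper that follows the quoted utf-8(7) table row by row.
    """
    if cp < 0:
        raise ValueError("code points are non-negative")
    # Values the quoted manpage says should not appear in conforming streams.
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("UTF-16 surrogate")
    if cp in (0xFFFE, 0xFFFF):
        raise ValueError("UCS noncharacter")
    if cp <= 0x7F:
        return 1  # 0xxxxxxx
    if cp <= 0x7FF:
        return 2  # 110xxxxx 10xxxxxx
    if cp <= 0xFFFF:
        return 3  # 1110xxxx 10xxxxxx 10xxxxxx
    if cp <= 0x1FFFFF:
        return 4  # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    raise ValueError("outside the ranges in the quoted table")


print(utf8_length(0x4E3E))  # 3
```

Checking the ranges from smallest to largest is what enforces the "shortest possible sequence" rule: a code point never reaches a longer form if a shorter one fits.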
It might be easier to remember a 'compressed' version of the chart:
The initial byte of a mangled (multibyte) codepoint starts with a 1, followed by padding of the form 1+0: one or more further 1 bits, then a 0. Subsequent bytes start with 10.
0x80       5 bits in the initial byte, one subsequent byte
0x800      4 bits in the initial byte, two subsequent bytes
0x10000    3 bits in the initial byte, three subsequent bytes
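The prefix rules can be sketched as a small byte classifier. This is only an illustration; `classify_byte` and its return labels are names I made up, and it covers just the one- to four-byte forms from the chart.

```python
def classify_byte(b: int) -> str:
    """Classify one byte (0..255) by its leading bits.

    Hypothetical helper; the labels are illustrative only.
    """
    if b & 0b10000000 == 0b00000000:
        return "single"        # 0xxxxxxx: a complete one-byte character
    if b & 0b11000000 == 0b10000000:
        return "continuation"  # 10xxxxxx: a subsequent byte
    if b & 0b11100000 == 0b11000000:
        return "lead2"         # 110xxxxx: 5 payload bits, one byte follows
    if b & 0b11110000 == 0b11100000:
        return "lead3"         # 1110xxxx: 4 payload bits, two bytes follow
    if b & 0b11111000 == 0b11110000:
        return "lead4"         # 11110xxx: 3 payload bits, three bytes follow
    return "invalid"           # not a prefix used by the forms shown here


print([classify_byte(b) for b in bytes([0xE4, 0xB8, 0xBE])])
# ['lead3', 'continuation', 'continuation']
```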
You can derive the ranges by noting how many values fit in the payload bits that each representation allows (the bits in the initial byte plus 6 bits per subsequent byte):
2**(5+1*6) == 2048 == 0x800
2**(4+2*6) == 65536 == 0x10000
2**(3+3*6) == 2097152 == 0x200000
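The same arithmetic as a short Python loop, in case you would rather compute the limits than memorize them. The name `range_limits` is just for illustration; I have also included the one-byte row (7 payload bits, no subsequent bytes), which gives the 0x80 boundary.

```python
def range_limits():
    """Return the first code point that no longer fits each UTF-8 form."""
    bits_per_subsequent_byte = 6
    # (payload bits in the initial byte, number of subsequent bytes)
    shapes = [(7, 0), (5, 1), (4, 2), (3, 3)]
    return [2 ** (lead_bits + n * bits_per_subsequent_byte)
            for lead_bits, n in shapes]


for limit in range_limits():
    print(limit, hex(limit))
# 128 0x80
# 2048 0x800
# 65536 0x10000
# 2097152 0x200000
```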
I know I could remember the rules to derive the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)
Update
Once you have built the chart above, you can convert an input Unicode codepoint to UTF-8 by finding its range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:
U+4E3E
This fits in the 0x00000800 - 0x0000FFFF range (0x4E3E < 0xFFFF), so the representation will be of the form:
1110xxxx 10xxxxxx 10xxxxxx
0x4E3E is 100111000111110b. Drop the bits into the x positions above (start from the right; we'll fill in missing bits at the start with 0):
1110x100 10111000 10111110
There is one x spot left over at the start; fill it in with 0:
11100100 10111000 10111110
Convert the bits to hexadecimal:
0xE4 0xB8 0xBE
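For anyone who wants to check their hand conversions, the steps above can be sketched in Python roughly like this. `encode_utf8` is a hypothetical name of mine; the sketch fills the x positions from the right, 6 bits at a time, and deliberately skips the surrogate and noncharacter checks mentioned in the manpage excerpt. In real code you would just call `str.encode("utf-8")`.

```python
def encode_utf8(cp: int) -> bytes:
    """Encode one code point by hand, following the chart (illustrative only)."""
    if cp < 0:
        raise ValueError("code points are non-negative")
    if cp < 0x80:
        return bytes([cp])  # 0xxxxxxx: the value is the byte
    if cp < 0x800:
        lead_prefix, n_subsequent = 0b11000000, 1
    elif cp < 0x10000:
        lead_prefix, n_subsequent = 0b11100000, 2
    elif cp < 0x200000:
        lead_prefix, n_subsequent = 0b11110000, 3
    else:
        raise ValueError("outside the ranges in the chart")

    tail = []
    for _ in range(n_subsequent):
        # Take the lowest 6 bits and put them behind a 10 prefix.
        tail.append(0b10000000 | (cp & 0b111111))
        cp >>= 6
    # Whatever bits remain go into the initial byte; unused x spots stay 0.
    return bytes([lead_prefix | cp] + tail[::-1])


print(encode_utf8(0x4E3E).hex())  # e4b8be
```

Comparing against the built-in encoder, `encode_utf8(0x4E3E) == chr(0x4E3E).encode("utf-8")`, is a quick way to confirm the worked example.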