且构网

Manually converting Unicode codepoints into UTF-8 and UTF-16

Updated: 2023-11-14 20:09:22

Wow. On the one hand I'm thrilled to know that university courses are teaching to the reality that character encodings are hard work, but actually knowing the UTF-8 encoding rules sounds like expecting a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far of the rules for encoding UCS codepoints to UTF-8 is from the utf-8(7) manpage on many Linux systems:

Encoding
   The following byte sequences are used to represent a
   character.  The sequence to be used depends on the UCS code
   number of the character:

   0x00000000 - 0x0000007F:
       0xxxxxxx

   0x00000080 - 0x000007FF:
       110xxxxx 10xxxxxx

   0x00000800 - 0x0000FFFF:
       1110xxxx 10xxxxxx 10xxxxxx

   0x00010000 - 0x001FFFFF:
       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

   [... removed obsolete five and six byte forms ...]

   The xxx bit positions are filled with the bits of the
   character code number in binary representation.  Only the
   shortest possible multibyte sequence which can represent the
   code number of the character can be used.

   The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
   as 0xfffe and 0xffff (UCS noncharacters) should not appear in
   conforming UTF-8 streams.
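
To make the chart concrete, here is a minimal Python sketch of it. The function name `utf8_encode` is made up for illustration, and the sketch follows the quoted chart literally (it accepts values up to 0x1FFFFF and rejects only the surrogate range), so treat it as a study aid rather than a conforming encoder.

```python
def utf8_encode(codepoint: int) -> bytes:
    """Encode one code point per the quoted chart (hypothetical helper)."""
    if codepoint < 0:
        raise ValueError("code points are non-negative")
    if 0xD800 <= codepoint <= 0xDFFF:
        raise ValueError("UTF-16 surrogates should not appear in UTF-8")

    # Pick the shortest form whose range contains the code point.
    if codepoint <= 0x7F:
        return bytes([codepoint])          # 0xxxxxxx
    if codepoint <= 0x7FF:
        lead, extra = 0b11000000, 1        # 110xxxxx + 1 subsequent byte
    elif codepoint <= 0xFFFF:
        lead, extra = 0b11100000, 2        # 1110xxxx + 2 subsequent bytes
    elif codepoint <= 0x1FFFFF:
        lead, extra = 0b11110000, 3        # 11110xxx + 3 subsequent bytes
    else:
        raise ValueError("too large for the four-byte form")

    # Peel off six bits at a time from the right for the 10xxxxxx bytes.
    out = []
    for _ in range(extra):
        out.append(0b10000000 | (codepoint & 0b111111))
        codepoint >>= 6
    # Whatever is left fits in the x positions of the initial byte.
    out.append(lead | codepoint)
    return bytes(reversed(out))


print(utf8_encode(0x4E3E).hex(" "))
```

For valid Unicode characters the result should agree with Python's built-in `chr(cp).encode("utf-8")`, which is a handy way to check hand-worked answers.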

It might be easier to remember a 'compressed' version of the chart:

Initial bytes of mangled codepoints start with a 1, followed by padding of the form 1+0 (one or more 1s, then a 0). Subsequent bytes start with 10.

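from 0x80      5 bits in the initial byte, one subsequent byte
from 0x800     4 bits in the initial byte, two subsequent bytes
from 0x10000   3 bits in the initial byte, three subsequent bytes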
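
The same pattern can be read off an initial byte: count the leading 1s to get the length of the sequence, and what is left after the padding is the number of x bits. A small sketch, assuming a hypothetical helper named `describe_initial_byte` and ignoring the obsolete five- and six-byte forms:

```python
def describe_initial_byte(first_byte: int) -> tuple[int, int]:
    """Return (total bytes in the sequence, x bits in the initial byte)."""
    if not 0 <= first_byte <= 0xFF:
        raise ValueError("expected a single byte value")
    if first_byte & 0b10000000 == 0:
        return 1, 7                      # 0xxxxxxx: plain ASCII

    # Count the leading 1 bits of the padding.
    ones = 0
    mask = 0b10000000
    while first_byte & mask:
        ones += 1
        mask >>= 1

    if ones == 1:
        raise ValueError("10xxxxxx is a subsequent byte, not an initial byte")
    if ones > 4:
        raise ValueError("five- and six-byte forms are obsolete")
    # The padding uses `ones` 1 bits plus one 0 bit; the rest are x bits.
    return ones, 7 - ones


print(describe_initial_byte(0xE4))
```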

You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:

2**(5+1*6) == 2048       == 0x800
2**(4+2*6) == 65536      == 0x10000
2**(3+3*6) == 2097152    == 0x200000
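
The arithmetic above is easy to reproduce. In this sketch `upper_limit` is a made-up name; it returns the first codepoint that no longer fits in one initial byte plus the given number of subsequent bytes, using the fact that each extra subsequent byte costs the initial byte one more bit of padding.

```python
def upper_limit(subsequent_bytes: int) -> int:
    """First code point too large for 1 initial + N subsequent bytes."""
    if subsequent_bytes == 0:
        return 2 ** 7                    # 0xxxxxxx holds 7 bits
    if not 1 <= subsequent_bytes <= 3:
        raise ValueError("only the forms up to four bytes are covered")
    # Padding is (N + 1) ones and a zero, leaving 6 - N x bits up front.
    initial_bits = 6 - subsequent_bytes
    return 2 ** (initial_bits + subsequent_bytes * 6)


for n in range(4):
    print(n, "subsequent byte(s): below", hex(upper_limit(n)))
```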

I know I could remember the rules to derive the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)

Update

Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:

U+4E3E

This fits in the 0x00000800 - 0x0000FFFF range (0x4E3E < 0xFFFF), so the representation will be of the form:

   1110xxxx 10xxxxxx 10xxxxxx

0x4E3E is 100111000111110b. Drop the bits into the x positions above (start from the right; we'll fill in missing bits at the start with 0):

   1110x100 10111000 10111110

There is one x spot left over at the start; fill it in with 0:

   11100100 10111000 10111110

Convert the bits to hex:

   0xE4 0xB8 0xBE
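
The hand conversion above can be mirrored step by step with bit strings. This is only a sketch: `fill_template` is a hypothetical helper, and it expects you to have already picked the right template for the codepoint's range.

```python
def fill_template(codepoint: int, template: str) -> bytes:
    """Drop a code point's bits into the x positions of a byte template."""
    bits = format(codepoint, "b")        # hex -> binary digits
    slots = template.count("x")
    if len(bits) > slots:
        raise ValueError("code point does not fit in this template")
    # Padding on the left with 0 is the same as filling from the right
    # and then zeroing whatever x positions are left over at the start.
    bits = bits.zfill(slots)
    supply = iter(bits)
    filled = "".join(next(supply) if ch == "x" else ch for ch in template)
    # Convert each group of eight binary digits back to a byte.
    return bytes(int(octet, 2) for octet in filled.split())


encoded = fill_template(0x4E3E, "1110xxxx 10xxxxxx 10xxxxxx")
print(encoded.hex(" "))
```

Comparing the result with `"\u4e3e".encode("utf-8")` is a quick sanity check on the manual steps.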