
Manually converting Unicode code points to UTF-8 and UTF-16

Updated: 2023-11-14 20:18:40

Wow. On the one hand, I'm thrilled to know that university courses are teaching the reality that character encodings are hard work, but expecting students to actually know the UTF-8 encoding rules sounds like a lot. (Will it help students pass the Turkey test?)

The clearest description I've seen so far of the rules for encoding UCS code points to UTF-8 is from the utf-8(7) manpage on many Linux systems:

Encoding
   The following byte sequences are used to represent a
   character.  The sequence to be used depends on the UCS code
   number of the character:

   0x00000000 - 0x0000007F:
       0xxxxxxx

   0x00000080 - 0x000007FF:
       110xxxxx 10xxxxxx

   0x00000800 - 0x0000FFFF:
       1110xxxx 10xxxxxx 10xxxxxx

   0x00010000 - 0x001FFFFF:
       11110xxx 10xxxxxx 10xxxxxx 10xxxxxx

   [... removed obsolete five and six byte forms ...]

   The xxx bit positions are filled with the bits of the
   character code number in binary representation.  Only the
   shortest possible multibyte sequence which can represent the
   code number of the character can be used.

   The UCS code values 0xd800–0xdfff (UTF-16 surrogates) as well
   as 0xfffe and 0xffff (UCS noncharacters) should not appear in
   conforming UTF-8 streams.
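
As a rough illustration of how the chart translates into code, here is a minimal Python sketch. The function name `utf8_encode` is made up for this example. It assumes an upper bound of 0x10FFFF (the limit of Unicode itself) rather than the chart's 0x1FFFFF, and it rejects surrogates but does not reject 0xFFFE or 0xFFFF.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one code point by hand, following the chart (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF:
        # Assumption: cap at the Unicode limit, not the chart's 0x1FFFFF.
        raise ValueError(f"code point out of range: {cp:#x}")
    if 0xD800 <= cp <= 0xDFFF:
        raise ValueError("UTF-16 surrogates must not appear in UTF-8")
    if cp <= 0x7F:
        # 0xxxxxxx
        return bytes([cp])
    if cp <= 0x7FF:
        # 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp <= 0xFFFF:
        # 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([
            0xE0 | (cp >> 12),
            0x80 | ((cp >> 6) & 0x3F),
            0x80 | (cp & 0x3F),
        ])
    # 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
    return bytes([
        0xF0 | (cp >> 18),
        0x80 | ((cp >> 12) & 0x3F),
        0x80 | ((cp >> 6) & 0x3F),
        0x80 | (cp & 0x3F),
    ])
```

Checking the ranges from smallest to largest means the first match is always the shortest form, which is what the "shortest possible multibyte sequence" rule requires.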

It might be easier to remember a 'compressed' version of the chart:

The initial byte of a mangled (multi-byte) code point starts with a 1, followed by padding of the form 1+0: one or more 1 bits, then a 0. Subsequent bytes start with 10.

0x80      5 bits in the initial byte, one subsequent byte
0x800     4 bits in the initial byte, two subsequent bytes
0x10000   3 bits in the initial byte, three subsequent bytes
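
The prefix rule can also be read in the other direction, to tell what role a given byte plays. The following is only a sketch; the function name `classify_byte` and the labels it returns are invented for this illustration.

```python
def classify_byte(b: int) -> str:
    """Classify one byte by its leading bits (hypothetical helper)."""
    if (b & 0x80) == 0x00:
        return "single"        # 0xxxxxxx: plain ASCII
    if (b & 0xC0) == 0x80:
        return "continuation"  # 10xxxxxx: subsequent byte
    if (b & 0xE0) == 0xC0:
        return "lead-2"        # 110xxxxx: 5 payload bits, one more byte
    if (b & 0xF0) == 0xE0:
        return "lead-3"        # 1110xxxx: 4 payload bits, two more bytes
    if (b & 0xF8) == 0xF0:
        return "lead-4"        # 11110xxx: 3 payload bits, three more bytes
    return "invalid"           # 11111xxx: not used by the forms above


roles = [classify_byte(b) for b in bytes([0xE4, 0xB8, 0xBE])]
```

For the three bytes in the usage line, `roles` is `["lead-3", "continuation", "continuation"]`.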

You can derive the ranges by taking note of how much space you can fill with the bits allowed in the new representation:

2**(5+1*6) == 2048       == 0x800
2**(4+2*6) == 65536      == 0x10000
2**(3+3*6) == 2097152    == 0x200000
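
The same arithmetic can be written as a short loop. This is just a sketch; `range_limits` is a name made up for the example. Each result is the first value that no longer fits in that form, so the 21-bit form tops out at 0x1FFFFF.

```python
def range_limits():
    """Return (subsequent bytes, payload bits, exclusive limit) per form."""
    limits = []
    for subsequent_bytes in (1, 2, 3):
        # The initial byte loses one payload bit for each extra byte: 5, 4, 3.
        initial_bits = 6 - subsequent_bytes
        # Every subsequent byte (10xxxxxx) carries 6 payload bits.
        payload_bits = initial_bits + 6 * subsequent_bytes
        limits.append((subsequent_bytes, payload_bits, 2 ** payload_bits))
    return limits


for subsequent_bytes, payload_bits, limit in range_limits():
    print(f"{subsequent_bytes} subsequent byte(s): {payload_bits} bits, limit {limit:#x}")
```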

I know I could remember the rules to derive the chart more easily than the chart itself. Here's hoping you're good at remembering rules too. :)

Update

Once you have built the chart above, you can convert input Unicode codepoints to UTF-8 by finding their range, converting from hexadecimal to binary, inserting the bits according to the rules above, then converting back to hex:

U+4E3E

This fits in the 0x00000800 - 0x0000FFFF range (0x0800 <= 0x4E3E <= 0xFFFF), so the representation will be of the form:

   1110xxxx 10xxxxxx 10xxxxxx

0x4E3E is 100111000111110b (15 bits). Drop the bits into the x positions above, starting from the right; any positions left over at the start will be filled with 0:

   1110x100 10111000 10111110

One x position is left over at the start; fill it in with 0:

   11100100 10111000 10111110

Convert the bits to hexadecimal:

   0xE4 0xB8 0xBE
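
The worked example can be replayed in Python with plain string slicing, which mirrors the pencil-and-paper steps. This is only a sketch: `encode_three_byte` is a hypothetical name, and it assumes the input already falls in the 0x0800 - 0xFFFF range. The final line cross-checks against Python's built-in `str.encode`.

```python
def encode_three_byte(cp: int) -> list:
    """Replay the manual steps for a code point in 0x0800-0xFFFF (hypothetical helper)."""
    if not 0x0800 <= cp <= 0xFFFF:
        raise ValueError("this sketch only handles the three-byte form")
    # Pad to 16 bits so the leftover x position is already filled with 0.
    bits = format(cp, "016b")
    # Split into 4 + 6 + 6 bits, matching 1110xxxx 10xxxxxx 10xxxxxx.
    first, second, third = bits[:4], bits[4:10], bits[10:]
    encoded = ["1110" + first, "10" + second, "10" + third]
    # Convert each 8-bit string back to hexadecimal.
    return [f"0x{int(b, 2):02X}" for b in encoded]


hex_bytes = encode_three_byte(0x4E3E)
print(" ".join(hex_bytes))

# Cross-check against the built-in codec.
assert bytes(int(h, 16) for h in hex_bytes) == chr(0x4E3E).encode("utf-8")
```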