Converting UTF-16 to UTF-8 in C

Updated: 2023-11-14 20:09:34

First, you did not show how your objects named char… are declared. You need to do all the calculations on 32-bit unsigned integers; anything narrower is not large enough to represent a code point beyond the BMP (Basic Multilingual Plane).

I did not check the UTF-16 part, but at least one thing is missing: there should be two different branches, one for UTF-16LE and another for UTF-16BE. In each case, you first check whether you are reading a surrogate pair and, if so, calculate your internal representation of the code point from the pair, in the form of an unsigned 32-bit integer. For big endian, the byte order inside every 16-bit word is flipped, including the words that make up a surrogate pair; the order of the two surrogates themselves does not change (the high surrogate always comes first). Any other code point is composed of a single 16-bit word, and its unsigned integer interpretation is arithmetically equal to the code point value. Please see:
https://en.wikipedia.org/wiki/Endianness,
https://en.wikipedia.org/wiki/UTF-16.
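To make the two branches concrete, here is a minimal sketch of the decoding step. The names `read_unit` and `utf16_decode` and the `big_endian` flag are my own, not taken from the question, and returning 0 on bad input is only one possible error policy.

```c
#include <stddef.h>
#include <stdint.h>

/* Read one 16-bit code unit from two bytes in the given byte order. */
uint16_t read_unit(const unsigned char *p, int big_endian)
{
    return big_endian ? (uint16_t)((p[0] << 8) | p[1])
                      : (uint16_t)((p[1] << 8) | p[0]);
}

/*
 * Decode one code point from the UTF-16 bytes at src (len bytes available).
 * Stores the code point in *cp and returns the number of bytes consumed
 * (2 or 4), or 0 if the input is truncated or a surrogate is unpaired.
 * Hypothetical helper: the error convention is an assumption.
 */
size_t utf16_decode(const unsigned char *src, size_t len,
                    int big_endian, uint32_t *cp)
{
    if (len < 2)
        return 0;

    uint32_t hi = read_unit(src, big_endian);
    if (hi < 0xD800 || hi > 0xDFFF) {
        /* Not a surrogate: the code unit is the code point itself. */
        *cp = hi;
        return 2;
    }

    /* A low surrogate here has no high surrogate before it. */
    if (hi > 0xDBFF || len < 4)
        return 0;

    uint32_t lo = read_unit(src + 2, big_endian);
    if (lo < 0xDC00 || lo > 0xDFFF)
        return 0;   /* high surrogate not followed by a low one */

    /* Combine the two 10-bit halves and add the 0x10000 offset. */
    *cp = 0x10000 + ((hi - 0xD800) << 10) + (lo - 0xDC00);
    return 4;
}
```

The only difference between the LE and BE branches is in `read_unit`; the surrogate arithmetic is the same once each code unit has been assembled in the right byte order.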

The goal of the first stage is to interpret the UTF-16 encoding character by character, with each character represented as a 32-bit unsigned value that is arithmetically equal to the code point. Here, you need to realize that Unicode code points are a mathematical abstraction representing cardinal values; they are abstracted from the bitwise representation of the data and from any kind of computer representation. They are just abstract mathematical values.

Now, UTF-8 is also a variable-width encoding. It uses a pretty cunning algorithm with very low redundancy. It is fully described, for example, here: https://en.wikipedia.org/wiki/UTF-8.

Just follow the algorithm description. I don't think it's anything too complicated.
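As a rough illustration of that algorithm, the second stage could look like the sketch below. The function name `utf8_encode` and its return convention are assumptions of mine; it takes a code point already held in a 32-bit unsigned integer.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Encode code point cp as UTF-8 into out, which must have room for 4 bytes.
 * Returns the number of bytes written (1 to 4), or 0 if cp is a surrogate
 * value or lies above U+10FFFF. Hypothetical helper for illustration.
 */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    if (cp < 0x80) {                        /* 0xxxxxxx */
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800) {                       /* 110xxxxx 10xxxxxx */
        out[0] = (unsigned char)(0xC0 | (cp >> 6));
        out[1] = (unsigned char)(0x80 | (cp & 0x3F));
        return 2;
    }
    if (cp < 0x10000) {                     /* 1110xxxx 10xxxxxx 10xxxxxx */
        if (cp >= 0xD800 && cp <= 0xDFFF)
            return 0;                       /* surrogates are not characters */
        out[0] = (unsigned char)(0xE0 | (cp >> 12));
        out[1] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[2] = (unsigned char)(0x80 | (cp & 0x3F));
        return 3;
    }
    if (cp <= 0x10FFFF) {                   /* 11110xxx and three 10xxxxxx */
        out[0] = (unsigned char)(0xF0 | (cp >> 18));
        out[1] = (unsigned char)(0x80 | ((cp >> 12) & 0x3F));
        out[2] = (unsigned char)(0x80 | ((cp >> 6) & 0x3F));
        out[3] = (unsigned char)(0x80 | (cp & 0x3F));
        return 4;
    }
    return 0;                               /* outside the Unicode range */
}
```

For example, U+20AC (the euro sign) comes out as the three bytes E2 82 AC.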

There is another optional feature of UTF-16 and UTF-8 streams: the BOM (byte order mark). This marker is optional, so you need to decide what to do with text that lacks it. You can refuse to process the text if the marker is not found, or you can provide another function where the expected encoding is specified. That is up to your design. Please see: http://unicode.org/faq/utf_bom.html.
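A BOM check might be sketched as follows. The enum and function names are hypothetical, and the sketch deliberately ignores UTF-32 (whose little-endian BOM also starts with FF FE), which is an assumption that only holds if UTF-32 input cannot occur.

```c
#include <stddef.h>

/* Hypothetical result type for this sketch. */
enum bom_kind { BOM_NONE, BOM_UTF8, BOM_UTF16LE, BOM_UTF16BE };

/*
 * Inspect the first bytes of buf and report which BOM, if any, is present.
 * *bom_len receives the number of bytes to skip (0, 2 or 3).
 * Assumption: the input is never UTF-32.
 */
enum bom_kind detect_bom(const unsigned char *buf, size_t len, size_t *bom_len)
{
    if (len >= 3 && buf[0] == 0xEF && buf[1] == 0xBB && buf[2] == 0xBF) {
        *bom_len = 3;
        return BOM_UTF8;
    }
    if (len >= 2 && buf[0] == 0xFF && buf[1] == 0xFE) {
        *bom_len = 2;
        return BOM_UTF16LE;
    }
    if (len >= 2 && buf[0] == 0xFE && buf[1] == 0xFF) {
        *bom_len = 2;
        return BOM_UTF16BE;
    }
    *bom_len = 0;
    return BOM_NONE;    /* caller decides: reject, or use a specified encoding */
}
```

When the result is `BOM_NONE`, the caller is exactly at the design decision described above: refuse the input, or fall back to an encoding passed in explicitly.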

And finally, one delicate point: both encodings can contain invalid sequences. In your particular problem, UTF-8 is never the source, so all the problems you may have are with UTF-16. If, for example, you face the second member of a surrogate pair before the first one is encountered, this is invalid data. If you have only one member of a surrogate pair surrounded by non-surrogate words, this is invalid data. So, you have to decide what to do with such cases; this is simply a decision you have to make as part of your design.
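One possible policy, shown here only as an example of such a decision, is to replace every unpaired surrogate with U+FFFD (the replacement character) before converting. The function name is made up, and the sketch assumes the code units have already been read into `uint16_t` values in the correct byte order.

```c
#include <stddef.h>
#include <stdint.h>

/*
 * Replace every unpaired surrogate in u[0..n-1] with U+FFFD, in place.
 * Returns the number of code units that were replaced.
 * Assumption: replacing is the chosen policy; rejecting the whole
 * input would be an equally valid design.
 */
size_t utf16_replace_unpaired(uint16_t *u, size_t n)
{
    size_t replaced = 0;

    for (size_t i = 0; i < n; i++) {
        if (u[i] >= 0xD800 && u[i] <= 0xDBFF) {
            /* High surrogate: valid only if a low surrogate follows. */
            if (i + 1 < n && u[i + 1] >= 0xDC00 && u[i + 1] <= 0xDFFF) {
                i++;                /* skip the low half of a valid pair */
            } else {
                u[i] = 0xFFFD;
                replaced++;
            }
        } else if (u[i] >= 0xDC00 && u[i] <= 0xDFFF) {
            /* Low surrogate with no high surrogate before it. */
            u[i] = 0xFFFD;
            replaced++;
        }
    }
    return replaced;
}
```

A non-zero return value tells the caller that the input was malformed, so it can still log or reject the data if silent replacement is not acceptable.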

I hope I did all you wanted: no code, but now you have all the ins and outs. Is it clear?

—SA