Updated: 2023-01-31 14:40:31
In UTF-8, each code point (= logical character) is represented by one or more code units (= char); ɑftənun, in particular, is:
ch| c.p. | c.u.
--+------+-------
ɑ | 0251 | c9 91
f | 0066 | 66
t | 0074 | 74
ə | 0259 | c9 99
n | 006e | 6e
u | 0075 | 75
n | 006e | 6e
(ch: character; c.p.: code point number; c.u.: code unit representation in UTF-8; c.p. and c.u. are expressed in hexadecimal)
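To reproduce the c.u. column yourself, a minimal sketch along these lines may help. The function name hex_code_units is made up for illustration, and the string is assumed to already hold UTF-8 bytes:

```cpp
#include <cassert>
#include <cstdio>
#include <string>

// Hypothetical helper: returns the code units (bytes) of s as
// space-separated, two-digit lowercase hex, like the c.u. column above.
std::string hex_code_units(const std::string &s) {
    std::string out;
    char buf[4];
    for (unsigned char cu : s) {
        std::snprintf(buf, sizeof buf, "%02x", static_cast<unsigned>(cu));
        if (!out.empty()) out += ' ';
        out += buf;
    }
    return out;
}
```

The literal for ɑ is best written as "\xc9\x91" and kept as a separate adjacent literal from any following letters, since a hex escape swallows every hex digit that follows it (the f in ɑf, for instance).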
The exact details of how code points are mapped to code units are explained in many places, such as the Wikipedia article on UTF-8.
If you print out each code unit on its own, you break the UTF-8 encoding of every code point that needs more than one code unit. For the first row, your terminal application sees
c9 0a
(the first code unit followed by a newline) and immediately detects that this is a broken UTF-8 sequence, as c9 has the high bit set but the next c.u. doesn't; hence the � character. The same holds for the second code unit (91 followed by a newline is a continuation byte with no lead byte before it), as well as for the c.u. of the sequence representing ə.
Now, if you want to print out full code points (not code units), std::string won't be of any help: std::string knows nothing about this stuff. It is essentially a glorified std::vector<char>, completely oblivious to encoding issues; all it does is store and index code units, not code points.
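A quick way to see the difference is to compare size() with a code point count. The sketch below assumes the string holds valid UTF-8, and count_code_points is an illustrative name, not a standard function; it counts every byte that is not of the form 10xxxxxx, since those are the bytes that continue a multi-byte sequence:

```cpp
#include <cassert>
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to be valid
// UTF-8 by counting every code unit that is NOT a continuation byte
// (continuation bytes have the bit pattern 10xxxxxx).
std::size_t count_code_points(const std::string &s) {
    std::size_t n = 0;
    for (unsigned char cu : s)
        if ((cu >> 6) != 2) ++n;
    return n;
}
```

For ɑftənun, size() reports 9 (code units) while this count gives 7 (code points), matching the table above.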
There are, however, third-party libraries that help with this; utf8-cpp is a small but complete one. In your case, the utf8::next function would be particularly helpful:
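    while (source >> word >> word_ipa) {
        auto cur = word_ipa.begin();
        auto end = word_ipa.end();
        auto next = cur;
        // Each pass handles one code point: [cur, next) spans its code units.
        for (; cur != end; cur = next) {
            utf8::next(next, end);  // move next past the current code point
            myfile << word << "is ";
            for (; cur != next; ++cur) myfile << *cur;
            myfile << "\n";
        }
    }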
utf8::next here just increments the given iterator so that it points to the code unit that starts the next code point; this code makes sure that all the code units making up a single code point are printed together.
Notice that we can reproduce its bare-bones behavior quite simply; it's just a matter of reading the UTF-8 specification (see the first table in the Wikipedia article linked above):
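    #include <cstddef>    // std::size_t
    #include <cstdint>    // std::uint8_t
    #include <iterator>   // std::distance, std::advance
    #include <stdexcept>  // std::logic_error

    // Advance it by n code units, refusing to run past end.
    template<typename ItT>
    void safe_advance(ItT &it, std::size_t n, ItT end) {
        std::size_t d = std::distance(it, end);
        if (n > d) throw std::logic_error("Truncated UTF-8 sequence");
        std::advance(it, n);
    }

    // Move it to the start of the next code point; it must not equal end.
    template<typename ItT>
    void my_next(ItT &it, ItT end) {
        std::uint8_t b = *it;  // lead byte of the sequence
        if ((b >> 7) == 0)       safe_advance(it, 1, end);  // 0xxxxxxx
        else if ((b >> 5) == 6)  safe_advance(it, 2, end);  // 110xxxxx
        else if ((b >> 4) == 14) safe_advance(it, 3, end);  // 1110xxxx
        else if ((b >> 3) == 30) safe_advance(it, 4, end);  // 11110xxx
        else throw std::logic_error("Invalid UTF-8 sequence");
    }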
Here we are exploiting the fact that the first byte of a sequence declares how many extra code units are going to follow to complete the code point.
(Note that this expects valid UTF-8 and makes no attempt to resynchronize after a broken UTF-8 sequence; the library version probably fares much better in this regard.)
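As a self-contained way to try the same lead-byte logic outside the file-writing loop, here is a rough sketch that splits a string into one std::string per code point. The names sequence_length and split_code_points are invented for this example, and the same caveat applies: it only checks the declared length against the remaining input and does not validate the continuation bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <stdexcept>
#include <string>
#include <vector>

// Hypothetical helper: number of code units announced by lead byte b.
std::size_t sequence_length(std::uint8_t b) {
    if ((b >> 7) == 0)  return 1;  // 0xxxxxxx
    if ((b >> 5) == 6)  return 2;  // 110xxxxx
    if ((b >> 4) == 14) return 3;  // 1110xxxx
    if ((b >> 3) == 30) return 4;  // 11110xxx
    throw std::logic_error("Invalid UTF-8 lead byte");
}

// Hypothetical helper: splits s into one string per code point, trusting
// the length declared by each lead byte; throws if the input ends early.
std::vector<std::string> split_code_points(const std::string &s) {
    std::vector<std::string> out;
    for (std::size_t i = 0; i < s.size();) {
        std::size_t n = sequence_length(static_cast<std::uint8_t>(s[i]));
        if (n > s.size() - i) throw std::logic_error("Truncated UTF-8 sequence");
        out.push_back(s.substr(i, n));
        i += n;
    }
    return out;
}
```

Each element of the result can then be written to the stream in one go, which keeps the code units of a code point together.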
OTOH, it's also possible to inline just what's necessary to keep the code units of the same code point together:
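    while (source >> word >> word_ipa) {
        auto cur = word_ipa.begin();
        auto end = word_ipa.end();
        while (cur != end) {
            // Print the first code unit of the code point.
            myfile << word << "is " << *cur;
            // High bit set: a multi-unit sequence, so print what follows too.
            if (uint8_t(*cur++) >> 7 != 0) {
                // Continuation code units look like 10xxxxxx.
                for (; cur != end && (uint8_t(*cur) >> 6) == 2; ++cur) myfile << *cur;
            }
            myfile << "\n";
        }
    }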
Here, instead, we completely disregard the "declared count" in the first c.u.; we just check whether its high bit is set. If it is, we go on printing for as long as we get c.u. whose top two bits are 10 (in binary, AKA 2 in decimal), since the continuation c.u. of a multi-c.u. UTF-8 sequence all follow this pattern.
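To make the trade-off with the lead-byte approach concrete, the same continuation-byte rule can be sketched as a standalone splitter. split_by_continuation is a name made up for this example; unlike a version that trusts the declared count, it never throws and simply groups a high-bit byte with whatever continuation bytes follow it:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: groups each code unit with the continuation bytes
// (10xxxxxx) that follow it, ignoring the count declared by the lead byte.
std::vector<std::string> split_by_continuation(const std::string &s) {
    std::vector<std::string> out;
    std::size_t i = 0;
    while (i < s.size()) {
        std::size_t start = i++;
        // Only a byte with the high bit set can start a multi-unit sequence.
        if ((static_cast<std::uint8_t>(s[start]) >> 7) != 0) {
            while (i < s.size() && (static_cast<std::uint8_t>(s[i]) >> 6) == 2) ++i;
        }
        out.push_back(s.substr(start, i - start));
    }
    return out;
}
```

On valid input this yields one group per code point; on a truncated sequence such as a lone c9 it returns that byte as its own group instead of reporting an error, so it is more forgiving but also less strict.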