Wrong results for each character in a string

Updated: 2023-01-31 14:40:31

In UTF-8 each code point (= logical character) is represented by one or more code units (= char); ɑftənun, in particular, is:

ch| c.p. | c.u.
--+------+-------
ɑ | 0251 | c9 91
f | 0066 | 66
t | 0074 | 74
ə | 0259 | c9 99
n | 006e | 6e
u | 0075 | 75
n | 006e | 6e

(ch: character; c.p.: code point number; c.u.: code unit representation in UTF-8; c.p. and c.u. are expressed in hexadecimal)

The exact details of how code points are mapped to code units are explained in many places; the very basics are:

code points up to 0x7f are mapped straight to a single code unit; for these, the high bit is never set;
code points from 0x80 onwards are mapped to multiple code units; all the code units in a multi-code-unit sequence have the high bit set;
if the high bit is set, the top bits have a particular meaning; in the first byte of a multibyte sequence they tell how many continuation bytes are to be expected, in the others they are unambiguously marked as continuation bytes.
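
The three rules above can be sketched as a small classifier that looks only at the top bits of one code unit. This is a minimal illustration, not library code: the names `Utf8ByteKind` and `classify_utf8_byte` are made up for this example, and the function checks bit patterns only (it does not reject lead bytes such as c0 or c1 that never occur in valid UTF-8).

```cpp
#include <cstdint>

// Hypothetical names, introduced only for this illustration.
enum class Utf8ByteKind { Ascii, Lead2, Lead3, Lead4, Continuation, Invalid };

// Classify a single UTF-8 code unit by its top bits.
inline Utf8ByteKind classify_utf8_byte(char c) {
    const std::uint8_t b = static_cast<std::uint8_t>(c);
    if ((b >> 7) == 0x00) return Utf8ByteKind::Ascii;        // 0xxxxxxx: whole code point
    if ((b >> 6) == 0x02) return Utf8ByteKind::Continuation; // 10xxxxxx: continuation byte
    if ((b >> 5) == 0x06) return Utf8ByteKind::Lead2;        // 110xxxxx: 1 continuation follows
    if ((b >> 4) == 0x0e) return Utf8ByteKind::Lead3;        // 1110xxxx: 2 continuations follow
    if ((b >> 3) == 0x1e) return Utf8ByteKind::Lead4;        // 11110xxx: 3 continuations follow
    return Utf8ByteKind::Invalid;                            // 11111xxx: never valid
}
```

For the table above, c9 classifies as `Lead2`, 91 and 99 as `Continuation`, and 66 (f) as `Ascii`.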

If you print out each code unit on its own you are breaking the UTF-8 encoding for the code points that require more than one code unit to be expressed. Your terminal application in the first row sees

c9 0a

(the first code unit followed by a newline), and immediately detects that this is a broken UTF-8 sequence, as c9 has the high bit set but the next c.u. doesn't; hence the � character. The same holds for the second code unit of ɑ (91, a continuation byte with no lead byte before it), as well as for both code units of the sequence representing ə.
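
To make the terminal's reasoning concrete, here is a rough sketch of the kind of structural check a decoder might perform. The function name `has_valid_utf8_structure` is hypothetical, and the check is deliberately incomplete: it only verifies that each lead byte is followed by the number of continuation bytes it announces, and does not reject overlong encodings or surrogates.

```cpp
#include <cstddef>
#include <cstdint>
#include <string>

// Hypothetical helper: structural UTF-8 check only (lead byte + expected continuations).
inline bool has_valid_utf8_structure(const std::string& s) {
    std::size_t i = 0;
    while (i < s.size()) {
        const std::uint8_t b = static_cast<std::uint8_t>(s[i]);
        std::size_t extra = 0;                      // continuation bytes announced by the lead byte
        if ((b >> 7) == 0x00) extra = 0;            // 0xxxxxxx
        else if ((b >> 5) == 0x06) extra = 1;       // 110xxxxx
        else if ((b >> 4) == 0x0e) extra = 2;       // 1110xxxx
        else if ((b >> 3) == 0x1e) extra = 3;       // 11110xxx
        else return false;                          // stray continuation byte or invalid lead
        if (extra > s.size() - i - 1) return false; // sequence truncated by end of input
        for (std::size_t k = 1; k <= extra; ++k) {
            // Every announced byte must look like 10xxxxxx.
            if ((static_cast<std::uint8_t>(s[i + k]) >> 6) != 0x02) return false;
        }
        i += extra + 1;
    }
    return true;
}
```

With this sketch, the line `c9 0a` fails because 0a is not a continuation byte, `91 0a` fails because 91 cannot start a sequence, and `c9 91 0a` passes.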

Now, if you want to print out full code-points (not code-units), std::string won't be of any help - std::string knows nothing about this stuff, it is essentially a glorified std::vector<char>, completely oblivious of encoding issues; all it does is to store/index code units, not code points.
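
A quick way to see the difference is to compare `std::string::size()` (which counts code units) with a code point count. The helper below is a sketch with a made-up name, `count_code_points`; it assumes the input is valid UTF-8 and simply counts every byte that is not a continuation byte.

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: counts code points in a string assumed to hold valid UTF-8.
inline std::size_t count_code_points(const std::string& s) {
    std::size_t n = 0;
    for (unsigned char b : s) {
        // Continuation bytes look like 10xxxxxx; every other byte starts a code point.
        if ((b >> 6) != 0x02) ++n;
    }
    return n;
}
```

For ɑftənun, written byte by byte as `"\xc9\x91" "ft" "\xc9\x99" "nun"` (the literal is split so that `f` is not swallowed by the preceding hex escape), `size()` is 9 while `count_code_points` returns 7.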

There are however third party libraries to help work with this; utf8-cpp is a small but complete one; in your case, the utf8::next function would be particularly helpful:

while (source >> word >> word_ipa) {
    auto cur = word_ipa.begin();
    auto end = word_ipa.end();
    auto next = cur;
    for(;cur!=end; cur=next) {
        utf8::next(next, end);
        myfile << word << "is ";
        for(; cur!=next; ++cur) myfile<<*cur;
        myfile << "\n";
    }
}

utf8::next here just advances the given iterator so that it points to the code unit that starts the next code point; this code makes sure that we print together all the code units that make up a single code point.

Notice that we can reproduce its barebones behavior quite simply; it's just a matter of reading the UTF-8 specs (see the first table in the link to Wikipedia above):

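#include <cstddef>
#include <cstdint>
#include <iterator>
#include <stdexcept>

// Advance `it` by n code units, refusing to run past `end`.
template<typename ItT>
void safe_advance(ItT &it, size_t n, ItT end) {
    size_t d = std::distance(it, end);
    if(n>d) throw std::logic_error("Truncated UTF-8 sequence");
    std::advance(it, n);
}


// Move `it` to the start of the next code point; the caller must ensure it != end.
template<typename ItT>
void my_next(ItT &it, ItT end) {
    uint8_t b = *it;
    if(b>>7 == 0) safe_advance(it, 1, end);        // 0xxxxxxx: 1 code unit
    else if(b>>5 == 6) safe_advance(it, 2, end);   // 110xxxxx: 2 code units
    else if(b>>4 == 14) safe_advance(it, 3, end);  // 1110xxxx: 3 code units
    else if(b>>3 == 30) safe_advance(it, 4, end);  // 11110xxx: 4 code units
    else throw std::logic_error("Invalid UTF-8 sequence");
}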

Here we are exploiting the fact that the first byte of a sequence declares how many extra code units are going to follow to complete the code point.

(notice that this expects valid UTF-8 and makes no attempt to resynchronize a broken UTF-8 sequence; the library version probably fares way better in this regard)

OTOH, it's also possible to inline just what's necessary to keep the code units of the same code point together:

while (source >> word >> word_ipa) {
    auto cur = word_ipa.begin();
    auto end = word_ipa.end();
    for(;cur!=end;) {
        myfile << word << "is "<<*cur;
        if(uint8_t(*cur++)>>7 != 0) {
            for(; cur!=end && (uint8_t(*cur)>>6)==2; ++cur) myfile<<*cur;
        }
        myfile << "\n";
    }
}

Here, instead, we completely disregard the "declared count" in the first c.u. and just check whether its high bit is set; if so, we go on printing as long as we get c.u. with the top two bits set to 10 (in binary, i.e. 2 in decimal), since the continuation c.u. of a multi-c.u. UTF-8 sequence all follow this pattern.
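
The same continuation-byte scan can be pulled out of the printing loop into a reusable function. The sketch below uses a hypothetical name, `split_code_points`, and returns one `std::string` per code point; like the loop above, it ignores the declared count and does not validate the input.

```cpp
#include <cstdint>
#include <string>
#include <vector>

// Hypothetical helper: split a UTF-8 string into one std::string per code point,
// gluing continuation bytes (10xxxxxx) to the lead byte that precedes them.
inline std::vector<std::string> split_code_points(const std::string& s) {
    std::vector<std::string> out;
    auto cur = s.begin();
    const auto end = s.end();
    while (cur != end) {
        // Only a byte with the high bit set can start a multi-byte sequence.
        const bool multi = (static_cast<std::uint8_t>(*cur) >> 7) != 0;
        const auto start = cur++;
        if (multi) {
            while (cur != end && (static_cast<std::uint8_t>(*cur) >> 6) == 0x02) ++cur;
        }
        out.emplace_back(start, cur);
    }
    return out;
}
```

Each element of the result can then be written to the output stream in one piece, for example `myfile << word << " is " << part << "\n";` inside a loop over the returned vector.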
