Detecting whether a string has been double-encoded in UTF-8

Updated: 2023-11-27 11:52:28

In principle you can't, especially allowing for cat-garbage.

You don't say what the original character encoding of the data was before it was UTF-8 encoded once or twice. I'll assume CP1251 (or at least that CP1251 is one of the possibilities), because it's quite a tricky case.

Take a non-ASCII character. UTF-8 encode it. You get some bytes, and all those bytes are valid characters in CP1251 unless one of them happens to be 0x98, the only hole in CP1251.

So, if you convert those bytes from CP1251 to UTF-8, the result is exactly the same as if you'd correctly UTF-8 encoded a CP1251 string consisting of those Russian characters. There's no way to tell whether the result is from incorrectly double-encoding one character or correctly single-encoding two characters.
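A minimal Python sketch of this ambiguity, using 'Ж' as an arbitrary example character (any non-ASCII Cyrillic letter would do):

```python
# Double-encode one Cyrillic character via a CP1251 misinterpretation.
orig = 'Ж'                            # U+0416
once = orig.encode('utf-8')           # b'\xd0\x96'
misread = once.decode('cp1251')       # 'Р–' : both bytes are valid CP1251 characters
double = misread.encode('utf-8')      # the double-encoded result

# The same bytes are also the *correct* single encoding of the
# two-character CP1251 string 'Р–', so the two cases are indistinguishable.
print(double == 'Р–'.encode('utf-8'))   # True
```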

If you have some control over the original data, you could put a BOM at the start of it. Then when it comes back to you, inspect the initial bytes to see whether you have a UTF-8 BOM, or the result of incorrectly double-encoding a BOM. But I guess you probably don't have that kind of control over the original text.
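A sketch of that BOM inspection, assuming CP1251 as the faulty intermediate encoding (the double-encoded BOM bytes would differ for another intermediate):

```python
UTF8_BOM = b'\xef\xbb\xbf'

# What the BOM becomes if its bytes are misread as CP1251 and re-encoded
# as UTF-8: 0xEF -> 'п', 0xBB -> '»', 0xBF -> 'ї'.
DOUBLE_BOM = UTF8_BOM.decode('cp1251').encode('utf-8')  # b'\xd0\xbf\xc2\xbb\xd1\x97'

def bom_status(data: bytes) -> str:
    """Classify the start of the data by which BOM form it carries."""
    if data.startswith(DOUBLE_BOM):
        return 'double-encoded'
    if data.startswith(UTF8_BOM):
        return 'single-encoded'
    return 'no BOM'
```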

In practice you can guess - UTF-8 decode it and then:

(a) look at the character frequencies, character pair frequencies, numbers of non-printable characters. This might allow you to tentatively declare it nonsense, and hence possibly double-encoded. With enough non-printable characters it may be so nonsensical that you couldn't realistically type it even by mashing at the keyboard, unless maybe your ALT key was stuck.
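One crude way to implement part of heuristic (a) is to count control and otherwise unassigned characters after decoding; the 10% threshold below is an arbitrary assumption, and a real classifier would also look at character and pair frequencies:

```python
import unicodedata

def looks_like_nonsense(text: str, threshold: float = 0.10) -> bool:
    """Crude heuristic: flag text whose proportion of control/unassigned
    characters (Unicode general category 'C*') exceeds the threshold."""
    if not text:
        return False
    suspicious = sum(1 for ch in text
                     if unicodedata.category(ch).startswith('C'))
    return suspicious / len(text) > threshold
```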

(b) attempt the second decode. That is, starting from the Unicode code points that you got by decoding your UTF-8 data, first encode it to CP1251 (or whatever) and then decode the result from UTF-8. If either step fails (due to invalid sequences of bytes), then it definitely wasn't double-encoded, at least not using CP1251 as the faulty interpretation.
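Heuristic (b) can be sketched in Python, with CP1251 standing in for whichever legacy encoding you suspect was the faulty interpretation:

```python
def undo_double_utf8(data: bytes, intermediate: str = 'cp1251'):
    """Return the repaired text if `data` could be double-encoded UTF-8,
    or None if either step of the reverse transformation fails."""
    text = data.decode('utf-8')          # raises if the data isn't UTF-8 at all
    try:
        return text.encode(intermediate).decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        return None   # definitely not double-encoded via `intermediate`
```

Feeding it a double-encoded 'Привет' recovers 'Привет', while correctly single-encoded Russian fails the final decode and returns None. Note that pure ASCII survives both paths unchanged, so a non-None result only means "could be double-encoded", not "definitely was".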

This is more or less what you do if you have some bytes that might be UTF-8 or might be CP1251, and you don't know which.

You'll get some false positives for single-encoded cat-garbage indistinguishable from double-encoded data, and maybe a very few false negatives for data that was double-encoded but that after the first encode by fluke still looked like Russian.

If your original encoding has more holes in it than CP1251 then you'll have fewer false negatives.

Character encodings are hard.