且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

使用C#检测文本文件的编码

更新时间:2022-04-22 22:28:25

检测编码始终是一项棘手的事情,但是检测BOM仍然非常简单.要将BOM作为字节数组获取,只需使用编码对象的 GetPreamble()函数.这样一来,您就可以通过前导码检测整个编码范围.

Detecting encoding is always a tricky business, but detecting BOMs is dead simple. To get the BOM as byte array, just use the GetPreamble() function of the encoding objects. This should allow you to detect a whole range of encodings by preamble.

现在,要检测没有前导码的UTF-8,实际上也不是很困难.请参见,UTF8 对于在有效序列中期望的值具有严格的按位规则,您可以初始化UTF8Encoding对象在这些序列为错误.

Now, as for detecting UTF-8 without preamble, actually that's not very hard either. See, UTF8 has strict bitwise rules about what values are expected in a valid sequence, and you can initialize a UTF8Encoding object in a way that will fail by throwing an exception when these sequences are incorrect.

因此,如果您先执行BOM表检查,然后进行严格的解码检查,最后又退回到Win-1252编码(您称其为"ANSI"),那么您的检测就完成了.

So if you first do the BOM check, and then the strict decoding check, and finally fall back to Win-1252 encoding (what you call "ANSI") then your detection is done.

Byte[] bytes = File.ReadAllBytes(filename);
Encoding encoding = null;
String text = null;
// Test UTF8 with BOM. This check can easily be copied and adapted
// to detect many other encodings that use BOMs.
UTF8Encoding encUtf8Bom = new UTF8Encoding(true, true);
Boolean couldBeUtf8 = true;
Byte[] preamble = encUtf8Bom.GetPreamble();
Int32 prLen = preamble.Length;
if (bytes.Length >= prLen && preamble.SequenceEqual(bytes.Take(prLen)))
{
    // UTF8 BOM found; use encUtf8Bom to decode.
    try
    {
        // Seems that despite being an encoding with preamble,
        // it doesn't actually skip said preamble when decoding...
        text = encUtf8Bom.GetString(bytes, prLen, bytes.Length - prLen);
        encoding = encUtf8Bom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
        couldBeUtf8 = false;
    }
}
// use boolean to skip this if it's already confirmed as incorrect UTF-8 decoding.
if (couldBeUtf8 && encoding == null)
{
    // test UTF-8 on strict encoding rules. Note that on pure ASCII this will
    // succeed as well, since valid ASCII is automatically valid UTF-8.
    UTF8Encoding encUtf8NoBom = new UTF8Encoding(false, true);
    try
    {
        text = encUtf8NoBom.GetString(bytes);
        encoding = encUtf8NoBom;
    }
    catch (ArgumentException)
    {
        // Confirmed as not UTF-8!
    }
}
// fall back to default ANSI encoding.
if (encoding == null)
{
    encoding = Encoding.GetEncoding(1252);
    text = encoding.GetString(bytes);
}

请注意,Windows-1252(美国/西欧ANSI)是每个字符一个字节的编码,这意味着其中的所有内容都会产生技术上有效的字符,因此

Note that Windows-1252 (US / Western European ANSI) is a one-byte-per-character encoding, meaning everything in it produces a technically valid character, so unless you go for heuristic methods, no further detection can be done on it to distinguish it from other one-byte-per-character encodings.