How to detect file encoding?

Updated: 2023-02-26 14:02:26

UTF-16 encoded files should always contain a byte order mark (BOM; see "Byte order mark" on Wikipedia). With UTF-8 files the BOM is optional.

So check for a BOM first. If there is none, you can check whether the content is valid UTF-8.
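The BOM check can be sketched as below. This is a minimal illustration, not a complete detector: `detect_bom` is a hypothetical helper name, it assumes the first bytes of the file have already been read into a buffer, and it only looks for the UTF-8 and UTF-16 marks discussed here (a UTF-32 little-endian file also starts with FF FE and would be reported as UTF-16LE).

```cpp
#include <cstddef>
#include <string>

// Hypothetical helper: returns the encoding implied by a leading BOM,
// or an empty string when no UTF-8/UTF-16 BOM is present.
// Note: UTF-32 BOMs are not handled in this sketch.
std::string detect_bom(const unsigned char* data, std::size_t size) {
    // UTF-8 BOM: EF BB BF
    if (size >= 3 && data[0] == 0xEF && data[1] == 0xBB && data[2] == 0xBF) {
        return "UTF-8";
    }
    // UTF-16 little-endian BOM: FF FE
    if (size >= 2 && data[0] == 0xFF && data[1] == 0xFE) {
        return "UTF-16LE";
    }
    // UTF-16 big-endian BOM: FE FF
    if (size >= 2 && data[0] == 0xFE && data[1] == 0xFF) {
        return "UTF-16BE";
    }
    return "";
}
```

An empty result means "no BOM found", which is the case where the UTF-8 validity check comes into play.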

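On Windows you can use the MultiByteToWideChar function to do that (you will probably have to call it anyway to convert the UTF-8 text to UTF-16, which is what Windows uses). Pass CP_UTF8 as the code page together with the MB_ERR_INVALID_CHARS flag so that the call fails on invalid input.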

Another option is the ICU converter library (see "Using Converters" in the ICU User Guide).

There are also projects that provide converters and check functions, such as UTF8-CPP ("UTF-8 with C++ in a Portable Way").

Or write your own check based on the allowed code points. I once found a sample implementation based on the Unicode recommendations, but I can no longer find it.
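A hand-written check might look roughly like the following. It is not the sample implementation mentioned above, only a sketch under the usual well-formedness rules: `is_valid_utf8` is a hypothetical name, and the function rejects stray continuation bytes, truncated sequences, overlong forms, surrogates (U+D800 to U+DFFF) and values above U+10FFFF.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if the buffer is well-formed UTF-8.
bool is_valid_utf8(const unsigned char* s, std::size_t n) {
    std::size_t i = 0;
    while (i < n) {
        const unsigned char lead = s[i];
        if (lead < 0x80) {            // single-byte (ASCII) character
            ++i;
            continue;
        }
        std::size_t len = 0;          // total length of the sequence
        unsigned long cp = 0;         // decoded code point
        if (lead >= 0xC2 && lead <= 0xDF) {
            len = 2; cp = lead & 0x1F;
        } else if (lead >= 0xE0 && lead <= 0xEF) {
            len = 3; cp = lead & 0x0F;
        } else if (lead >= 0xF0 && lead <= 0xF4) {
            len = 4; cp = lead & 0x07;
        } else {
            // 80..BF (stray continuation), C0/C1 (overlong), F5..FF
            return false;
        }
        if (n - i < len) {
            return false;             // sequence is truncated
        }
        for (std::size_t k = 1; k < len; ++k) {
            const unsigned char c = s[i + k];
            if ((c & 0xC0) != 0x80) {
                return false;         // not a continuation byte
            }
            cp = (cp << 6) | (c & 0x3F);
        }
        if (len == 3 && cp < 0x800) return false;         // overlong 3-byte form
        if (len == 4 && cp < 0x10000) return false;       // overlong 4-byte form
        if (cp >= 0xD800 && cp <= 0xDFFF) return false;   // UTF-16 surrogate
        if (cp > 0x10FFFF) return false;                  // beyond Unicode range
        i += len;
    }
    return true;
}
```

If the file starts with a UTF-8 BOM, those three bytes are themselves a valid sequence (U+FEFF), so the buffer can be passed in unchanged.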

Note that every such check returns true (valid UTF-8) for plain ASCII files, so it may be necessary to check first for bytes >= 0x80.
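That pre-check is a simple scan. A minimal sketch, with `has_non_ascii` as a hypothetical helper name: if it returns false, the file is plain ASCII and a successful UTF-8 validation says nothing about the intended encoding.

```cpp
#include <cstddef>

// Hypothetical helper: returns true if any byte has its high bit set (>= 0x80).
bool has_non_ascii(const unsigned char* s, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        if (s[i] >= 0x80) {
            return true;
        }
    }
    return false;
}
```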


See "Handling simple text files in C/C++". It shows how to identify the encoding from the first few bytes of the file.