What is the most accurate encoding detector?

Updated: 2023-02-26 17:38:04

I checked juniversalchardet and ICU4J on some CSV files, and the results were inconsistent. juniversalchardet had better results:

UTF-8: both detected it.
Windows-1255: juniversalchardet detected it once the file had enough Hebrew letters, while ICU4J still thought it was ISO-8859-1. With even more Hebrew letters, ICU4J detected it as ISO-8859-8, which is the other Hebrew encoding (and so the text was OK).
SHIFT_JIS (Japanese): juniversalchardet detected it, and ICU4J thought it was ISO-8859-2.
ISO-8859-1: detected by ICU4J; not supported by juniversalchardet.

So you should consider which encodings you will most likely have to deal with. In the end I chose ICU4J.

Note that ICU4J is still maintained.

Also note that you can use ICU4J first and, if it returns null because detection failed, fall back to juniversalchardet. Or the other way around.
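
A minimal sketch of that fallback idea, using only the JDK so it runs as-is. The class `DetectorChain` and both stand-in detectors are hypothetical names made up for illustration. In a real setup the first entry would presumably wrap ICU4J's `CharsetDetector` (`setText`, `detect`, then `CharsetMatch.getName`) and the second juniversalchardet's `UniversalDetector` (`handleData`, `dataEnd`, `getDetectedCharset`), each returning null when it has no answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.CodingErrorAction;
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.function.Function;

/** Hypothetical helper: tries encoding detectors in order until one answers. */
final class DetectorChain {

    /** Returns the first non-null charset name, or null if every detector gives up. */
    static String detect(byte[] data, List<Function<byte[], String>> detectors) {
        for (Function<byte[], String> detector : detectors) {
            String name = detector.apply(data);
            if (name != null) {
                return name;
            }
        }
        return null;
    }

    /** Stand-in detector: answers "UTF-8" only if the bytes decode strictly as UTF-8. */
    static String strictUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder()
                    .onMalformedInput(CodingErrorAction.REPORT)
                    .onUnmappableCharacter(CodingErrorAction.REPORT)
                    .decode(ByteBuffer.wrap(data));
            return "UTF-8";
        } catch (CharacterCodingException e) {
            return null; // not valid UTF-8, so let the next detector try
        }
    }

    /** Stand-in last resort: any byte sequence is valid ISO-8859-1. */
    static String latin1Fallback(byte[] data) {
        return "ISO-8859-1";
    }
}
```

Calling `DetectorChain.detect(bytes, List.of(DetectorChain::strictUtf8, DetectorChain::latin1Fallback))` tries the detectors in list order, so swapping the two entries gives the "opposite" ordering mentioned above.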

Apache Tika's AutoDetectReader does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
