

What is the most accurate encoding detector?

Updated: 2023-02-26 17:38:04

I tested juniversalchardet and ICU4J on some CSV files, and the results were inconsistent. juniversalchardet had better results:


  • UTF-8: both detected it.
  • Windows-1255: juniversalchardet detected it once the text had enough Hebrew letters, while ICU4J still thought it was ISO-8859-1. With even more Hebrew letters, ICU4J detected it as ISO-8859-8, which is the other Hebrew encoding (and so the text was OK).
  • SHIFT_JIS (Japanese): juniversalchardet detected it; ICU4J thought it was ISO-8859-2.
  • ISO-8859-1: detected by ICU4J; not supported by juniversalchardet.
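
One reason for mix-ups like the Windows-1255 / ISO-8859-1 case above is that single-byte encodings accept almost any byte sequence, so a detector has to guess from letter statistics instead of validity. The sketch below uses only the Java standard library (not ICU4J or juniversalchardet) to illustrate this. The class name `AmbiguousBytesDemo` is made up for this example, and the four sample bytes are assumed to spell the Hebrew word "shalom" in Windows-1255.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.StandardCharsets;

class AmbiguousBytesDemo {
    // Assumption: these bytes are the Windows-1255 encoding of the
    // Hebrew word "shalom" (shin, lamed, vav, final mem).
    static final byte[] HEBREW_BYTES = {
        (byte) 0xF9, (byte) 0xEC, (byte) 0xE5, (byte) 0xED
    };

    // Strict UTF-8 check: a fresh decoder reports malformed input
    // by throwing instead of silently replacing it.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException e) {
            return false;
        }
    }

    // ISO-8859-1 maps every byte value to a character, so decoding
    // never fails, even when the result is not the intended text.
    static String decodeLatin1(byte[] data) {
        return new String(data, StandardCharsets.ISO_8859_1);
    }

    public static void main(String[] args) {
        System.out.println("valid UTF-8: " + isValidUtf8(HEBREW_BYTES));
        System.out.println("as ISO-8859-1: " + decodeLatin1(HEBREW_BYTES));
    }
}
```

The same bytes are rejected as UTF-8 but decode without error as ISO-8859-1 (into accented Latin letters), so a detector can only tell Hebrew from Western European text by how plausible the resulting letters are. That fits the observation that more Hebrew letters improved detection.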

So you should consider which encodings you will most likely have to deal with. In the end I chose ICU4J.

Note that ICU4J is still maintained.

Also note that you can combine the two: use ICU4J first and, if it returns null because detection did not succeed, try juniversalchardet. Or the other way around.

Apache Tika's AutoDetectReader does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
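
The "try one detector, fall back to the next" idea can be sketched with the standard library alone. This is not Tika's implementation: `DetectorChain`, `Detector`, `BOM`, and `STRICT_UTF8` are hypothetical names, and the two stand-in detectors are deliberately trivial. In a real chain each `Detector` would presumably wrap ICU4J or juniversalchardet and return an empty result where the library gives no answer.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.Arrays;
import java.util.List;
import java.util.Optional;

class DetectorChain {
    // Hypothetical interface: empty result means "could not decide".
    interface Detector {
        Optional<Charset> detect(byte[] data);
    }

    // Stand-in detector 1: recognises only a UTF-8 byte order mark (EF BB BF).
    static final Detector BOM = data -> {
        if (data.length >= 3
                && (data[0] & 0xFF) == 0xEF
                && (data[1] & 0xFF) == 0xBB
                && (data[2] & 0xFF) == 0xBF) {
            return Optional.of(StandardCharsets.UTF_8);
        }
        return Optional.empty();
    };

    // Stand-in detector 2: accepts the input if it decodes as strict UTF-8.
    static final Detector STRICT_UTF8 = data -> {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return Optional.of(StandardCharsets.UTF_8);
        } catch (CharacterCodingException e) {
            return Optional.empty();
        }
    };

    static final List<Detector> DEFAULT_CHAIN = Arrays.asList(BOM, STRICT_UTF8);

    // Ask each detector in order; the first one that answers wins.
    // If none answers, return the caller-supplied fallback charset.
    static Charset detect(byte[] data, List<Detector> chain, Charset fallback) {
        for (Detector detector : chain) {
            Optional<Charset> result = detector.detect(data);
            if (result.isPresent()) {
                return result.get();
            }
        }
        return fallback;
    }
}
```

A call such as `DetectorChain.detect(bytes, DetectorChain.DEFAULT_CHAIN, StandardCharsets.ISO_8859_1)` returns the first answer in the chain, or ISO-8859-1 when no detector decides. The order of the list is the design choice that matters: put the detector you trust most for your expected encodings first.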