Updated: 2023-02-26 17:38:04
I checked juniversalchardet and ICU4J on some CSV files, and the results were inconsistent: juniversalchardet had better results.
So you should consider which encodings you will most likely have to deal with. In the end I chose ICU4J.
Note that ICU4J is still maintained.
Also note that you may want to try ICU4J first and, if it returns null because detection did not succeed, fall back to juniversalchardet. Or the other way around.
Apache Tika's AutoDetectReader does exactly this: it first tries HtmlEncodingDetector, then UniversalEncodingDetector (which is based on juniversalchardet), and then Icu4jEncodingDetector (based on ICU4J).
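The try-one-detector-then-fall-back idea can be sketched roughly as follows. This is only a minimal illustration using the standard library: the class name `FallbackCharsetDetection`, the method `detectCharset`, and the two lambda detectors are hypothetical stand-ins, not part of ICU4J, juniversalchardet, or Tika. It assumes each detector returns `null` when it cannot decide.

```java
import java.nio.charset.StandardCharsets;
import java.util.List;
import java.util.Optional;
import java.util.function.Function;

final class FallbackCharsetDetection {

    /**
     * Runs each detector in order and returns the first non-null charset name.
     * Assumption: a detector signals "could not decide" by returning null.
     */
    static Optional<String> detectCharset(byte[] data,
                                          List<Function<byte[], String>> detectors) {
        for (Function<byte[], String> detector : detectors) {
            String name = detector.apply(data);
            if (name != null) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Hypothetical stand-ins for real detectors: the first one always
        // gives up, the second one always answers "UTF-8".
        Function<byte[], String> primary = bytes -> null;
        Function<byte[], String> secondary = bytes -> "UTF-8";

        byte[] sample = "a,b,c\n1,2,3\n".getBytes(StandardCharsets.UTF_8);
        List<Function<byte[], String>> chain = List.of(primary, secondary);

        // The primary detector returns null, so the secondary one is consulted.
        System.out.println(detectCharset(sample, chain).orElse("unknown"));
    }
}
```

With the real libraries, the first function would presumably wrap ICU4J's `CharsetDetector` (`setText`, `detect`, then `getName` on the returned `CharsetMatch`, if any) and the second would wrap juniversalchardet's `UniversalDetector` (`handleData`, `dataEnd`, `getDetectedCharset`). Swapping the order of the list gives "the other way around".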