Detecting the encoding of a wrongly encoded UTF-8 text file

Updated: 2023-02-20 10:56:40

I eventually figured it out. CharsetNormalizerMatches from the charset_normalizer package detects the encoding properly. This is how I implemented it, and it works like a charm: it correctly detects gb18030 for the file in question:

from charset_normalizer import CharsetNormalizerMatches as CnM
# `path` is the location of the file to inspect; best().first() picks the top-ranked match.
encoding = CnM.from_path(path).best().first().encoding
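For readers on a newer release of charset_normalizer, here is a minimal sketch of the same idea. It assumes version 2.x or later, where `from_path` is available at the top level of the package. As far as I know, the `CharsetNormalizerMatches` alias was dropped in 3.0, so the one-liner above may fail there. The names `detect_encoding` and `FALLBACK_CANDIDATES` are hypothetical. The standard-library fallback is a much cruder technique (try a few encodings and keep the first one that decodes cleanly) and is not what the library does internally.

```python
from pathlib import Path
from typing import Optional

try:
    # Assumption: charset_normalizer 2.x or later, which exposes from_path
    # at the top level of the package.
    from charset_normalizer import from_path
except ImportError:  # library not installed; only the fallback below is used
    from_path = None

# Hypothetical candidate list for the stdlib-only fallback; adjust as needed.
FALLBACK_CANDIDATES = ("utf-8", "gb18030")


def detect_encoding(path: str) -> Optional[str]:
    """Return a best-guess encoding name for the file, or None if unknown."""
    if from_path is not None:
        match = from_path(path).best()  # None when no plausible match exists
        return match.encoding if match is not None else None

    # Fallback: return the first candidate that decodes the bytes strictly.
    data = Path(path).read_bytes()
    for name in FALLBACK_CANDIDATES:
        try:
            data.decode(name)
        except UnicodeDecodeError:
            continue
        return name
    return None
```

Typical use would be `encoding = detect_encoding(path)` followed by `open(path, encoding=encoding)`, after checking for `None`. The detected name is a best guess, so closely related encodings such as gbk and gb18030 may be reported interchangeably.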

Note: This answer was hinted to me by someone who suggested using CharsetNormalizerMatches but later deleted their post here. Too bad, I would have loved to give them the credit.