且构网

Python can't open a UTF-8-encoded text file

Updated: 2023-02-20 10:30:19

First, this text is definitely not UTF-8, which is why Python can't open it as a UTF-8-encoded text file.
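To see why the UTF-8 attempt fails, here is a minimal sketch. The byte string is a made-up sample ("Hi" as UTF-16-LE with a BOM), not your actual file contents:

```python
# Hypothetical sample: "Hi" encoded as UTF-16-LE, preceded by the BOM bytes FF FE.
data = b"\xff\xfeH\x00i\x00"

try:
    data.decode("utf-8")
    failed = False
except UnicodeDecodeError as exc:
    # 0xFF can never appear in valid UTF-8, so decoding stops at the first byte.
    failed = True
    print(exc)
```

Any file that starts with those two bytes will be rejected by the UTF-8 codec in the same way.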

Second, you say you "also tried utf-16-be and utf-16-le", but you didn't show how you did that, and I suspect you did it wrong.

From the output, this is very likely UTF-16-LE with a BOM.

Because of the way you printed them, we can't tell exactly which bytes the first two are, but this is what \xFF and \xFE look like when printed. The rest of the string is NUL bytes alternating with reasonable-looking bytes, which almost always means UTF-16-LE. On top of that, the most common two-byte encoding with a BOM in the wild is UTF-16-LE, and the fact that you're using all Microsoft tools makes that even more likely.
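If you want to check this on the raw bytes instead of eyeballing printed output, something like the following rough sketch would do. The sample data is invented for illustration, and the NUL check only holds for text in the ASCII range:

```python
import codecs

# Hypothetical sample standing in for the raw bytes of the file.
data = codecs.BOM_UTF16_LE + "Hello".encode("utf-16-le")

# codecs.BOM_UTF16_LE is the two bytes b"\xff\xfe".
has_le_bom = data.startswith(codecs.BOM_UTF16_LE)

# After the BOM, ASCII-range text in UTF-16-LE has a NUL as the second
# byte of every two-byte code unit.
body = data[len(codecs.BOM_UTF16_LE):]
high_bytes_all_nul = all(b == 0 for b in body[1::2])

print(has_le_bom, high_bytes_all_nul)
```

For a real file you would read the bytes with `open(path, "rb")` and run the same checks on the first few dozen bytes.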

So, if you had really tried utf-16-le, you would almost certainly have gotten the right string, just with an extra \ufeff at the start.
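A quick illustration of that leftover character, again using a made-up sample and not your data:

```python
# Hypothetical sample: BOM (FF FE) followed by "Hi" in UTF-16-LE.
data = b"\xff\xfeH\x00i\x00"

# 'utf-16-le' fixes the byte order up front, so the BOM is decoded as an
# ordinary character (U+FEFF) instead of being consumed.
text = data.decode("utf-16-le")
print(repr(text))

# Stripping it by hand works, but is easy to forget.
stripped = text.lstrip("\ufeff")
```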

But of course the right answer is to just decode it as 'utf-16', which consumes the BOM and uses it to pick the byte order.
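In code, that looks roughly like this. The file name and contents are placeholders written to a temporary directory so the sketch is self-contained:

```python
import os
import tempfile

# Hypothetical sample: BOM (FF FE) followed by "Hi" in UTF-16-LE.
data = b"\xff\xfeH\x00i\x00"

# Decoding bytes directly: the BOM selects little-endian and is dropped.
text = data.decode("utf-16")

# Reading a file: pass encoding="utf-16" to open() for the same effect.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "sample.txt")  # placeholder file name
    with open(path, "wb") as f:
        f.write(data)
    with open(path, encoding="utf-16") as f:
        from_file = f.read()

print(repr(text), repr(from_file))
```

Both results come back without a leading \ufeff, so no manual stripping is needed.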