Updated: 2023-11-27 12:22:58
I have some files which contain a bunch of different kinds of binary data, and I'm writing a module to deal with these files.
Among other things, they contain UTF-8 encoded strings in the following format: a 2-byte big-endian stringLength (which I parse using struct.unpack()), followed by the string itself. Since it's UTF-8, the length in bytes of the string may be greater than stringLength, and doing read(stringLength) will come up short if the string contains multi-byte characters (not to mention messing up all the other data in the file).
How do I read n UTF-8 characters (as distinct from n bytes) from a file, being aware of the multi-byte properties of UTF-8? I've been googling for half an hour and all the results I've found are either not relevant or make assumptions that I cannot make.
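To make the mismatch concrete, here is a minimal sketch of the format described above (the sample string and variable names are made up for illustration): a 2-byte big-endian character count, then the UTF-8 bytes of the string.

```python
import io
import struct

# Hypothetical example of the format described: a 2-byte big-endian
# character count (stringLength), then the UTF-8 encoded string.
text = u'h\xe9llo'                      # 5 characters
payload = struct.pack('>H', len(text)) + text.encode('utf8')

f = io.BytesIO(payload)
(string_length,) = struct.unpack('>H', f.read(2))
chunk = f.read(string_length)           # reads 5 *bytes*, not 5 characters

# '\xe9' occupies 2 bytes in UTF-8, so the string is 6 bytes long and
# this read comes up one byte short, truncating the last character.
print(string_length)                    # 5
print(len(text.encode('utf8')))         # 6
```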
Given a file object, and a number of characters, you can use:
# build a table mapping lead byte to expected follow-byte count
# bytes 00-BF have 0 follow bytes, F5-FF is not legal UTF8
# C0-DF: 1, E0-EF: 2 and F0-F4: 3 follow bytes.
# leave F5-FF set to 0 to minimize reading broken data.
_lead_byte_to_count = []
for i in range(256):
    _lead_byte_to_count.append(
        1 + (i >= 0xe0) + (i >= 0xf0) if 0xbf < i < 0xf5 else 0)
def readUTF8(f, count):
    """Read `count` UTF-8 characters from file `f`, return as unicode"""
    # Assumes UTF-8 data is valid; leaves it up to the `.decode()` call to validate
    res = []
    while count:
        count -= 1
        lead = f.read(1)
        res.append(lead)
        readcount = _lead_byte_to_count[ord(lead)]
        if readcount:
            res.append(f.read(readcount))
    return (''.join(res)).decode('utf8')
Result of a test:
>>> test = StringIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
>>> readUTF8(test, 41)
u'This is a test containing Unicode data: \ua000'
In Python 3, it is of course much, much easier to just wrap the file object in an io.TextIOWrapper()
object and leave decoding to the native and efficient Python UTF-8 implementation.
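A short sketch of that Python 3 approach, assuming the file object is a binary stream positioned at the start of the string (the sample data here is made up for illustration):

```python
import io

# Wrap the binary stream; TextIOWrapper.read(n) counts *characters*,
# so multi-byte UTF-8 sequences are handled by the decoder.
raw = io.BytesIO(u'This is a test containing Unicode data: \ua000'.encode('utf8'))
wrapped = io.TextIOWrapper(raw, encoding='utf8')
result = wrapped.read(41)
print(result)  # 'This is a test containing Unicode data: \ua000'
```

One caveat: TextIOWrapper buffers, so it may read ahead past the string in the underlying binary stream; if more binary data follows, that read-ahead needs to be accounted for.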