Reading binary and text from the same file in Python

Updated: 2022-12-20 08:33:55


How does one read binary and text from the same file in Python? I know how to do each separately, and can imagine doing both very carefully, but not both with the built-in IO library directly.

So I have a file that has a format with large chunks of UTF-8 text interspersed with binary data. The text does not have a length written before it, nor a special character like "\0" delineating it from the binary data; however, there is a large portion of text near the end that, when parsed, means "we are coming to an end".

The optimal solution would be to have the built-in file reading classes have "read(n)" and "read_char(n)" methods, but alas they don't. I can't even open the file twice, once as text and once as binary, since the return value of tell() on the text one can't be used with the binary one in any meaningful way.

So my first idea would be to open the whole file as binary and when I reach a chunk of text, read it "character by character" until I realize that the text is ending and then go back to reading it as binary. However this means that I have to read byte-by-byte and do my own decoding of UTF-8 characters (do I need to read another byte for this character before doing something with it?). If it was a fixed-width character encoding I would just read that many bytes each time. In the end I would also like the universal line endings as supported by the Python text-readers, but that would be even more difficult to implement while reading byte-by-byte.
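To answer the "do I need to read another byte?" question for UTF-8 specifically: the first byte of each sequence already announces how many bytes the character occupies, so no trial-and-error is needed. A small sketch of that rule (the helper name is my own, not from any library):

```python
def utf8_seq_len(first_byte: int) -> int:
    """Total length in bytes of the UTF-8 sequence starting with first_byte."""
    if first_byte < 0x80:    # 0xxxxxxx: ASCII, 1 byte
        return 1
    if first_byte >= 0xF0:   # 11110xxx: start of a 4-byte sequence
        return 4
    if first_byte >= 0xE0:   # 1110xxxx: start of a 3-byte sequence
        return 3
    if first_byte >= 0xC0:   # 110xxxxx: start of a 2-byte sequence
        return 2
    raise ValueError("continuation byte cannot start a sequence")

print(utf8_seq_len("a".encode("utf-8")[0]))  # -> 1
print(utf8_seq_len("é".encode("utf-8")[0]))  # -> 2
print(utf8_seq_len("✓".encode("utf-8")[0]))  # -> 3
```

With this, a byte-oriented reader can read the lead byte, compute the length, and read exactly the remaining continuation bytes before decoding.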

Another easier solution would be if I could ask the text file object its real offset in the file. That alone would solve all my problems.

One way might be to use Hachoir to define a file parsing protocol.

The simple alternative is to open the file in binary mode and manually initialise a buffer and text wrapper around it. You can then switch in and out of binary pretty neatly:

import io

my_file = io.open("myfile.txt", "rb")
my_file_buffer = io.BufferedReader(my_file, buffer_size=1) # not performant, but a larger buffer would read ahead and "eat" into the binary data
my_file_text_reader = io.TextIOWrapper(my_file_buffer, encoding="utf-8")
string_buffer = ""

while True:
    while "near the end" not in string_buffer:
        string_buffer += my_file_text_reader.read(1) # read one Unicode char at a time

    # binary data must be next. Where do we get the binary length from?
    print(string_buffer)
    data = my_file_buffer.read(3)

    print(data)
    string_buffer = ""

A quicker, less extensible way might be to use the approach you've suggested in your question: intelligently parse the text portions, reading one UTF-8 byte sequence at a time. The following code (from http://rosettacode.org/wiki/Read_a_file_character_by_character/UTF8#Python) seems to be a neat way to conservatively read UTF-8 bytes into characters from a binary file:

def get_next_character(f):
    # note: assumes valid utf-8
    c = f.read(1)
    while c:
        while True:
            try:
                yield c.decode('utf-8')
            except UnicodeDecodeError:
                # we've encountered a multibyte character
                # read another byte and try again
                c += f.read(1)
            else:
                # c was a valid char, and was yielded, continue
                c = f.read(1)
                break

# Usage:
with open("input.txt", "rb") as f:
    my_unicode_str = ""
    for c in get_next_character(f):
        my_unicode_str += c
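
To see the generator cope with multibyte sequences, here is a standalone run against an in-memory stream (the generator definition is repeated so the snippet runs on its own; the sample string is made up and mixes 1-, 2- and 3-byte UTF-8 characters):

```python
import io

# Definition repeated from above so this snippet is self-contained.
def get_next_character(f):
    # note: assumes valid utf-8
    c = f.read(1)
    while c:
        while True:
            try:
                yield c.decode('utf-8')
            except UnicodeDecodeError:
                # multibyte character: read another byte and retry
                c += f.read(1)
            else:
                c = f.read(1)
                break

sample = "héllo ✓"  # 'h' is 1 byte, 'é' is 2 bytes, '✓' is 3 bytes in UTF-8
f = io.BytesIO(sample.encode("utf-8"))
decoded = "".join(get_next_character(f))
print(decoded)  # -> héllo ✓
```

Each failed decode simply pulls in one more byte until the accumulated bytes form a complete character, which is why this works without inspecting lead bytes at all.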