且构网


Pitfalls in my code for detecting text file encoding with Python?

Updated: 2023-02-20 11:00:39


I know more about bicycle repair, chainsaw use and trench safety than I do Python or text encoding; with that in mind...

Python text encoding seems to be a perennial issue (my own question: "Searching text files' contents with various encodings with Python?", and others I've read: 1, 2). I've taken a crack at writing some code, shown below, to guess the encoding.

In limited testing this code seems to work for my purposes* without my having to know much about the first three bytes of a text file (the byte order mark) or the situations where those bytes aren't informative.

*My purposes are:

  1. Have a dependency-free snippet I can use with a moderate-to-high degree of success.
  2. Scan a local workstation for text-based log files of any encoding and identify the ones I am interested in based on their contents (which requires opening each file with the proper encoding).
  3. Take on the challenge of getting this to work.

Question: What are the pitfalls of using what I assume to be a klutzy method of comparing and counting characters, as I do below? Any input is greatly appreciated.

def guess_encoding_debug(file_path):
    """
    DEBUG - returns many 2 value tuples
    Will return list of all possible text encodings with a count of the number of chars
    read that are common characters, which might be a symptom of success.
    SEE warnings in sister function
    """

    import codecs
    import string
    from operator import itemgetter

    READ_LEN = 1000
    ENCODINGS = ['ascii', 'cp1252', 'mac_roman', 'utf_8', 'utf_16', 'utf_16_le',
                 'utf_16_be', 'utf_32', 'utf_32_le', 'utf_32_be']

    #chars in the regular ascii printable set are BY FAR the most common
    #in most files written in English, so their presence suggests the file
    #was decoded correctly.
    nonsuspect_chars = string.printable

    #to be a list of 2 value tuples
    results = []

    for e in ENCODINGS:
        #some encodings will cause an exception with an incompatible file,
        #they are invalid encoding, so use try to exclude them from results[]
        try:
            with codecs.open(file_path, 'r', e) as f:

                #sample from the beginning of the file
                data = f.read(READ_LEN)

                nonsuspect_sum = 0

                #count the number of printable ascii chars in the
                #READ_LEN sized sample of the file
                for n in nonsuspect_chars:
                    nonsuspect_sum += data.count(n)

                #if there are more chars than READ_LEN
                #the encoding is wrong and bloating the data
                if nonsuspect_sum <= READ_LEN:
                    results.append((e, nonsuspect_sum))
        except (UnicodeError, LookupError):
            #decoding failed or the codec is unknown, so skip this encoding
            pass

    #sort results descending based on nonsuspect_sum portion of
    #tuple (itemgetter index 1).
    results = sorted(results, key=itemgetter(1), reverse=True)

    return results


def guess_encoding(file_path):
    """
    Stupid, simple, slow, brute and yet slightly accurate text file encoding guessing.
    Will return one likely text encoding, though there may be others just as likely.
    WARNING: DO NOT use if your file uses any significant number of characters
             outside the standard ASCII printable characters!
    WARNING: DO NOT use for critical applications, this code will fail you.
    """

    results = guess_encoding_debug(file_path)

    #return the encoding name (index 0 of the pair) from the
    #top-scoring result (index 0 of the sorted list)
    return results[0][0]

I am assuming it would be slow compared to chardet, which I am not particularly familiar with, and also less accurate. The way it is designed, text in any Roman-alphabet language that uses accents, umlauts, etc. will not work, at least not well, and it will be hard to know when it fails. However, most text in English, including most programming code, is written largely with the characters in string.printable, on which this code depends.
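To make the accent problem concrete, here is a minimal sketch that applies the same printable-character count to a short, made-up French phrase stored as UTF-8. The sample text and the `printable_count` helper are illustrative assumptions, not part of the functions above.

```python
import string

# Hypothetical sample: a short French phrase stored as UTF-8 bytes.
raw = u"caf\u00e9 cr\u00e8me".encode("utf_8")

def printable_count(text):
    """Count characters that fall in string.printable (same idea as above)."""
    return sum(1 for ch in text if ch in string.printable)

scores = {}
for enc in ("ascii", "cp1252", "mac_roman", "utf_8"):
    try:
        scores[enc] = printable_count(raw.decode(enc))
    except UnicodeError:
        # ascii cannot decode bytes >= 0x80, so it drops out here
        pass

print(scores)
```

The three encodings that survive all score 8, because the wrong single-byte decodings turn each accented letter into two non-ASCII characters without touching the ASCII ones. With a tie, the sort in `guess_encoding_debug` keeps the order of `ENCODINGS`, so cp1252 would be reported for a file that is really UTF-8.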

External libraries may be an option in the future, but for now I want to avoid them because:

  1. This script will be run on multiple company computers, on and off the network, with various versions of Python, so the fewer complications the better. When I say 'company' I mean a small non-profit of social scientists.
  2. I am in charge of collecting the logs from GPS data processing, but I am not the systems administrator. She is not a Python programmer, and the less of her time I take the better.
  3. The installation of Python that is generally available at my company is installed with a GIS software package, and is generally better when left alone.
  4. My requirements aren't too strict: I just want to identify the files I am interested in and use other methods to copy them to an archive. I am not reading the full contents into memory to manipulate, append to, or rewrite them.
  5. It seems like a high-level programming language should have some way of accomplishing this on its own. While "seems like" is a shaky foundation for any endeavor, I wanted to try and see if I could get it to work.
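On point 5: as far as I know the standard library has no encoding guesser, but the `codecs` module does expose the byte order mark constants, so a dependency-free BOM check is possible before any character counting. The sketch below is only that; `sniff_bom` and `sniff_file` are hypothetical names, and files without a BOM (most ASCII, cp1252 and plain UTF-8 files) simply return None.

```python
import codecs

# Check UTF-32 first: the UTF-32 LE mark begins with the UTF-16 LE mark.
_BOMS = [
    (codecs.BOM_UTF32_LE, "utf_32"),
    (codecs.BOM_UTF32_BE, "utf_32"),
    (codecs.BOM_UTF8, "utf_8_sig"),
    (codecs.BOM_UTF16_LE, "utf_16"),
    (codecs.BOM_UTF16_BE, "utf_16"),
]

def sniff_bom(first_bytes):
    """Return a codec name if first_bytes starts with a known BOM, else None."""
    for bom, name in _BOMS:
        if first_bytes.startswith(bom):
            return name
    return None

def sniff_file(file_path):
    """Read the first four bytes of a file and check them for a BOM."""
    with open(file_path, "rb") as f:
        return sniff_bom(f.read(4))
```

The returned names (`utf_8_sig`, `utf_16`, `utf_32`) are the codecs that consume the BOM while decoding, so the mark does not show up as a stray character in the text.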

Probably the simplest way to find out how well your code works is to take the test suites of the other existing libraries and use those as a base for your own comprehensive test suite. Then you will know if your code works for all of those cases, and you can also test for all of the cases you care about.
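Short of borrowing another library's test files, a round-trip check is a small starting point. The sketch below writes a sample string to temporary files in several encodings and reports the cases where a guesser's answer does not decode the file back to the original. `check_guesser`, `SAMPLE` and `CASES` are hypothetical names, and the sample line is made up.

```python
import os
import tempfile

# Hypothetical sample text and encodings; swap in the cases you care about.
SAMPLE = u"2023-02-20 11:00:39 GPS fix acquired\n" * 20
CASES = ["ascii", "utf_8", "utf_16", "utf_32"]

def check_guesser(guess, sample=SAMPLE, cases=CASES):
    """Write sample in each encoding, call guess(path), collect failures.

    A guess passes if decoding the file with it reproduces the sample,
    since several codec names can be equally valid for the same bytes.
    """
    failures = []
    for enc in cases:
        fd, path = tempfile.mkstemp(suffix=".log")
        try:
            with os.fdopen(fd, "wb") as f:
                f.write(sample.encode(enc))
            guessed = guess(path)
            with open(path, "rb") as f:
                decoded = f.read().decode(guessed)
            if decoded != sample:
                failures.append((enc, guessed))
        except (UnicodeError, LookupError):
            # the guessed codec could not decode the file at all
            failures.append((enc, None))
        finally:
            os.remove(path)
    return failures
```

Passing your own `guess_encoding` as `guess` returns a list of `(actual, guessed)` pairs for the cases it gets wrong; an empty list means every round trip succeeded. Adding non-English samples to `SAMPLE` is where the interesting failures are likely to appear.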