Split a CSV file into equal parts?

Updated: 2022-04-25 21:34:39

As I said in a comment, csv files need to be split on row (or line) boundaries. Your code doesn't do this and can break a row somewhere in the middle, which I suspect is the cause of your _csv.Error.

The following avoids doing that by processing the input file as a series of lines. I've tested it and it seems to work standalone: it divided the sample file into approximately equal-sized chunks. The sizes are only approximate because it's unlikely that a whole number of rows will fit exactly into a chunk.
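
To make the row-boundary point concrete, here is a minimal Python 3 sketch on an in-memory string rather than a file. The sample data and the helper name split_on_lines are made up for illustration and are not part of the code in this answer. It also assumes no quoted field contains an embedded newline, since a purely line-based split would cut such a row in two.

```python
import csv
import io

# Illustrative sample data (not the sample file from the question).
data = "id,name\n1,alexander\n2,bob\n3,carol\n"

# Naive split at the midpoint offset: the cut lands inside the "alexander" row.
mid = len(data) // 2
naive_parts = [data[:mid], data[mid:]]

def split_on_lines(text, num_chunks):
    """Group whole lines into num_chunks pieces of roughly equal size."""
    target = len(text) // num_chunks
    chunks, current = [], ""
    for line in text.splitlines(keepends=True):
        current += line
        # Close the chunk once it reaches the target; the last chunk takes the rest.
        if len(current) >= target and len(chunks) < num_chunks - 1:
            chunks.append(current)
            current = ""
    if current:
        chunks.append(current)
    return chunks

line_parts = split_on_lines(data, 2)

# Every line-based part parses cleanly as complete rows.
for part in line_parts:
    print(list(csv.reader(io.StringIO(part))))
```

The naive split leaves the first part ending in a partial row, while each line-based part ends on a newline and parses as complete rows.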

Update

This is a substantially faster version of the code than I originally posted. It now uses the temp file's own tell() method to determine the constantly changing length of the file as it's being written, instead of calling os.path.getsize(). That eliminates the need to flush() the file and call os.fsync() on it after each row is written.
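
As a small aside, the difference between the two ways of measuring can be seen in isolation. The sketch below is Python 3 and uses a named temp file so that os.path.getsize() has a path to look at; the variable names are illustrative only. With default buffering the on-disk size usually lags behind until the file is flushed, whereas tell() already reflects what has been written.

```python
import os
import tempfile

with tempfile.NamedTemporaryFile() as tmp:
    tmp.write(b"a,b,c\n")

    # tell() reports the logical write position, buffered data included.
    pos = tmp.tell()

    # The on-disk size may still be smaller (typically 0) before a flush.
    size_before_flush = os.path.getsize(tmp.name)

    tmp.flush()
    size_after_flush = os.path.getsize(tmp.name)

print(pos, size_before_flush, size_after_flush)
```

The full listing follows.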

import csv
import multiprocessing
import os
import tempfile

def split(infilename, num_chunks=multiprocessing.cpu_count()):
    """Split infilename into num_chunks temp files, breaking only on line boundaries."""
    READ_BUFFER = 2**13  # 8 KiB read buffer for the input file
    in_file_size = os.path.getsize(infilename)
    print 'in_file_size:', in_file_size
    chunk_size = in_file_size // num_chunks  # target size, not an exact one
    print 'target chunk_size:', chunk_size
    files = []
    with open(infilename, 'rb', READ_BUFFER) as infile:
        for _ in xrange(num_chunks):
            temp_file = tempfile.TemporaryFile()
            # copy whole lines until this chunk reaches the target size
            while temp_file.tell() < chunk_size:
                try:
                    temp_file.write(infile.next())
                except StopIteration:  # end of infile
                    break
            temp_file.seek(0)  # rewind
            files.append(temp_file)
    return files

files = split("sample_simple.csv", num_chunks=4)
print 'number of files created: {}'.format(len(files))

for i, ifile in enumerate(files, start=1):
    # fstat() on the descriptor works even where the temp file has no usable name
    print 'size of temp file {}: {}'.format(i, os.fstat(ifile.fileno()).st_size)
    print 'contents of file {}:'.format(i)
    reader = csv.reader(ifile)
    for row in reader:
        print row
    print ''
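
The listing above is Python 2 (print statements, xrange, infile.next()). For readers on Python 3, the following is a rough port, offered as a sketch rather than as the original author's code. The names split_py3 and read_rows are hypothetical, the default of 4 chunks and the UTF-8 encoding are assumptions, and one behavior differs on purpose: the last chunk takes whatever remains in the input, so a few trailing bytes cannot be left unread when earlier chunks land exactly on the target size.

```python
import csv
import io
import os
import tempfile

def split_py3(infilename, num_chunks=4):
    """Split infilename into num_chunks temp files on line boundaries."""
    chunk_size = os.path.getsize(infilename) // num_chunks  # target size
    files = []
    with open(infilename, "rb") as infile:
        for i in range(num_chunks):
            temp_file = tempfile.TemporaryFile()  # binary mode by default
            is_last = (i == num_chunks - 1)
            # Copy whole lines until the target is reached; the last
            # chunk has no limit and takes everything that is left.
            while is_last or temp_file.tell() < chunk_size:
                line = infile.readline()
                if not line:  # end of infile
                    break
                temp_file.write(line)
            temp_file.seek(0)  # rewind (also flushes buffered writes)
            files.append(temp_file)
    return files

def read_rows(temp_file, encoding="utf-8"):
    """Parse one binary temp file as CSV; the encoding is an assumption."""
    text = io.TextIOWrapper(temp_file, encoding=encoding, newline="")
    try:
        return list(csv.reader(text))
    finally:
        text.detach()  # leave temp_file open for the caller
```

Calling split_py3("sample_simple.csv", num_chunks=4) and passing each returned file to read_rows mirrors the loop above. The chunks are kept in binary mode so that tell() counts bytes, which matches the byte size reported by os.path.getsize(); decoding happens only when the rows are read back.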