Updated: 2022-12-30 17:03:35
Don't sort 10 million lines in memory. Split the work into batches instead:
Run 100 sorts of 100k lines each (using the file as an iterator, combined with itertools.islice()
or similar to pick each batch). Write the sorted batches out to separate files elsewhere.
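The batching step can be sketched roughly like this (the `sort_in_batches` helper and the file names are illustrative, not part of the original answer):

```python
import itertools
import os

def sort_in_batches(path, batch_size=100_000, out_dir="."):
    """Sort `path` in chunks of `batch_size` lines; return the batch file names."""
    batch_files = []
    with open(path) as src:
        for n in itertools.count():
            # islice pulls at most batch_size lines; the rest of the file stays on disk.
            batch = sorted(itertools.islice(src, batch_size))
            if not batch:
                break  # the source file is exhausted
            name = os.path.join(out_dir, f"batch_{n:04d}.txt")
            with open(name, "w") as out:
                out.writelines(batch)
            batch_files.append(name)
    return batch_files
```

Each batch file is sorted internally; the merge step below combines them into one globally sorted output.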
Merge the sorted files. Here is a merge generator: pass it your 100 open files and it will yield lines in sorted order. Write them to a new file line by line:
import operator

def mergeiter(*iterables, **kwargs):
    """Given a set of sorted iterables, yield the next value in merged order.

    Takes an optional `key` callable to compare values by.
    """
    # Map each input to [current value, index, iterator]; skip empty inputs
    # so that next() can't raise StopIteration inside the generator (PEP 479).
    entries = {}
    for i, it in enumerate(map(iter, iterables)):
        try:
            entries[i] = [next(it), i, it]
        except StopIteration:
            pass
    if 'key' not in kwargs:
        key = operator.itemgetter(0)
    else:
        key = lambda item, key=kwargs['key']: key(item[0])

    while entries:
        # Pick the smallest current value among the still-active iterators.
        value, i, it = min(entries.values(), key=key)
        yield value
        try:
            entries[i][0] = next(it)  # advance the iterator we just consumed
        except StopIteration:
            del entries[i]  # this input is exhausted
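On a recent Python, note that the standard library already ships an equivalent: `heapq.merge` performs exactly this k-way merge of sorted inputs. A minimal sketch, with `StringIO` objects standing in for the open batch files:

```python
import heapq
from io import StringIO

# Two already-sorted "files" (StringIO stands in for open file handles).
a = StringIO("apple\ncherry\n")
b = StringIO("banana\ndate\n")

merged = list(heapq.merge(a, b))
# merged == ['apple\n', 'banana\n', 'cherry\n', 'date\n']
```

In the real pipeline you would pass the 100 open batch files and `writelines` the merged stream straight into the final output file, so only one line per input file is held in memory at a time.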