Python random sample with a generator / iterable / iterator

While Martijn Pieters's answer is correct, it does slow down when samplesize becomes large, because using list.insert in a loop may have quadratic complexity.
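To see why (a minimal sketch of my own, not from the original answer): each list.insert at a random position shifts O(n) elements, so n inserts cost O(n^2) overall. Doubling n roughly quadruples the run time:

import random
import timeit

def insert_n(n):
    results = []
    for i in range(n):
        # Each insert at a random position shifts O(n) elements on average.
        results.insert(random.randint(0, i), i)
    return results

# The roughly 4x growth per doubling is the signature of quadratic cost:
for n in (10000, 20000, 40000):
    print(n, timeit.timeit(lambda: insert_n(n), number=1))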

Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:

import random

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in range(samplesize):
            results.append(next(iterator))
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    # Continue through the iterator, replacing existing items
    # with probability samplesize / (i + 1), as in reservoir sampling:
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results
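For example, it works directly on a generator, which random.sample cannot consume without first materializing the whole population as a list (a small usage sketch of my own, not part of the original answer):

def squares():
    # An iterator over one million squares; never held in memory all at once.
    for i in range(1000000):
        yield i * i

sample = iter_sample_fast(squares(), 5)
print(sample)  # five uniformly drawn squares, in random order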

The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000), i.e. a population of 1,000,000 and a sample of 100,000:

  • iterSample: 5.05 s
  • iter_sample_fast: 2.64 s
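These figures come from the original answer (whose code was written for Python 2); a minimal harness like the following, my own sketch, could reproduce the measurement for iter_sample_fast on current hardware, where the absolute numbers will differ:

import time

start = time.perf_counter()
sample = iter_sample_fast(iter(range(1000000)), 100000)
elapsed = time.perf_counter() - start
print(f"iter_sample_fast: {elapsed:.2f} s ({len(sample)} items)")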