While Martijn Pieters's answer is correct, it does slow down when samplesize becomes large, because using list.insert in a loop may have quadratic complexity.
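To see where the quadratic cost comes from, here is a minimal sketch of the insert-in-a-loop pattern (an illustration of the pattern, not Martijn Pieters's exact code): each list.insert shifts every element behind the insertion point, so filling samplesize items this way does on the order of samplesize**2 element moves.

import random

# Illustrative only: inserting the i-th item into a list of length i
# costs O(i) element shifts, so the whole fill phase is O(samplesize**2).
def fill_by_insert(iterator, samplesize):
    results = []
    for i in range(samplesize):
        results.insert(random.randint(0, i), next(iterator))  # O(i) shift
    return results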
Here's an alternative that, in my opinion, preserves the uniformity while increasing performance:
import random

def iter_sample_fast(iterable, samplesize):
    results = []
    iterator = iter(iterable)
    # Fill in the first samplesize elements:
    try:
        for _ in range(samplesize):
            results.append(next(iterator))
    except StopIteration:
        raise ValueError("Sample larger than population.")
    random.shuffle(results)  # Randomize their positions
    for i, v in enumerate(iterator, samplesize):
        r = random.randint(0, i)
        if r < samplesize:
            results[r] = v  # at a decreasing rate, replace random items
    return results
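A quick usage sketch, assuming iter_sample_fast as defined above is in scope (the generator expression is just an arbitrary one-pass input):

squares = (x * x for x in range(1000000))  # one-pass iterable; len() would not work here
sample = iter_sample_fast(squares, 5)
print(sample)  # five squares, each chosen uniformly at random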
The difference slowly starts to show for samplesize values above 10000. Times for calling with (1000000, 100000):
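A minimal sketch of how such a timing can be reproduced, assuming iter_sample_fast from above is defined at module level (the range iterator stands in for any million-element iterable):

import timeit

# One call with a 1,000,000-element iterator and samplesize=100000,
# matching the (1000000, 100000) arguments above.
elapsed = timeit.timeit(
    "iter_sample_fast(iter(range(1000000)), 100000)",
    globals=globals(),
    number=1,
)
print(f"iter_sample_fast: {elapsed:.3f} s")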