且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

高效查找字符串中的重复字符

更新时间:2023-02-18 16:46:58

由于这是一个性能问题,让我们做一些时间安排:

As this is a performance question, let's do some timings:

def test_set(xs):
    seen = set()  # O(1) lookups
    for x in xs:
        if x not in seen:
            seen.add(x)
        else:
            return x

import collections

def test_counter(xs):
    freq = collections.Counter(xs)
    for k in freq:
        if freq[k] > 1:
            return k

def test_dict(xs):
    d = {}
    for x in xs:
        if x in d:
            return x
        d[x] = 1

def test_sort(xs):
    ys = sorted(xs)

    for n in range(1, len(xs)):
        if ys[n] == ys[n-1]:
            return ys[n]

##

import sys, timeit
print (sys.version + "\n")
xs = list(range(10000)) + [999]
fns = [p for name, p in globals().items() if name.startswith('test')]
for fn in fns:
    assert fn(xs) == 999
    print ('%50s %.5f' % (fn, timeit.timeit(lambda: fn(xs), number=100)))

我正在测试一个整数列表而不是一个字符串(因为使用一个字符串你不能得到超过 256 个循环).我机器上的结果是这样的:

I'm testing on an list of integers rather than a string (because with a string you can't get more than 256 loops). The results on my machine look like this:

3.2.3 (v3.2.3:3d0686d90f55, Apr 10 2012, 11:25:50) 
[GCC 4.2.1 (Apple Inc. build 5666) (dot 3)]

                <function test_set at 0x1020f7380> 0.19265
               <function test_dict at 0x1020f7490> 0.12725
               <function test_sort at 0x1020f7518> 0.04683
            <function test_counter at 0x1020f7408> 0.92485

所以排序方法似乎是赢家.我想这是因为它不会浪费时间创建哈希和分配 dict/set 结构.此外,如果您不关心正在更改的源列表,您可以执行 xs.sort() 而不是 ys = sorted(xs),这会给您零内存占用.

So the sort method appears to be the winner. I guess this is because it doesn't waste time creating hashes and allocating dict/set structures. Also, if you don't care about the source list being changed, you can do xs.sort() instead of ys = sorted(xs), which gives you zero memory footprint.

另一方面,如果重复项更有可能出现在输入的开头(如xs = 'abcdef' * 10000),则set方法将执行***,因为它与 sortCounter 不同,一旦找到重复项就立即返回并且不需要预处理整个列表.如果您需要 first 重复元素,而不仅仅是其中一个,您还应该使用 set.

On the other side, if repeated items are more probable to occur towards the beginning of the input (as in xs = 'abcdef' * 10000), the set method will perform the best, as it, unlike sort or Counter, returns immediately once a repeat is found and doesn't need to preprocess the whole list. You should also use set if you need the first repeating element, not just one of them.

Counter 是一个不错的工具,但它不是为性能而设计的,所以如果你真的必须处理巨大的输入",请使用集合(如果它们适合内存)或归并排序(如果它们)不要.

Counter is a nice tool, but it's not designed for performance, so if you really have to deal with "gigantic inputs", go with sets (if they fit in memory) or mergesort if they don't.