且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

获得列表联合的最快方法 - Python

更新时间:2023-11-25 22:04:34

什么最快取决于 x 的性质 -- 无论是长列表还是短列表,子列表很多还是很少子列表,子列表的长短,重复多还是少.

以下是一些比较一些替代方案的时间结果.有很多可能性,这绝不是一个完整的分析,但也许这将为您提供一个研究用例的框架.

func |× |时间unique_concatenate |many_uniques |0.863empty_set_union |many_uniques |1.191short_set_union_rest |many_uniques |1.192long_set_union_rest |many_uniques |1.194set_chain |many_uniques |1.224功能 |× |时间long_set_union_rest |many_duplicates |0.958short_set_union_rest |many_duplicates |0.969empty_set_union |many_duplicates |0.971set_chain |many_duplicates |1.128unique_concatenate |many_duplicates |2.411功能 |× |时间empty_set_union |many_small_lists |1.023long_set_union_rest |many_small_lists |1.028set_chain |many_small_lists |1.032short_set_union_rest |many_small_lists |1.036unique_concatenate |many_small_lists |1.351功能 |× |时间long_set_union_rest |几个大列表|0.791empty_set_union |几个大列表|0.813unique_concatenate |几个大列表|0.814set_chain |几个大列表|0.829short_set_union_rest |几个大列表|0.849

请务必在您自己的机器上运行 timeit 基准测试,因为结果可能会有所不同.

from __future__ import print_function随机导入导入时间从 itertools 导入链将 numpy 导入为 npdef unique_concatenate(x):返回 np.unique(np.concatenate(x))def short_set_union_rest(x):# 这里假设 x[0] 是 x 中最短的列表返回列表(set(x[0]).union(*x[1:]))def long_set_union_rest(x):# 这里假设 x[-1] 是 x 中最长的列表返回列表(set(x[-1]).union(*x[1:]))def empty_set_union(x):返回列表(set().union(*x))def set_chain(x):返回列表(设置(链(* x)))big_range = 列表(范围(10**7))small_range = 列表(范围(10**5))many_uniques = [[random.choice(big_range) for i in range(j)]对于范围内的 j (10, 10000, 10)]many_duplicates = [[random.choice(small_range) for i in range(j)]对于范围内的 j (10, 10000, 10)]many_small_lists = [[random.choice(big_range) for i in range(10)]对于范围内的 j (10, 10000, 10)]很少的大列表 = [[random.choice(big_range) for i in range(1000)]对于范围内的 j (10, 100, 10)]如果 __name__=='__main__':对于 x, n 在 [('many_uniques', 1), ('many_duplicates', 4),('many_small_lists', 800), ('few_large_lists', 800)]:时间 = dict()对于 func 在 ['unique_concatenate', 'short_set_union_rest', 'long_set_union_rest','empty_set_union', 'set_chain']:计时[func, x] = timeit.timeit('{}({})'.format(func, x), number=n,setup='from __main__ import {}, {}'.format(func, x))打印('{:20} | {:20} | {}'.format('func', 'x', 'time'))对于 key, t in sorted(timing.items(), key=lambda item: item[1]):功能,x = 键打印('{:20} | {:20} | {:.3f}'.format(func, x, t))打印(结束='\n')

There's a C++ comparison to get union of lists from lists of lists: The fastest way to find union of sets

And there's several other python related questions but none suggest the fastest way to unionize the lists:

From the answers, I've gathered that there are at least 2 ways to do it:

>>> from itertools import chain
>>> x = [[1,2,3], [3,4,5], [1,7,8]]
>>> list(set().union(*x))
[1, 2, 3, 4, 5, 7, 8]
>>> list(set(chain(*x)))
[1, 2, 3, 4, 5, 7, 8]

Note that I'm casting the set to list afterwards because I need the order of the list to be fixed for further processing.

After some comparison, it seems like list(set(chain(*x))) is more stable and takes less time:

from itertools import chain
import time
import random

# Dry run.
x = [[random.choice(range(10000)) 
    for i in range(10)] for j in range(10)]
list(set().union(*x))
list(set(chain(*x)))

y_time = 0
z_time = 0

for _ in range(1000):
    x = [[random.choice(range(10000)) 
        for i in range(10)] for j in range(10)]
    start = time.time()
    y = list(set().union(*x))
    y_time += time.time() - start 
    #print 'list(set().union(*x)):\t', y_time
    start = time.time()
    z = list(set(chain(*x)))
    z_time += time.time() - start 
    #print 'list(set(chain(*x))):\t', z_time
    assert sorted(y) == sorted(z)
    #print 

print y_time / 1000.
print z_time / 1000. 

[out]:

1.39586925507e-05
1.09834671021e-05

Taking out the variable of casting sets to list:

y_time = 0
z_time = 0

for _ in range(1000):
    x = [[random.choice(range(10000)) 
        for i in range(10)] for j in range(10)]

    start = time.time()
    y = set().union(*x)
    y_time += time.time() - start 

    start = time.time()
    z = set(chain(*x))
    z_time += time.time() - start 

    assert sorted(y) == sorted(z)

print y_time / 1000.
print z_time / 1000. 

[out]:

1.22241973877e-05
1.02684497833e-05

Here's the full output when I try to print the intermediate timings (without list casting): http://pastebin.com/raw/y3i6dXZ8

Why is it that list(set(chain(*x))) takes less time than list(set().union(*x))?

Is there another way of achieving the same union of lists? Using numpy or pandas or sframe or something? Is the alternative faster?

What's fastest depends on the nature of x -- whether it is a long list or a short list, with many sublists or few sublists, whether the sublists are long or short, and whether there are many duplicates or few duplicates.

Here are some timeit results comparing some alternatives. There are so many possibilities that this is by no means a complete analysis, but perhaps this will give you a framework for studying your use case.

func                 | x                    | time
unique_concatenate   | many_uniques         | 0.863
empty_set_union      | many_uniques         | 1.191
short_set_union_rest | many_uniques         | 1.192
long_set_union_rest  | many_uniques         | 1.194
set_chain            | many_uniques         | 1.224

func                 | x                    | time
long_set_union_rest  | many_duplicates      | 0.958
short_set_union_rest | many_duplicates      | 0.969
empty_set_union      | many_duplicates      | 0.971
set_chain            | many_duplicates      | 1.128
unique_concatenate   | many_duplicates      | 2.411

func                 | x                    | time
empty_set_union      | many_small_lists     | 1.023
long_set_union_rest  | many_small_lists     | 1.028
set_chain            | many_small_lists     | 1.032
short_set_union_rest | many_small_lists     | 1.036
unique_concatenate   | many_small_lists     | 1.351

func                 | x                    | time
long_set_union_rest  | few_large_lists      | 0.791
empty_set_union      | few_large_lists      | 0.813
unique_concatenate   | few_large_lists      | 0.814
set_chain            | few_large_lists      | 0.829
short_set_union_rest | few_large_lists      | 0.849

Be sure to run the timeit benchmarks on your own machine since results may vary.


from __future__ import print_function
import random
import timeit
from itertools import chain
import numpy as np

def unique_concatenate(x):
    return np.unique(np.concatenate(x))

def short_set_union_rest(x):
    # This assumes x[0] is the shortest list in x
    return list(set(x[0]).union(*x[1:]))

def long_set_union_rest(x):
    # This assumes x[-1] is the longest list in x
    return list(set(x[-1]).union(*x[1:]))

def empty_set_union(x):
    return list(set().union(*x))

def set_chain(x):
    return list(set(chain(*x)))


big_range = list(range(10**7))
small_range = list(range(10**5))
many_uniques = [[random.choice(big_range) for i in range(j)] 
                for j in range(10, 10000, 10)]
many_duplicates = [[random.choice(small_range) for i in range(j)] 
              for j in range(10, 10000, 10)]
many_small_lists = [[random.choice(big_range) for i in range(10)] 
                    for j in range(10, 10000, 10)]
few_large_lists = [[random.choice(big_range) for i in range(1000)] 
                    for j in range(10, 100, 10)]

if __name__=='__main__':
    for x, n in [('many_uniques', 1), ('many_duplicates', 4), 
                 ('many_small_lists', 800), ('few_large_lists', 800)]:
        timing = dict()
        for func in [
                'unique_concatenate', 'short_set_union_rest', 'long_set_union_rest',
                'empty_set_union', 'set_chain']:
            timing[func, x] = timeit.timeit(
                '{}({})'.format(func, x), number=n,
                setup='from __main__ import {}, {}'.format(func, x))
        print('{:20} | {:20} | {}'.format('func', 'x', 'time'))
        for key, t in sorted(timing.items(), key=lambda item: item[1]):
            func, x = key
            print('{:20} | {:20} | {:.3f}'.format(func, x, t))
        print(end='\n')