pandas: get unique values from a column of lists

Updated: 2022-10-17 23:19:07

How do I get the unique values of a column of lists in pandas or NumPy, such that the second column (Genre) of the DataFrame would result in 'action', 'crime', 'drama'?

The closest (but non-functional) solutions I could come up with were:

genres = data['Genre'].unique()

But this predictably results in a TypeError, because lists aren't hashable:

TypeError: unhashable type: 'list'

A set seemed like a good idea, but

genres = data.apply(set(), columns=['Genre'], axis=1)

also results in a TypeError: set() takes no keyword arguments
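For reference, the first error in the question can be reproduced in isolation. This is a minimal sketch; the sample data is borrowed from the Timings section of the answer and is only an assumption about what the asker's `data` looked like.

```python
import pandas as pd

# Assumed shape of the asker's data: one column whose cells are lists.
data = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})

try:
    # unique() hashes each cell, and a list cannot be hashed.
    data["Genre"].unique()
    raised = False
    message = ""
except TypeError as exc:
    raised = True
    message = str(exc)

print(raised, message)
```

The failure comes from hashing whole cells, so the lists have to be flattened into individual strings before deduplicating.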

If you only want to find the unique values, I'd recommend using itertools.chain.from_iterable to concatenate all those lists:

import itertools
import numpy as np

>>> np.unique([*itertools.chain.from_iterable(df.Genre)])
array(['action', 'crime', 'drama'], dtype='<U6')

Or, even faster:

>>> set(itertools.chain.from_iterable(df.Genre))
{'action', 'crime', 'drama'}
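Put together as a standalone script, the set-based approach might look like the sketch below. The DataFrame is the same small sample used in the timings; wrapping the result in `sorted` is an addition here, only to get a stable order, since sets are unordered.

```python
import itertools

import pandas as pd

# Same sample data as in the timings: each cell of 'Genre' is a list of strings.
df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})

# chain.from_iterable lazily yields every element of every list in the column.
flat = itertools.chain.from_iterable(df["Genre"])

# A set drops duplicates; sorted() turns it into a list with a stable order.
unique_genres = sorted(set(flat))

print(unique_genres)  # ['action', 'crime', 'drama']
```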

`Timings`

import pandas as pd

df = pd.DataFrame({'Genre':[['crime','drama'],['action','crime','drama']]})
df = pd.concat([df]*10000)

%timeit set(itertools.chain.from_iterable(df.Genre))
100 loops, best of 3: 2.55 ms per loop

%timeit set([x for y in df['Genre'] for x in y])
100 loops, best of 3: 4.09 ms per loop

%timeit np.unique([*itertools.chain.from_iterable(df.Genre)])
100 loops, best of 3: 12.8 ms per loop

%timeit np.unique(df['Genre'].sum())
1 loop, best of 3: 1.65 s per loop

%timeit set(df['Genre'].sum())
1 loop, best of 3: 1.66 s per loop
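Outside IPython, where `%timeit` is not available, a comparison along these lines can be run with the standard `timeit` module. This is a rough sketch, not a rerun of the numbers above: the helper names (`with_chain`, `with_comprehension`, `with_sum`) are made up for illustration, and the frame is repeated 1000 times rather than 10000 so that the slow `sum` variant finishes quickly.

```python
import itertools
import timeit

import pandas as pd

df = pd.DataFrame({"Genre": [["crime", "drama"], ["action", "crime", "drama"]]})
df = pd.concat([df] * 1000)  # smaller than the 10000 repeats used above

def with_chain():
    # Flatten lazily, then deduplicate.
    return set(itertools.chain.from_iterable(df["Genre"]))

def with_comprehension():
    # Flatten with a nested comprehension.
    return {x for y in df["Genre"] for x in y}

def with_sum():
    # Series.sum() adds the lists together into one long list first.
    return set(df["Genre"].sum())

candidates = {
    "chain": with_chain,
    "comprehension": with_comprehension,
    "sum": with_sum,
}

# Check that all variants agree before comparing their speed.
results = {name: fn() for name, fn in candidates.items()}

# Best of 3 repeats, 5 calls each, reported as seconds per call.
timings = {
    name: min(timeit.repeat(fn, number=5, repeat=3)) / 5
    for name, fn in candidates.items()
}

for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e3:.3f} ms per call")
```

Absolute numbers will differ by machine and library version. The `sum` variants are most likely slow because the lists are added one at a time, and each addition copies the list built so far.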
