且构网

Sharing all things programming and development...

pandas: unique values across multiple columns

Updated: 2023-02-16 15:38:34

One way is to select the columns and pass them to np.unique, which flattens its input and returns the unique values in sorted order:

>>> np.unique(df[['Col1', 'Col2']])
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)

Note that some versions of pandas/NumPy may require you to pass the underlying array explicitly, using the .values attribute of the selected columns:

>>> np.unique(df[['Col1', 'Col2']].values)
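
The snippets above assume an existing `df` that the answer never shows. As a minimal, self-contained sketch, the frame below is a hypothetical reconstruction chosen only so that it reproduces the outputs quoted in this answer; the `Col3` values are placeholders.

```python
import numpy as np
import pandas as pd

# Hypothetical sample frame (an assumption, not taken from the original post).
df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
    "Col3": [0.1, 0.2, 0.3, 0.4, 0.5],  # placeholder numbers, unused below
})

# Extract the two name columns as a plain 2D NumPy array of shape (5, 2).
names = df[["Col1", "Col2"]].values

# np.unique flattens its input by default and returns sorted unique values.
unique_sorted = np.unique(names)
print(unique_sorted)
```

Because the result is sorted, it does not reflect the order in which the names appear in the frame.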

A faster way is to use pd.unique. This function uses a hash-table-based algorithm instead of NumPy's sort-based algorithm, so the result comes back in order of first appearance rather than sorted. You will need to pass a 1D array, which ravel() provides:

>>> pd.unique(df[['Col1', 'Col2']].values.ravel())
array(['Bob', 'Joe', 'Steve', 'Bill', 'Mary'], dtype=object)
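
Since pd.unique keeps first-appearance order, the way the 2D array is flattened decides the order of the result. The sketch below illustrates this with a hypothetical sample frame (an assumption, reconstructed to match the output above): the default `ravel()` walks row by row, while `ravel(order="F")` walks column by column.

```python
import pandas as pd

# Hypothetical sample frame (an assumption, not taken from the original post).
df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
})

values = df[["Col1", "Col2"]].values

# Default (row-major) flattening: Col1[0], Col2[0], Col1[1], Col2[1], ...
row_order = pd.unique(values.ravel())

# Column-major flattening: all of Col1 first, then all of Col2.
col_order = pd.unique(values.ravel(order="F"))

print(row_order)
print(col_order)
```

Both calls return the same set of names; only the ordering differs. If a sorted result is needed, the output can be passed through `sorted()` afterwards.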

The difference in speed is significant for larger DataFrames (roughly 12x in the timings below):

>>> df1 = pd.concat([df]*100000) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loops, best of 3: 619 ms per loop

>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel())
10 loops, best of 3: 49.9 ms per loop
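
The `%timeit` lines above only work inside IPython. As a rough sketch of the same comparison in a plain Python script, the standard `timeit` module can be used instead. The sample frame is a hypothetical reconstruction, and the frame is kept smaller (50,000 rows) so the script finishes quickly; absolute timings will differ by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample frame (an assumption, not taken from the original post).
df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
})

# Repeat the 5-row frame 10,000 times to get 50,000 rows.
big = pd.concat([df] * 10_000, ignore_index=True)

runs = 5

# Sort-based approach: np.unique on the 2D array of both columns.
t_np = timeit.timeit(lambda: np.unique(big[["Col1", "Col2"]].values), number=runs)

# Hash-table-based approach: pd.unique on the flattened 1D array.
t_pd = timeit.timeit(
    lambda: pd.unique(big[["Col1", "Col2"]].values.ravel()), number=runs
)

print(f"np.unique: {t_np / runs * 1e3:.1f} ms per call")
print(f"pd.unique: {t_pd / runs * 1e3:.1f} ms per call")

# Sanity check: both approaches find the same set of names.
same_values = set(np.unique(big[["Col1", "Col2"]].values)) == set(
    pd.unique(big[["Col1", "Col2"]].values.ravel())
)
print(same_values)
```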