Last updated: 2023-02-16 15:38:34
One way is to select the columns and pass them to np.unique:
>>> np.unique(df[['Col1', 'Col2']])
array(['Bill', 'Bob', 'Joe', 'Mary', 'Steve'], dtype=object)
Note that some versions of pandas/NumPy may require you to pass the underlying values from the columns explicitly, using the .values attribute:
>>> np.unique(df[['Col1', 'Col2']].values)
A faster way is to use pd.unique. This function uses a hashtable-based algorithm instead of NumPy's sort-based algorithm, so it returns values in order of first appearance rather than sorted. It expects a 1D array, so flatten the selected columns with ravel():
>>> pd.unique(df[['Col1', 'Col2']].values.ravel())
array(['Bob', 'Joe', 'Steve', 'Bill', 'Mary'], dtype=object)
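The two calls above can be put side by side in a small self-contained script. This is only a sketch: the DataFrame contents are hypothetical sample data (the original df is not shown here), and to_numpy(dtype=object) is used in place of .values as an assumption that a reasonably recent pandas is installed.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data; the original df is not shown in this answer.
df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
    "Col3": np.random.random(5),
})

# Flatten the two columns, row by row, into a single 1D object array.
flat = df[["Col1", "Col2"]].to_numpy(dtype=object).ravel()

sorted_vals = np.unique(flat)   # sort-based: the result comes back sorted
ordered_vals = pd.unique(flat)  # hash-based: keeps first-appearance order

print(sorted_vals)
print(ordered_vals)
```

If the order of the result matters, that difference is worth keeping in mind: np.unique sorts, while pd.unique does not.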
The difference in speed is significant for larger DataFrames:
>>> df1 = pd.concat([df]*100000) # DataFrame with 500000 rows
>>> %timeit np.unique(df1[['Col1', 'Col2']].values)
1 loops, best of 3: 619 ms per loop
>>> %timeit pd.unique(df1[['Col1', 'Col2']].values.ravel())
10 loops, best of 3: 49.9 ms per loop
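The %timeit lines above only work in IPython. Outside it, a rough equivalent can be sketched with the standard timeit module. The sample data and the smaller row count are assumptions chosen to keep the run short, and the timings will vary by machine and library version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical sample data standing in for the original df.
df = pd.DataFrame({
    "Col1": ["Bob", "Joe", "Bill", "Mary", "Joe"],
    "Col2": ["Joe", "Steve", "Bob", "Bob", "Steve"],
})

# Repeat the 5-row frame to get 100000 rows (smaller than the run above).
df1 = pd.concat([df] * 20000, ignore_index=True)
values = df1[["Col1", "Col2"]].to_numpy(dtype=object)

# Best of 3 single runs for each approach, in seconds.
t_np = min(timeit.repeat(lambda: np.unique(values), number=1, repeat=3))
t_pd = min(timeit.repeat(lambda: pd.unique(values.ravel()), number=1, repeat=3))

print(f"np.unique: {t_np * 1000:.1f} ms")
print(f"pd.unique: {t_pd * 1000:.1f} ms")
```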