Updated: 2021-11-27 23:29:45
@Divakar just posted a very good answer. If you already have the array of categories defined, I'd use @Divakar's answer. If you don't have the unique values defined yet, I'd use mine.
I'd use pd.factorize to factorize the categories, then np.bincount with the weights parameter set to the values array.
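import numpy as np
import pandas as pd
f, u = pd.factorize(valcats)  # f: integer code per element, u: unique categories
np.bincount(f, values).astype(values.dtype)  # sum of values per code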
array([ 1, 12, 7, 14, 13, 8])
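For readers who want to run the idea end to end, here is a minimal sketch. The valcats and values arrays below are hypothetical sample data, not the arrays from the question, so the numbers differ from the output above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the original question).
valcats = np.array([101, 301, 201, 101, 301, 102])
values = np.array([1, 2, 3, 4, 5, 6])

# f holds one integer code per element; u holds the unique categories
# in order of first appearance.
f, u = pd.factorize(valcats)

# bincount adds up the weights that share a code. It returns floats when
# weights are given, so cast back to the dtype of the input values.
sums = np.bincount(f, weights=values).astype(values.dtype)

print(u)     # [101 301 201 102]
print(sums)  # [5 7 3 6]
```

Position i of sums is the total for category u[i], which is why the two arrays can be lined up directly.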
pd.factorize also produces the unique values in the u variable. We can line up the results with u to see that we've arrived at the correct solution.
np.column_stack([u, np.bincount(f, values).astype(values.dtype)])
array([[101,   1],
       [301,  12],
       [201,   7],
       [102,  14],
       [302,  13],
       [202,   8]])
You could also use pd.Series, with u as the index:
f, u = pd.factorize(valcats)
pd.Series(np.bincount(f, values).astype(values.dtype), u)
101     1
301    12
201     7
102    14
302    13
202     8
dtype: int64
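One way to sanity-check the Series version is to compare it against a plain pandas groupby. This is only a sketch on hypothetical sample data (not the arrays from the question), and the groupby call is my cross-check rather than part of the approach above.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the original question).
valcats = np.array([101, 301, 201, 101, 301, 102])
values = np.array([1, 2, 3, 4, 5, 6])

f, u = pd.factorize(valcats)
by_bincount = pd.Series(
    np.bincount(f, weights=values).astype(values.dtype), index=u
)

# sort=False keeps the groups in order of first appearance,
# which matches the order pd.factorize produces.
by_groupby = pd.Series(values).groupby(valcats, sort=False).sum()

print(by_bincount)
print(by_groupby)
```

Both Series should carry the same category labels in the same order with the same sums.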
Why pd.factorize and not np.unique?
We could have used
u, f = np.unique(valcats, return_inverse=True)
However, np.unique sorts the values, which takes O(n log n) time. pd.factorize, on the other hand, does not sort and runs in linear time, so on larger data sets pd.factorize should be noticeably faster. Note also that np.unique returns the unique values in sorted order, whereas pd.factorize returns them in order of first appearance.
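The difference in ordering is easy to see side by side. The sketch below again uses hypothetical sample data rather than the arrays from the question, and it makes no timing claims. It only shows that both routes give the same per-category sums in a different order.

```python
import numpy as np
import pandas as pd

# Hypothetical sample data (not from the original question).
valcats = np.array([101, 301, 201, 101, 301, 102])
values = np.array([1, 2, 3, 4, 5, 6])

# np.unique returns the unique values sorted, plus the inverse indices.
u_sorted, inv = np.unique(valcats, return_inverse=True)
sums_sorted = np.bincount(inv, weights=values).astype(values.dtype)

# pd.factorize returns the unique values in order of first appearance.
codes, u_seen = pd.factorize(valcats)
sums_seen = np.bincount(codes, weights=values).astype(values.dtype)

print(u_sorted, sums_sorted)  # [101 102 201 301] [5 6 3 7]
print(u_seen, sums_seen)      # [101 301 201 102] [5 7 3 6]
```

If sorted output is required, the np.unique route gives it directly. Otherwise the pd.factorize route avoids the sort.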