The most efficient way to convert column values in a Pandas DataFrame

Updated: 2023-11-25 11:09:22

You can use np.where to set your desired value based on a boolean condition:

In [18]:
DF_test['value'] = np.where(DF_test['value'] > threshold, 1, 0)
DF_test

Out[18]:
  c1 c2  value
0  a  p      0
1  b  q      0
2  c  r      1
3  d  s      1
4  e  t      0
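
For anyone who wants to run this outside the session above, here is a minimal self-contained sketch of the same idea. The column values and the `threshold` of 0.5 are made up for illustration (the original data is not shown in full), and the `flag` / `flag_alt` column names are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical data and threshold, chosen only for illustration.
df = pd.DataFrame({
    "c1": list("abcde"),
    "c2": list("pqrst"),
    "value": [0.12, 0.30, 0.85, 0.91, 0.05],
})
threshold = 0.5

# np.where(condition, x, y) picks x where the condition is True, else y.
df["flag"] = np.where(df["value"] > threshold, 1, 0)

# An equivalent pandas-only spelling: cast the boolean mask to int.
df["flag_alt"] = (df["value"] > threshold).astype(int)

print(df)
```

Writing to a new column rather than overwriting `value` is just a choice made here so the input stays visible next to the result.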

Note that because your data is a heterogeneous NumPy array, the 'value' column contains strings rather than floats:

In [58]:
DF_test.iloc[0]['value']

Out[58]:
'0.12'

So you'll need to convert the dtype to float first: DF_test['value'] = DF_test['value'].astype(float)
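
A small sketch of why the conversion is needed, assuming the frame was built from a mixed NumPy array as described above. The rows and the `threshold` value are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical rows: mixing letters and numbers in one NumPy array
# makes NumPy store every element as a string.
data = np.array([["a", "p", "0.12"], ["b", "q", "0.30"], ["c", "r", "0.85"]])
df = pd.DataFrame(data, columns=["c1", "c2", "value"])

first = df.iloc[0]["value"]
print(repr(first), isinstance(first, str))  # the value is text, not a float

# Convert the column to float before comparing against a numeric threshold.
df["value"] = df["value"].astype(float)

threshold = 0.5  # assumed value, for illustration only
flags = np.where(df["value"] > threshold, 1, 0)
print(flags)
```

If some entries might not parse as numbers, `pd.to_numeric(df['value'], errors='coerce')` is an alternative that turns them into NaN instead of raising.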

You can compare the timings:

In [16]:
%timeit np.where(DF_test['value'] > threshold, 1, 0)
1000 loops, best of 3: 297 µs per loop

In [17]:
%%timeit
# Note: .ix was removed in pandas 1.0 and DataFrame.append in pandas 2.0,
# so this loop only runs on older pandas versions.
DF_naive = pd.DataFrame()
for i in range(DF_test.shape[0]):
    # Get the first two columns of row i
    first2cols = list(DF_test.ix[i][:-1])
    # Check whether the value is greater than the threshold
    binary_value = [int((bool(float(DF_test.ix[i][-1]) > threshold)))]
    # Create a Series for the new row
    SR_row = pd.Series(first2cols + binary_value, name=i)
    # Append the row to the (initially empty) DataFrame
    DF_naive = DF_naive.append(SR_row)
10 loops, best of 3: 39.3 ms per loop

The np.where version is over 100x faster. Admittedly, your code is doing a lot of unnecessary work, but you get the point.
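
If you want to repeat the comparison on a current pandas release, here is a rough sketch using the standard `timeit` module. It is not the code timed above: the row-by-row function is a stand-in written with `itertuples`, the data is random, and `vectorized`, `row_by_row`, and `threshold` are names invented for this example. Absolute numbers will differ by machine and version.

```python
import timeit

import numpy as np
import pandas as pd

# Hypothetical test data: 1000 rows with a random float column.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "c1": ["a"] * n,
    "c2": ["p"] * n,
    "value": rng.random(n),
})
threshold = 0.5  # assumed value, for illustration only


def vectorized(frame):
    # One vectorized comparison over the whole column.
    return np.where(frame["value"] > threshold, 1, 0)


def row_by_row(frame):
    # Stand-in for the naive loop: build each output row in Python.
    rows = []
    for row in frame.itertuples(index=False):
        rows.append((row.c1, row.c2, int(row.value > threshold)))
    return pd.DataFrame(rows, columns=frame.columns)


# Both approaches should produce the same flags.
same = bool((vectorized(df) == row_by_row(df)["value"].to_numpy()).all())

t_vec = timeit.timeit(lambda: vectorized(df), number=20)
t_loop = timeit.timeit(lambda: row_by_row(df), number=20)
print(f"same result: {same}")
print(f"vectorized: {t_vec:.4f}s, row by row: {t_loop:.4f}s (20 runs each)")
```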