
How do I remove outliers in Python?

Updated: 2023-02-26 17:37:58

Your code fails because it tries to compute z-scores on categorical columns.

To avoid this, first split train into its numerical and categorical columns:

num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

Only then compute a boolean mask of the rows to keep:

# Compare absolute z-scores so that unusually low values are caught as well
idx = (np.abs(stats.zscore(num_train)) < 3).all(axis=1)

Finally, concatenate the two parts again:

train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
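
The three steps above can be put together as a minimal, self-contained sketch. The toy DataFrame and its column names (price, rooms, city) are made up for illustration and are not part of the original question; the z-scores are compared in absolute value, and the numeric part is passed to SciPy as a plain float array.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Hypothetical toy data: 20 ordinary rows plus one extreme "price".
train = pd.DataFrame({
    "price": [100.0 + (i % 5) for i in range(20)] + [1000.0],
    "rooms": [1 + (i % 4) for i in range(20)] + [3],
    "city": ["A", "B", "C"] * 7,
})

# Step 1: split into numerical and categorical columns.
num_train = train.select_dtypes(include=["number"])
cat_train = train.select_dtypes(exclude=["number"])

# Step 2: boolean mask of rows whose |z-score| is below 3 in every numeric column.
z = np.abs(stats.zscore(num_train.to_numpy(dtype=float)))
idx = (z < 3).all(axis=1)

# Step 3: put the two parts back together, keeping only the masked rows.
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)

print(train.shape, train_cleaned.shape)
```

Note that a column containing NaN, or a constant column, yields NaN z-scores, and the comparison then rejects every row; such columns would need to be handled before this step.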

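For the IQR part (note that Q1 and Q3 here are the 2nd and 98th percentiles, not the usual quartiles at 0.25 and 0.75):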

Q1 = num_train.quantile(0.02)
Q3 = num_train.quantile(0.98)
IQR = Q3 - Q1
idx = ~((num_train < (Q1 - 1.5 * IQR)) | (num_train > (Q3 + 1.5 * IQR))).any(axis=1)
train_cleaned = pd.concat([num_train.loc[idx], cat_train.loc[idx]], axis=1)
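
To see how much the choice of quantiles matters, here is a small sketch that wraps the same fence logic in a helper. The function name drop_iqr_outliers, its parameters, and the toy data are assumptions for illustration only; the helper filters the original frame with the mask directly, which keeps the same rows as splitting and concatenating.

```python
import pandas as pd

def drop_iqr_outliers(df, lower_q=0.25, upper_q=0.75, k=1.5):
    """Hypothetical helper: drop rows with any numeric value outside the fences."""
    num = df.select_dtypes(include=["number"])
    q_low = num.quantile(lower_q)
    q_high = num.quantile(upper_q)
    spread = q_high - q_low
    # True wherever a value lies below the lower fence or above the upper fence.
    outside = (num < (q_low - k * spread)) | (num > (q_high + k * spread))
    return df.loc[~outside.any(axis=1)]

# Hypothetical toy data: 20 ordinary rows plus one extreme "price".
train = pd.DataFrame({
    "price": [100.0 + (i % 5) for i in range(20)] + [1000.0],
    "rooms": [1 + (i % 4) for i in range(20)] + [3],
    "city": ["A", "B", "C"] * 7,
})

cleaned_quartiles = drop_iqr_outliers(train)  # conventional 0.25 / 0.75
cleaned_wide = drop_iqr_outliers(train, lower_q=0.02, upper_q=0.98)  # cut-offs used above

print(len(train), len(cleaned_quartiles), len(cleaned_wide))
```

On this small sample the conventional quartiles drop the extreme row, while the 0.02/0.98 cut-offs keep it, because the extreme value itself stretches the upper quantile. Which setting is appropriate depends on the data.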

Please let us know if you have any further questions.

PS

You might also consider pandas.DataFrame.clip, which caps outlying values at chosen bounds on a per-value basis instead of dropping the whole row.
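
A minimal sketch of the clipping approach, assuming per-column bounds at the 2nd and 98th percentiles (the bounds, the toy data, and the variable names are illustrative choices, not something the original question specifies):

```python
import pandas as pd

# Hypothetical toy data: 20 ordinary rows plus one extreme "price".
train = pd.DataFrame({
    "price": [100.0 + (i % 5) for i in range(20)] + [1000.0],
    "rooms": [1 + (i % 4) for i in range(20)] + [3],
    "city": ["A", "B", "C"] * 7,
})

num_cols = train.select_dtypes(include=["number"]).columns

# Assumed bounds: one lower and one upper limit per numeric column.
lower = train[num_cols].quantile(0.02)
upper = train[num_cols].quantile(0.98)

# axis=1 aligns the per-column bounds with the DataFrame's columns.
train_clipped = train.copy()
train_clipped[num_cols] = (
    train[num_cols].astype(float).clip(lower=lower, upper=upper, axis=1)
)

print(train["price"].max(), train_clipped["price"].max())
```

Every row is kept; only values beyond the bounds are replaced by the bound, and the categorical columns are left untouched.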