Updated: 2023-02-26 17:17:27
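The following function tests which values in each column fall below the lower or above the upper quantile given in the argument q (by default the 1st and 3rd quartiles). Note that these cutoffs are the quartiles themselves; Tukey's fences in the usual sense lie 1.5 × IQR beyond the quartiles, so this check flags more values than Tukey's rule would. Depending on the user's preference, the function then either removes every row that contains at least one outlier or replaces the outliers with NA.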
outlier.out <- function(dat, q = c(0.25, 0.75), out = TRUE) {
  # logical matrix marking which cells hold outliers
  tests <- matrix(FALSE, ncol = ncol(dat), nrow = nrow(dat))
  # flag cells below the lower or above the upper quantile;
  # existing NA values are never flagged
  for (i in seq_len(ncol(dat))) {
    x <- dat[, i]
    qq <- quantile(x, q, na.rm = TRUE)
    tests[, i] <- !is.na(x) & (x < qq[1] | x > qq[2])
  }
  if (out) {
    # remove every row that contains at least one outlier
    dat <- dat[!apply(tests, 1, any), , drop = FALSE]
  } else {
    # replace outliers with NA
    dat[tests] <- NA
  }
  return(dat)
}
outlier.out(df1)
#   Var1  var2   var3 var4
# 4  456 44422 215000 0.78
outlier.out(df1, out = FALSE)
#   Var1  var2   var3  var4
# 1   NA    NA     NA 0.983
# 2  110    NA 210000    NA
# 3  200 45465     NA 0.983
# 4  456 44422 215000 0.780
# 5   NA    NA     NA    NA
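
For comparison, here is a minimal sketch of a variant that uses Tukey's fences in the conventional sense, flagging only values below Q1 - k * IQR or above Q3 + k * IQR. The name tukey.out, the argument k, and the data frame toy are hypothetical and introduced only for illustration; they are not part of the original answer, and toy is not the df1 used above.

```r
# Hypothetical variant of outlier.out(): flag values outside Tukey's fences,
# i.e. below Q1 - k * IQR or above Q3 + k * IQR (k = 1.5 by convention).
tukey.out <- function(dat, k = 1.5, out = TRUE) {
  # logical matrix marking which cells hold outliers
  tests <- matrix(FALSE, nrow = nrow(dat), ncol = ncol(dat))
  for (i in seq_len(ncol(dat))) {
    x <- dat[, i]
    qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
    fence <- k * (qq[2] - qq[1])  # k times the interquartile range
    # existing NA values are never flagged
    tests[, i] <- !is.na(x) & (x < qq[1] - fence | x > qq[2] + fence)
  }
  if (out) {
    # remove every row that contains at least one outlier
    dat <- dat[!apply(tests, 1, any), , drop = FALSE]
  } else {
    # replace outliers with NA
    dat[tests] <- NA
  }
  return(dat)
}

# Made-up data for illustration only
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 14))
tukey.out(toy)               # drops row 5, where a = 100 is above the upper fence
tukey.out(toy, out = FALSE)  # keeps all rows and sets a[5] to NA
```

With the default k = 1.5, far fewer cells are flagged than with the quartile cutoffs of outlier.out, so which of the two to use depends on how aggressive the filtering should be.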