Identifying outliers in a data frame in R

Updated: 2023-02-26 17:17:27

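The following function tests which values in each column fall below the lower or above the upper quantile given in q (by default the 1st and 3rd quartiles), ignoring existing NA values. Note that these cut-offs are stricter than Tukey's fences proper, which lie 1.5 times the interquartile range beyond the quartiles. Depending on the user's preference, the function then either removes every row that contains at least one outlier (out = TRUE) or replaces the outliers with NA (out = FALSE).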

outlier.out <- function(dat, q = c(0.25, 0.75), out = TRUE){
    # logical matrix marking which cells hold outliers
    tests <- matrix(NA, ncol = ncol(dat), nrow = nrow(dat))
    # test which cells contain outliers, ignoring existing NA values
    for(i in seq_len(ncol(dat))){
        qq <- quantile(dat[[i]], q, na.rm = TRUE)
        # isTRUE() turns NA comparisons into FALSE
        tests[, i] <- sapply(dat[[i]] < qq[1] | dat[[i]] > qq[2], isTRUE)
    }
    if(out){
        # remove rows that contain at least one outlier
        # (drop = FALSE keeps a one-column result as a data frame)
        dat <- dat[!apply(tests, 1, FUN = any, na.rm = TRUE), , drop = FALSE]
    } else {
        # replace outliers with NA
        dat[tests] <- NA
    }
    return(dat)
}

outlier.out(df1)
#   Var1  var2   var3 var4
# 4  456 44422 215000 0.78


outlier.out(df1, out = FALSE)
#   Var1  var2   var3  var4
# 1   NA    NA     NA 0.983
# 2  110    NA 210000    NA
# 3  200 45465     NA 0.983
# 4  456 44422 215000 0.780
# 5   NA    NA     NA    NA
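
For comparison, here is a minimal sketch of a variant that uses Tukey's fences in the usual sense, flagging only values below Q1 - k * IQR or above Q3 + k * IQR. The function name tukey.out, the parameter k and the data frame toy are hypothetical and not part of the original answer. The sketch assumes all columns are numeric.

```r
# Hypothetical variant (not from the original answer): flags values outside
# Tukey's fences, i.e. below Q1 - k * IQR or above Q3 + k * IQR.
tukey.out <- function(dat, k = 1.5, out = TRUE) {
    # logical matrix marking which cells hold outliers
    tests <- matrix(FALSE, nrow = nrow(dat), ncol = ncol(dat))
    for (i in seq_len(ncol(dat))) {
        x <- dat[[i]]
        qq <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
        fence <- k * (qq[2] - qq[1])   # k times the interquartile range
        # existing NA values are never flagged
        tests[, i] <- !is.na(x) & (x < qq[1] - fence | x > qq[2] + fence)
    }
    if (out) {
        # remove rows that contain at least one outlier
        dat[!apply(tests, 1, any), , drop = FALSE]
    } else {
        # replace outliers with NA
        dat[tests] <- NA
        dat
    }
}

# Made-up example data: only the value 100 lies outside the fences
toy <- data.frame(a = c(1, 2, 3, 4, 100), b = c(10, 11, 12, 13, 14))
tukey.out(toy)               # drops the row containing 100
tukey.out(toy, out = FALSE)  # keeps all rows, sets that cell to NA
```

With k = 1.5, column a has quartiles 2 and 4, so the fences are -1 and 7 and only 100 is flagged. The quantile-based outlier.out above would instead flag the smallest and largest value of every column in this data.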