Remove columns of a data frame based on conditions in R

Updated: 2023-11-18 23:01:16

I feel like this is all over-complicated. Condition 2 already covers the rest of the conditions: if a column has at least two non-NA values, then obviously the whole column is not NA, and if a column has at least two consecutive values, then obviously it contains more than one value. So instead of 3 conditions, this all boils down to a single one. (I prefer not to run many functions per column; instead, after running diff per column, I vectorize the whole thing.)

cond <- colSums(is.na(sapply(df, diff))) < nrow(df) - 1

This works because diff returns NA wherever either of the two neighbouring values is NA, so if a column has no two consecutive non-NA values, its entire diff becomes NA.
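To make the reasoning concrete, here is a minimal sketch on a made-up data frame. The names toy, d and keep and the values are assumptions for illustration only, not the data from the question.

```r
# Hypothetical toy data, not the data from the question
toy <- data.frame(
  a = c(1, 2, NA, 4),                          # 1 and 2 are consecutive non-NA values
  b = c(1, NA, 3, NA),                         # non-NA values never sit next to each other
  c = c(NA_real_, NA_real_, NA_real_, NA_real_) # entirely NA
)

# diff() yields NA wherever either neighbouring value is NA;
# sapply() binds the per-column results into a matrix
d <- sapply(toy, diff)
print(d)

# Keep a column when at least one of its differences is not NA
keep <- colSums(is.na(d)) < nrow(toy) - 1
print(keep)

# Only column a survives
print(toy[, keep, drop = FALSE])
```

Note that colSums needs a matrix, which sapply returns here because each diff has more than one element; a data frame with fewer than three rows would need separate handling.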

Then, just subset:

df[, cond, drop = FALSE]
#        A     E
# 1  0.018    NA
# 2  0.017    NA
# 3  0.019    NA
# 4  0.018    NA
# 5  0.018    NA
# 6  0.015 0.037
# 7  0.016 0.031
# 8  0.019 0.025
# 9  0.016 0.035
# 10 0.018 0.035
# 11 0.017 0.043
# 12 0.023 0.040
# 13 0.022 0.042

Per your edit, it seems you have a data.table object and also a Date column, so the code needs some modifications.

cond <- df[, lapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1, .SDcols = -1] 
df[, c(TRUE, cond), with = FALSE]

Some explanations:

  • We want to ignore the first column in our calculations, so we specify .SDcols = -1 when operating on .SD (which stands for Subset of Data in data.table terminology)
  • .N is just the row count (similar to nrow(df))
  • The next step is to subset by the condition. We must not forget to grab the first column too, so we start with c(TRUE, ...)
  • Finally, data.table uses non-standard evaluation by default; hence, if you want to select columns as you would in a data.frame, you need to specify with = FALSE
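The steps above can be tried on a small made-up table. This is only a sketch: toy and keep are hypothetical names, the values are invented, and sapply is used in place of lapply so that the result is plainly a named logical vector.

```r
library(data.table)

# Hypothetical toy data with a leading Date column
toy <- data.table(
  Date = as.Date("2020-01-01") + 0:3,
  a = c(1, 2, NA, 4),   # has two consecutive non-NA values
  b = c(1, NA, 3, NA)   # has none
)

# For every column except the first, test whether at least one
# difference is not NA; .N is the number of rows
keep <- toy[, sapply(.SD, function(x) sum(is.na(diff(x)))) < .N - 1,
            .SDcols = -1]
print(keep)

# Prepend TRUE so that Date is always kept, then select as in a data.frame
print(toy[, c(TRUE, keep), with = FALSE])
```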

A better way, though, would be to remove the columns by reference using := NULL

cond <- c(FALSE, df[, lapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1, .SDcols = -1])
df[, which(cond) := NULL]
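As a rough illustration of removal by reference, the same idea on a made-up table might look like this. The names toy and to_drop and the guard around the assignment are assumptions added for the sketch; the guard simply skips the call when nothing needs removing.

```r
library(data.table)

# Hypothetical toy data with a leading Date column
toy <- data.table(
  Date = as.Date("2020-01-01") + 0:3,
  a = c(1, 2, NA, 4),   # has two consecutive non-NA values
  b = c(1, NA, 3, NA)   # has none, so it should be dropped
)

# TRUE marks a column to drop; the leading FALSE protects Date
to_drop <- c(FALSE,
             toy[, sapply(.SD, function(x) sum(is.na(diff(x)))) == .N - 1,
                 .SDcols = -1])

# := NULL deletes the columns in place, without copying the table
if (any(to_drop)) toy[, which(to_drop) := NULL]

print(names(toy))
```

Because the table is modified in place, there is no need to assign the result back to toy.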