Idiom for dropping a single column in a data.table

Updated: 2023-12-01 18:17:46

I need to drop one column from a data.frame containing a few hundred columns.

With a data.frame, I'd use subset to do this conveniently:

> dat <- data.table( data.frame(x=runif(10),y=rep(letters[1:5],2),z=runif(10)),key='y' )
> subset(dat,select=c(-z))
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e

Obviously this still works, but it doesn't seem like a very data.table-like idiom. I could manually construct a list of the column names I want to keep, which seems a little more data.table-like:

> dat[,list(x,y)]
            x y
 1: 0.1969049 a
 2: 0.7916696 a
 3: 0.9095970 b
 4: 0.3529506 b
 5: 0.4923602 c
 6: 0.5993034 c
 7: 0.1559861 d
 8: 0.9929333 d
 9: 0.3980169 e
10: 0.1921226 e

But then I have to construct such a list, which is clunky.

Is subset the proper way to conveniently drop a column or two, or does it cause a performance hit? If not, what's the better way?

Edit

Benchmarks:

> dat <- data.table( data.frame(x=runif(10^7),y=rep(letters[1:10],10^6),z=runif(10^7)),key='y' )
> microbenchmark( subset(dat,select=c(-z)), dat[,list(x,y)] )
Unit: milliseconds
                         expr       min        lq    median        uq      max
1           dat[, list(x, y)] 102.62826 167.86793 170.72847 199.89789 792.0207
2 subset(dat, select = c(-z))  33.26356  52.55311  53.53934  55.00347 180.8740

But where it may really matter more is memory, if subset copies the whole data.table.

If you want to remove the column permanently, use := NULL:

dat[, z := NULL]

If the names of the columns to drop are stored in a character vector, wrap the variable in () so that it is evaluated to its contents rather than treated as a literal column name:

toDrop <- c('z')

dat[, (toDrop) := NULL]
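
As a minimal sketch of the two forms above, assuming the data.table package is installed: the small table and the variable `remaining` are made up for illustration, and only the column names mirror the question.

```r
library(data.table)

# Small illustrative table; column names mirror the question's example.
dat <- data.table(x = 1:4, y = c("a", "a", "b", "b"), z = 5:8)

# Drop a single column by reference; dat itself is modified.
dat[, z := NULL]

# Drop the columns named in a character vector. The parentheses make
# data.table evaluate toDrop rather than look for a column called "toDrop".
toDrop <- c("x")
dat[, (toDrop) := NULL]

# Only the column that was never dropped is left.
remaining <- names(dat)
print(remaining)
```

Because := works by reference, there is no need to assign the result back to dat.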

If you want to limit the availability of the columns in .SD, you can pass the .SDcols argument

dat[,lapply(.SD, somefunction) , .SDcols = setdiff(names(dat),'z')]

However, data.table inspects the j argument and only loads the columns you actually use anyway. See FAQ 1.12:

When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses.

and doesn't try to load all the data for .SD (unless you have .SD within your call to j).
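
The .SDcols approach can be sketched as follows, assuming the data.table package is installed. The table is invented, with numeric columns only so that `mean` can stand in for the placeholder `somefunction`, and `res` is a made-up name.

```r
library(data.table)

# Illustrative numeric table (hypothetical data, not from the question).
dat <- data.table(x = c(1, 2, 3, 4), w = c(2, 4, 6, 8), z = c(10, 20, 30, 40))

# Apply a function to every column except 'z' by restricting .SD.
# mean() stands in for the placeholder 'somefunction'.
res <- dat[, lapply(.SD, mean), .SDcols = setdiff(names(dat), "z")]
print(res)

# dat is unchanged: 'z' was only excluded from .SD, not deleted.
print(names(dat))
```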


subset.data.table processes the call and eventually evaluates dat[, c('x','y'), with=FALSE].
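
For comparison, here is a rough sketch of doing that selection directly, without going through subset. It assumes the data.table package is installed; the table and the names `keep` and `smaller` are illustrative.

```r
library(data.table)

# Illustrative table; column names mirror the question's example.
dat <- data.table(x = 1:4, y = c("a", "a", "b", "b"), z = 5:8)

# Build the vector of column names to keep, then select them.
# with = FALSE tells data.table to treat j as column names, not an expression.
keep <- setdiff(names(dat), "z")
smaller <- dat[, keep, with = FALSE]

print(names(smaller))  # columns of the new table
print(names(dat))      # the original still contains 'z'
```

Unlike := NULL, this returns a new data.table and leaves dat untouched.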

Using := NULL should be basically instantaneous; however, it does permanently delete the column.