Updated: 2023-12-01 18:17:46
I need to drop one column from a data.frame containing a few hundred columns.
With a data.frame, I'd use subset to do this conveniently:
> dat <- data.table( data.frame(x=runif(10),y=rep(letters[1:5],2),z=runif(10)),key='y' )
> subset(dat,select=c(-z))
x y
1: 0.1969049 a
2: 0.7916696 a
3: 0.9095970 b
4: 0.3529506 b
5: 0.4923602 c
6: 0.5993034 c
7: 0.1559861 d
8: 0.9929333 d
9: 0.3980169 e
10: 0.1921226 e
Obviously this still works, but it doesn't seem like a very data.table-like idiom. I could manually construct a list of the column names I want to keep, which seems a little more data.table-like:
> dat[,list(x,y)]
x y
1: 0.1969049 a
2: 0.7916696 a
3: 0.9095970 b
4: 0.3529506 b
5: 0.4923602 c
6: 0.5993034 c
7: 0.1559861 d
8: 0.9929333 d
9: 0.3980169 e
10: 0.1921226 e
But then I have to construct such a list, which is clunky.
Is subset the proper way to conveniently drop a column or two, or does it cause a performance hit? If not, what's the better way?
Edit
Benchmarks:
> dat <- data.table( data.frame(x=runif(10^7),y=rep(letters[1:10],10^6),z=runif(10^7)),key='y' )
> microbenchmark( subset(dat,select=c(-z)), dat[,list(x,y)] )
Unit: milliseconds
expr min lq median uq max
1 dat[, list(x, y)] 102.62826 167.86793 170.72847 199.89789 792.0207
2 subset(dat, select = c(-z)) 33.26356 52.55311 53.53934 55.00347 180.8740
But where it may really matter more is memory, if subset copies the whole data.table.
If you want to remove the column permanently, use := NULL:
dat[, z := NULL]
If the names of the columns to drop are stored in a character vector, wrap the variable in () so that it is evaluated as a vector of column names rather than taken as a literal column name:
toDrop <- c('z')
dat[, (toDrop) := NULL]
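As a minimal sketch of the two forms above, here is a small table similar to the one in the question. The extra column w is hypothetical and only exists so that there is something left to drop by name.

```r
library(data.table)

# Small illustrative table; x, y, z mirror the question, w is a hypothetical extra column.
dat <- data.table(x = runif(10), y = rep(letters[1:5], 2), z = runif(10), w = 1:10)

# Drop a single column by reference; dat itself is modified.
dat[, z := NULL]

# Drop columns whose names are held in a character vector.
# The parentheses make data.table evaluate toDrop instead of
# treating it as a column called "toDrop".
toDrop <- c("w")
dat[, (toDrop) := NULL]

names(dat)
```

Because both calls modify dat in place, there is no need to assign the result back to dat.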
If you want to limit which columns are available in .SD, you can pass the .SDcols argument:
dat[,lapply(.SD, somefunction) , .SDcols = setdiff(names(dat),'z')]
However, data.table inspects the j argument and only retrieves the columns you use anyway. See FAQ 1.12:
"When you write X[Y,sum(foo*bar)], data.table automatically inspects the j expression to see which columns it uses."
So it doesn't try to load all the data for .SD (unless you have .SD within your call to j).
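To make the .SDcols form concrete, here is a small sketch. uniqueN (a data.table function that counts distinct values) stands in for somefunction; that choice and the toy data are assumptions for illustration only.

```r
library(data.table)

# Toy data; the values are made up for this sketch.
dat <- data.table(x = c(1, 2, 3, 4),
                  y = c("a", "a", "b", "b"),
                  z = c(10, 20, 30, 40))

# Apply a function to every column except z.
# uniqueN plays the role of `somefunction` here.
res <- dat[, lapply(.SD, uniqueN), .SDcols = setdiff(names(dat), "z")]

res
```

The result has one column per entry in .SDcols, so z never appears in it, and dat itself is left unchanged.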
subset.data.table processes the call and eventually evaluates dat[, c('x','y'), with=FALSE].
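If you want the same result without going through subset, a sketch along these lines builds the vector of names to keep and passes it with with=FALSE. The variable names keep and kept are hypothetical.

```r
library(data.table)

dat <- data.table(x = runif(10), y = rep(letters[1:5], 2), z = runif(10), key = "y")

# Names of the columns to keep: everything except z.
keep <- setdiff(names(dat), "z")

# with = FALSE tells data.table to treat `keep` as a vector of
# column names rather than as an expression evaluated in j.
kept <- dat[, keep, with = FALSE]

names(kept)
names(dat)  # the original table still has z
```

Unlike := NULL, this returns a new table and leaves dat untouched.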
Using := NULL should be basically instantaneous; however, it does permanently delete the column.
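If the original table has to survive, one option is to delete the column from a copy instead. This is only a sketch: copy() duplicates the whole table first, so it gives up the memory advantage of deleting by reference. The name trimmed is hypothetical.

```r
library(data.table)

dat <- data.table(x = 1:3, y = c("a", "b", "c"), z = c(0.1, 0.2, 0.3))

# copy() makes a full copy; := NULL then removes z from the copy only.
trimmed <- copy(dat)[, z := NULL]

names(trimmed)
names(dat)  # z is still present in the original
```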