R: Loop over columns in a data.table

Updated: 2023-01-17 12:47:55

I have briefly investigated, and it looks like a data.table bug.

> DT = data.table(a=1:1e6,b=1:1e6,c=1:1e6,d=1:1e6)
> Rprofmem()
> sapply(DT,class)
        a         b         c         d 
"integer" "integer" "integer" "integer" 
> Rprofmem(NULL)
> noquote(readLines("Rprofmem.out"))
[1] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"       
[2] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply" 
[3] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply"   
[4] 4000040 :"as.list.data.table" "as.list" "lapply" "sapply" 

> tracemem(DT)
> sapply(DT,class)
tracemem[000000000431A290 -> 00000000065D70D8]: as.list.data.table as.list lapply sapply 
        a         b         c         d 
"integer" "integer" "integer" "integer" 

So, look at as.list.data.table:

> data.table:::as.list.data.table
function (x, ...) 
{
    ans <- unclass(x)
    setattr(ans, "row.names", NULL)
    setattr(ans, "sorted", NULL)
    setattr(ans, ".internal.selfref", NULL)
    ans
}
<environment: namespace:data.table>
> 

Note the pesky unclass on the first line. ?unclass confirms that it takes a deep copy of its argument. From this quick look it doesn't seem like sapply or lapply are doing the copying (I didn't think they did since R is good at copy-on-write, and those aren't writing), but rather the as.list in lapply (which dispatches to as.list.data.table).
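The dispatch path described above (sapply calls lapply, which calls as.list, which dispatches on the class) can be checked on a toy object in base R. This is only a sketch: the class name toy_table, the method as_list_toy_table and the counter are hypothetical names made up for illustration and are not part of data.table.

```r
# Counter kept in an environment so the method can update it by reference.
counter <- new.env()
counter$n <- 0L

# Hypothetical as.list method for a made-up class "toy_table".
# It mimics the first step of as.list.data.table shown above (unclass).
as_list_toy_table <- function(x, ...) {
  counter$n <- counter$n + 1L   # record that the method was dispatched
  unclass(x)
}
registerS3method("as.list", "toy_table", as_list_toy_table)

# A classed list, standing in for a data.table.
tt <- structure(list(a = 1:3, b = 4:6), class = "toy_table")

res <- sapply(tt, class)        # goes through lapply, then as.list
print(counter$n)                # how many times the method ran
print(res)
```

If the counter is non-zero after the call, the as.list method was reached through lapply, which is the same route the Rprofmem output points at.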

So, if we avoid the unclass, it should speed up. Let's try:

> DT = data.table(a=1:1e7,b=1:1e7,c=1:1e7,d=1:1e7)
> system.time(sapply(DT,class))
   user  system elapsed 
   0.28    0.06    0.35 
> system.time(sapply(DT,class))  # repeat timing a few times and take minimum
   user  system elapsed 
   0.17    0.00    0.17 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.13    0.04    0.18 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.14    0.03    0.17 
> assignInNamespace("as.list.data.table",function(x)x,"data.table")
> data.table:::as.list.data.table
function(x)x
> system.time(sapply(DT,class))
   user  system elapsed 
      0       0       0 
> system.time(sapply(DT,class))
   user  system elapsed 
   0.01    0.00    0.02 
> system.time(sapply(DT,class))
   user  system elapsed 
      0       0       0 
> sapply(DT,class)
        a         b         c         d 
"integer" "integer" "integer" "integer" 
> 

So, yes, infinitely better.
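The "repeat timing a few times and take minimum" step from the session above could be wrapped in a small helper. This is a rough sketch in base R; min_elapsed is a hypothetical name, and the plain list used here is only a stand-in for the data.table in the session.

```r
# Hypothetical helper: call f() several times and keep the fastest elapsed time.
min_elapsed <- function(f, times = 5L) {
  elapsed <- vapply(
    seq_len(times),
    function(i) system.time(f())[["elapsed"]],
    numeric(1)
  )
  min(elapsed)
}

# Stand-in data: a plain list of integer columns (an assumption, not the original DT).
cols <- list(a = 1:1e5, b = 1:1e5, c = 1:1e5, d = 1:1e5)

t_min <- min_elapsed(function() sapply(cols, class))
print(t_min)
```

Taking the minimum rather than the mean reduces the influence of one-off effects such as garbage collection on the first run.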

I've raised bug report #2000 to remove the as.list.data.table method, since a data.table is() already a list, too. This might actually speed up quite a few idioms, such as lapply(.SD,...).
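To illustrate why a dedicated as.list method is not strictly needed for an object that is already a list, here is a small base R sketch. The class name toy_cols is hypothetical and has no as.list method of its own, so as.list falls through to the default method.

```r
# A classed list with no as.list method registered for its class
# ("toy_cols" is a made-up class name used only for this sketch).
tc <- structure(list(a = 1:3, b = c("x", "y", "z")), class = "toy_cols")

is_list <- is.list(tc)                    # the object is already a list
unchanged <- identical(as.list(tc), tc)   # default method hands a list back as-is

# Looping over the columns still works without any class-specific method.
col_classes <- vapply(tc, function(col) class(col)[1L], character(1))

print(is_list)
print(unchanged)
print(col_classes)
```

Whether removing the method is safe for data.table itself depends on code that expects a plain list back, so treat this only as an illustration of the idea behind the bug report.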

Thanks for asking this question!