且构网

Speeding up index lookups in a large R data frame

Updated: 2023-01-31 09:05:15

Here's a small benchmark comparing data.table with data.frame. If you know data.table's keyed lookup for this case, dt[.(ids)], it's about 7x faster than the data frame approaches, ignoring the cost of setting the key (which is relatively small and would typically be amortised across multiple calls). If you don't know that syntax, data.table is only a little faster, and only when filtering with %in%. (Note that the problem size is somewhat smaller than in the original question, to make it easier to explore.)

library(data.table)
library(microbenchmark)
options(digits = 3)

# Regular data frame
df <- data.frame(id = 1:1e5, x = runif(1e5), y = runif(1e5))

# Data table, with index
dt <- data.table(df)
setkey(dt, "id")

ids <- sample(1e5, 1e4)

microbenchmark(
  df[df$id %in% ids , ], # won't preserve order
  df[match(ids, df$id), ],
  dt[id %in% ids, ],
  dt[match(ids, id), ],
  dt[.(ids)]
)
# Unit: milliseconds
#                     expr   min    lq median    uq   max neval
#     df[df$id %in% ids, ] 13.61 13.99  14.69 17.26 53.81   100
#  df[match(ids, df$id), ] 16.62 17.03  17.36 18.10 21.22   100
#        dt[id %in% ids, ]  7.72  7.99   8.35  9.23 12.18   100
#     dt[match(ids, id), ] 16.44 17.03  17.36 17.77 61.57   100
#               dt[.(ids)]  1.93  2.16   2.27  2.43  5.77   100
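
The comment above notes that %in% won't preserve order, which is easy to miss when only the timings are compared. The following is a minimal sketch on a much smaller problem (the size n, the seed, and the names by_in, by_match and by_key are illustrative assumptions, not part of the benchmark) that checks which lookups return rows in the order of ids:

```r
library(data.table)

set.seed(1)                           # arbitrary seed, only for reproducibility
n  <- 1000                            # assumed small size, for illustration
df <- data.frame(id = 1:n, x = runif(n))
dt <- data.table(df)
setkey(dt, "id")                      # sorts dt by id and marks id as its key

ids <- sample(n, 10)

by_in    <- df[df$id %in% ids, ]      # rows come back in df's own order
by_match <- df[match(ids, df$id), ]   # rows come back in the order of ids
by_key   <- dt[.(ids)]                # keyed join: rows also follow ids

in_is_sorted   <- identical(by_in$id, sort(ids))
match_in_order <- identical(by_match$id, ids)
key_in_order   <- identical(by_key$id, ids)
same_values    <- isTRUE(all.equal(by_key$x, by_match$x))
```

So the keyed lookup is a drop-in replacement for the match() version rather than for the %in% version, which matters if the caller relies on the result lining up with ids.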

I had originally thought you might also be able to do this with rownames, which I assumed built up a hash table and did the indexing efficiently. But that's clearly not the case:

df2 <- df
rownames(df2) <- as.character(df$id)
df2[as.character(ids), ]

microbenchmark(
  df[df$id %in% ids , ], # won't preserve order
  df2[as.character(ids), ],
  times = 1
)
# Unit: milliseconds
#                     expr    min     lq median     uq    max neval
#     df[df$id %in% ids, ]   15.3   15.3   15.3   15.3   15.3     1
# df2[as.character(ids), ] 3609.8 3609.8 3609.8 3609.8 3609.8     1
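
The rowname lookup is slow but not wrong. As a rough check, assuming a deliberately tiny data frame so it runs instantly (n, the seed, and the names via_rownames and via_match are illustrative assumptions), the sketch below confirms it selects the same rows, in the same order, as the match() version:

```r
set.seed(2)                               # arbitrary seed, only for reproducibility
n   <- 1000                               # assumed small size, for illustration
df  <- data.frame(id = 1:n, x = runif(n))
df2 <- df
rownames(df2) <- as.character(df$id)      # same setup as above, just smaller

ids <- sample(n, 10)

via_rownames <- df2[as.character(ids), ]  # lookup by character row name
via_match    <- df[match(ids, df$id), ]   # lookup by integer position

same_ids    <- identical(via_rownames$id, via_match$id)
same_values <- identical(via_rownames$x, via_match$x)
names_kept  <- identical(rownames(via_rownames), as.character(ids))
```

Since both forms return the same rows, the choice between them comes down to speed, and the timings above favour match() by a wide margin.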