在数据表中快速EXISTS

更新时间：2023-12-01 18:21:52

简短答案：除了 nomatch = 0 ，我认为 mult =first 。

长答案：假设您想检查一个值（或多于1个值）是否存在于 data.table 或者不，这似乎要快得多。这里唯一的假设是data.table只有一个键列（因为这对我很含糊）。

  my .values = c（1：100,1000）
 require（data.table）
 set.seed（45）
 DT  setkey（DT，V1）
＃data.table way 
 system.time（all ％in％.subset2（DT [J（my.values），mult =first，nomatch = 0]，V1）））
用户系统已过
 0.006 0.000 0.006 

＃向量（扫描）方法
 system.time（all（my.values％in％.subset2（DT，V1）））
用户系统已过
 0.037 0.000 0.038

您可以将全部更改为 any 如果要检查子集中是否至少存在1个值。两者之间的唯一区别是，您首先 $ c>使用 data.table 的方法（利用键和 mult 参数）。正如你可以看到它的速度非常快（也很好地缩放）。然后从子集中检索键列（称为 the_subset ），

  .subset2（the_subset，V1）（或）the_subset $ V1（或）the_subset [[V1]]

但是， the_subset [，V1] 会慢一些。

同样的想法可以扩展到许多列，但我必须知道你想要做什么后。

What is the fastest way to check if a value exists in a data.table?. Suppose that

dt is a data.table of n columns with k columns being the key
keys is a list, or a value, or a data.table, or anything that can be used in the i argument of [.data.table

I'm currently doing

NROW(dt[keys,nomatch=0])!=0

Is there anything faster?

Example

require(data.table)
iniDate = as.Date("2000-01-01")
theData = data.table(a = LETTERS, d = seq(from=iniDate ,to= iniDate + length(LETTERS)*3000-1,by="day"))
theKeys = data.table(a = c("J","M","T"), d = as.Date(c("2005-01-20","2005-05-20","2013-01-12")))
setkey(theData,a,d)
NROW(theData[theKeys],nomatch=0)!=0

Short answer: In addition to nomatch=0, I think mult="first" would help speed it even more.

Long answer: Assuming that you want to check if a value (or more than 1 value) is present in the key column of a data.table or not, this seems to be much faster. The only assumption here is that the data.table has only 1 key column (as this is quite ambiguous to me).

my.values = c(1:100, 1000)
require(data.table)
set.seed(45)
DT <- as.data.table(matrix(sample(2e4, 1e6*100, replace=TRUE), ncol=100))
setkey(DT, "V1")
# the data.table way
system.time(all(my.values %in% .subset2(DT[J(my.values), mult="first", nomatch=0], "V1")))
   user  system elapsed 
  0.006   0.000   0.006 

# vector (scan) approach
system.time(all(my.values %in% .subset2(DT, "V1")))
   user  system elapsed 
  0.037   0.000   0.038

You can change all to any if you want to check if at least 1 value is present in the subset or not. The only difference between the two is that you first subset using data.table's approach (taking advantage of key and mult argument). As you can see the it's extremely faster (and also scales well). And then to retrieve the key columns from the subset (call it the_subset),

.subset2(the_subset, "V1") (or) the_subset$V1 (or) the_subset[["V1"]]

But, the_subset[, V1] will be slower.

Of course the same idea could be extended to many columns as well, but I'll have to know exactly what you want to do after.

上一篇 : ：R：在数据表中的列上循环下一篇 : 返回所有行的数据表最大列宽。

在数据表中快速EXISTS

相关阅读

推荐文章