且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

在数据表中快速EXISTS

更新时间:2023-12-01 18:21:52

简短答案:除了 nomatch = 0 ,我认为 mult =first



长答案:假设您想检查一个值(或多于1个值)是否存在于 data.table 或者不,这似乎要快得多。这里唯一的假设是data.table只有一个键列(因为这对我很含糊)。

  my .values = c(1:100,1000)
require(data.table)
set.seed(45)
DT setkey(DT,V1)
#data.table way
system.time(all %in%.subset2(DT [J(my.values),mult =first,nomatch = 0],V1)))
用户系统已过
0.006 0.000 0.006

#向量(扫描)方法
system.time(all(my.values%in%.subset2(DT,V1)))
用户系统已过
0.037 0.000 0.038

您可以将全部更改为 any 如果要检查子集中是否至少存在1个值。两者之间的唯一区别是,您首先 $ c>使用 data.table 的方法(利用 mult 参数)。正如你可以看到它的速度非常快(也很好地缩放)。然后从子集中检索键列(称为 the_subset ),

  .subset2(the_subset,V1)(或)the_subset $ V1(或)the_subset [[V1]] 

但是, the_subset [,V1] 会慢一些。



同样的想法可以扩展到许多列,但我必须知道你想要做什么后。


What is the fastest way to check if a value exists in a data.table?. Suppose that

  • dt is a data.table of n columns with k columns being the key
  • keys is a list, or a value, or a data.table, or anything that can be used in the i argument of [.data.table

I'm currently doing

NROW(dt[keys,nomatch=0])!=0

Is there anything faster?

Example

require(data.table)
iniDate = as.Date("2000-01-01")
theData = data.table(a = LETTERS, d = seq(from=iniDate ,to= iniDate + length(LETTERS)*3000-1,by="day"))
theKeys = data.table(a = c("J","M","T"), d = as.Date(c("2005-01-20","2005-05-20","2013-01-12")))
setkey(theData,a,d)
NROW(theData[theKeys],nomatch=0)!=0

Short answer: In addition to nomatch=0, I think mult="first" would help speed it even more.

Long answer: Assuming that you want to check if a value (or more than 1 value) is present in the key column of a data.table or not, this seems to be much faster. The only assumption here is that the data.table has only 1 key column (as this is quite ambiguous to me).

my.values = c(1:100, 1000)
require(data.table)
set.seed(45)
DT <- as.data.table(matrix(sample(2e4, 1e6*100, replace=TRUE), ncol=100))
setkey(DT, "V1")
# the data.table way
system.time(all(my.values %in% .subset2(DT[J(my.values), mult="first", nomatch=0], "V1")))
   user  system elapsed 
  0.006   0.000   0.006 

# vector (scan) approach
system.time(all(my.values %in% .subset2(DT, "V1")))
   user  system elapsed 
  0.037   0.000   0.038 

You can change all to any if you want to check if at least 1 value is present in the subset or not. The only difference between the two is that you first subset using data.table's approach (taking advantage of key and mult argument). As you can see the it's extremely faster (and also scales well). And then to retrieve the key columns from the subset (call it the_subset),

.subset2(the_subset, "V1") (or) the_subset$V1 (or) the_subset[["V1"]]

But, the_subset[, V1] will be slower.

Of course the same idea could be extended to many columns as well, but I'll have to know exactly what you want to do after.