更新时间:2023-12-01 18:21:52
简短答案:除了 nomatch = 0
,我认为 mult =first
。
长答案:假设您想检查一个值(或多于1个值)是否存在于 data.table
或者不,这似乎要快得多。这里唯一的假设是data.table只有一个键列(因为这对我很含糊)。
my .values = c(1:100,1000)
require(data.table)
set.seed(45)
DT setkey(DT,V1)
#data.table way
system.time(all %in%.subset2(DT [J(my.values),mult =first,nomatch = 0],V1)))
用户系统已过
0.006 0.000 0.006
#向量(扫描)方法
system.time(all(my.values%in%.subset2(DT,V1)))
用户系统已过
0.037 0.000 0.038
您可以将全部
更改为 any
如果要检查子集中是否至少存在1个值。两者之间的唯一区别是,您首先 $ c>使用
data.table
的方法(利用键
和 mult
参数)。正如你可以看到它的速度非常快(也很好地缩放)。然后从子集中检索键列(称为 the_subset
),
.subset2(the_subset,V1)(或)the_subset $ V1(或)the_subset [[V1]]
但是, the_subset [,V1]
会慢一些。
同样的想法可以扩展到许多列,但我必须知道你想要做什么后。
What is the fastest way to check if a value exists in a data.table?. Suppose that
i
argument of [.data.table
I'm currently doing
NROW(dt[keys,nomatch=0])!=0
Is there anything faster?
Example
require(data.table)
iniDate = as.Date("2000-01-01")
theData = data.table(a = LETTERS, d = seq(from=iniDate ,to= iniDate + length(LETTERS)*3000-1,by="day"))
theKeys = data.table(a = c("J","M","T"), d = as.Date(c("2005-01-20","2005-05-20","2013-01-12")))
setkey(theData,a,d)
NROW(theData[theKeys],nomatch=0)!=0
Short answer: In addition to nomatch=0
, I think mult="first"
would help speed it even more.
Long answer: Assuming that you want to check if a value (or more than 1 value) is present in the key column of a data.table
or not, this seems to be much faster. The only assumption here is that the data.table has only 1 key column (as this is quite ambiguous to me).
my.values = c(1:100, 1000)
require(data.table)
set.seed(45)
DT <- as.data.table(matrix(sample(2e4, 1e6*100, replace=TRUE), ncol=100))
setkey(DT, "V1")
# the data.table way
system.time(all(my.values %in% .subset2(DT[J(my.values), mult="first", nomatch=0], "V1")))
user system elapsed
0.006 0.000 0.006
# vector (scan) approach
system.time(all(my.values %in% .subset2(DT, "V1")))
user system elapsed
0.037 0.000 0.038
You can change all
to any
if you want to check if at least 1 value is present in the subset or not. The only difference between the two is that you first subset
using data.table
's approach (taking advantage of key
and mult
argument). As you can see the it's extremely faster (and also scales well). And then to retrieve the key columns from the subset (call it the_subset
),
.subset2(the_subset, "V1") (or) the_subset$V1 (or) the_subset[["V1"]]
But, the_subset[, V1]
will be slower.
Of course the same idea could be extended to many columns as well, but I'll have to know exactly what you want to do after.