Improving the performance and speed of string matching in R

Updated: 2022-06-18 23:40:12

You can do this relatively easily with data.table:

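# vinDB alternates car-name rows (odd) and VIN rows (even).
vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
vin.vins <- vinDB[seq(2, nrow(vinDB), 2), ]
# carFile holds one VIN per 4-row record, on the record's 2nd row.
car.vins <- carFile[seq(2, nrow(carFile), 4), ]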

library(data.table)
dt <- data.table(vin.names, vin.vins, key="vin.vins")
dt[J(car.vins), list(NumTimesFound=.N), by=vin.names]
#         vin.names NumTimesFound
#  1:     Ford 2014            15
#  2: Chrysler 1998            10
#  3:       GM 1998             9
#  4:     Ford 1998            11
#  5:   Toyota 2000            12
# ---                            
# 75:   Toyota 2007             7
# 76: Chrysler 1995             4
# 77:   Toyota 2010             5
# 78:   Toyota 2008             1
# 79:       GM 1997             5    

The main thing to understand is that J(car.vins) creates a one-column data.table holding the VINs to match (J is just shorthand for data.table, and it only works inside a data.table's square brackets). Passing that table as the first argument of dt[...] joins the list of VINs to the list of cars, because dt was keyed by "vin.vins" in the prior step. The last argument groups the joined rows by vin.names, and the middle argument asks for the number of rows in each group, .N (.N is a special data.table variable).
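To make the join-then-count step concrete, here is a minimal sketch on hand-made data. The three cars and the VINs "AAA", "BBB" and "CCC" are hypothetical values invented for illustration; only the data.table calls mirror the answer.

```r
library(data.table)

# Hypothetical toy data (not from the original post): three cars, one VIN each.
vin.names <- c("Ford 2014", "GM 1998", "Ford 2014")
vin.vins  <- c("AAA", "BBB", "CCC")

# VINs seen in the car file: "AAA" twice, "BBB" once, "CCC" once.
car.vins <- c("AAA", "BBB", "AAA", "CCC")

# Key the lookup table by VIN so that dt[J(...)] joins on that column.
dt <- data.table(vin.names, vin.vins, key = "vin.vins")

# J(car.vins) builds a one-column table; each of its rows is looked up in dt,
# so the result has one row per entry of car.vins.
joined <- dt[J(car.vins)]
print(joined)

# Same join, then count the joined rows per car name with .N.
counts <- dt[J(car.vins), list(NumTimesFound = .N), by = vin.names]
print(counts)
```

On this toy input the result has two rows: Ford 2014 with a count of 3 (two hits on "AAA" plus one on "CCC") and GM 1998 with a count of 1.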

Also, I made some junk data to run this on. In the future, please provide data like this.

set.seed(1)
makes <- c("Toyota", "Ford", "GM", "Chrysler")
years <- 1995:2014
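# 500 random "make year" labels and 500 random 16-letter VINs.
cars <- paste(sample(makes, 500, replace = TRUE), sample(years, 500, replace = TRUE))
vins <- unlist(replicate(500, paste0(sample(LETTERS, 16), collapse = "")))
# Interleave so vinDB reads: name, VIN, name, VIN, ...
vinDB <- data.frame(c(cars, vins)[order(rep(1:500, 2))])
# Interleave so each 4-row record of carFile reads: junk, VIN, junk, junk.
carFile <-
  data.frame(
    c(rep("junk", 1000), sample(vins, 1000, replace = TRUE), rep("junk", 2000))[order(rep(1:1000, 4))]
  )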
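As a cross-check on the data.table counts, the same tally can be sketched in base R with match() and table(). This is an alternative technique, not the method used in the answer, and the tiny vinDB and carFile below are hypothetical stand-ins that only copy the row layout of the junk data (name/VIN pairs, and 4-row records with the VIN on the second row).

```r
# Hypothetical toy inputs with the same row layout as the junk data.
toy.names <- c("Ford 2014", "GM 1998", "Ford 2014")
toy.vins  <- c("AAA", "BBB", "CCC")
seen      <- c("AAA", "BBB", "AAA", "CCC")

# c(rbind(...)) interleaves its arguments: name, VIN, name, VIN, ...
vinDB <- data.frame(x = c(rbind(toy.names, toy.vins)),
                    stringsAsFactors = FALSE)
# Each record is: junk, VIN, junk, junk.
carFile <- data.frame(x = c(rbind("junk", seen, "junk", "junk")),
                      stringsAsFactors = FALSE)

# Same row slicing as in the answer.
vin.names <- vinDB[seq(1, nrow(vinDB), 2), ]
vin.vins  <- vinDB[seq(2, nrow(vinDB), 2), ]
car.vins  <- carFile[seq(2, nrow(carFile), 4), ]

# match() gives, for each car VIN, its position in vin.vins (NA if absent).
idx <- match(car.vins, vin.vins)

# Look up the car name at each position and tabulate the names.
base.counts <- table(vin.names[idx])
print(base.counts)
```

If the two approaches disagree on real data, the row offsets passed to seq() are the first thing to check, since both rely on the same assumed file layout.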