更新时间:2022-04-24 17:41:50
你已经完成了 90% 的路......
You are 90% of the way there...
你说你想
知道字符串与df2中的哪一行数据匹配
know with which row of data the string was matched from df2
您只需要了解您已有的代码.见 ?amatch
:
You just need to understand the code you already have. See ?amatch
:
amatch
返回 x
在 table
中最接近匹配的位置.当存在多个具有相同最小距离度量的匹配时,返回第一个.
amatch
returns the position of the closest match ofx
intable
. When multiple matches with the same smallest distance metric exist, the first one is returned.
换句话说,amatch
为您提供 df2
(即您的 table
)中最接近匹配的行的索引df1
中的每个地址(即您的 x
).您通过返回新地址来过早地包装此索引.
In other words, amatch
gives you the index for the row in df2
(which is your table
) that is the closest match of each address in df1
(which is your x
). You are prematurely wrapping this index by returning the new address instead.
相反,检索索引本身以进行查找或为左连接检索 unique_id(如果您确信它确实是唯一的 id).
Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.
两种方法的说明:
library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater
# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
amatch(str, strVec, method = "dl", maxDist=n,useBytes = T) # are you sure you want useBytes = TRUE?
}
# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) | nchar(Address1>0), # I would exclude only on NA_character_ but also empty string, perhaps string of length < 3
unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options
# Option 2: use the row index
df1[!is.na(Address1) | nchar(Address1>0),
df2_pos := match_pos(Address1, df2$Address2,3) ]
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]