且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

R模糊字符串匹配以根据匹配的字符串返回特定列

更新时间:2022-04-24 17:41:50

你已经完成了 90% 的路......

You are 90% of the way there...

你说你想

知道字符串与df2中的哪一行数据匹配

know with which row of data the string was matched from df2

您只需要了解您已有的代码.见 ?amatch:

You just need to understand the code you already have. See ?amatch:

amatch 返回 xtable 中最接近匹配的位置.当存在多个具有相同最小距离度量的匹配时,返回第一个.

amatch returns the position of the closest match of x in table. When multiple matches with the same smallest distance metric exist, the first one is returned.

换句话说,amatch 为您提供 df2(即您的 table)中最接近匹配的行的索引df1 中的每个地址(即您的 x).您通过返回新地址来过早地包装此索引.

In other words, amatch gives you the index for the row in df2 (which is your table) that is the closest match of each address in df1 (which is your x). You are prematurely wrapping this index by returning the new address instead.

相反,检索索引本身以进行查找为左连接检索 unique_id(如果您确信它确实是唯一的 id).

Instead, retrieve either the index itself for lookup or the unique_id (if you are confident that it is truly a unique id) for a left join.

两种方法的说明:

library(data.table) # you forgot this in your example
library(stringdist)
df1 <- data.table(Address1 = c("786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr","23/4, 23RD FLOOR, STREET 2, ABC-E, PQR","45-B, GALI NO5, XYZ","HECTIC, 99 STREET, PQR","786, GALI NO 5, XYZ","rambo, 45, strret 4, atlast, pqr"),
                  Year1 = 2001:2007) # already a vector, no need to combine
df2 <- data.table(Address2=c("abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR","abc, pqr, xyz","786, GALI NO 4 XYZ","45B, GALI NO 5, XYZ","del, 546, strret2, towards east, pqr","23/4, STREET 2, PQR"),
                  Year2=2001:2010)
df2[,unique_id := sprintf("%06d", .I)] # use .I, it's neater

# Return position from strVec of closest match to str
match_pos = function(str, strVec, n){
  amatch(str, strVec, method = "dl", maxDist=n,useBytes = T) # are you sure you want useBytes = TRUE?
}

# Option 1: use unique_id as a key for left join
df1[!is.na(Address1) | nchar(Address1>0), # I would exclude only on NA_character_ but also empty string, perhaps string of length < 3
    unique_id := df2$unique_id[match_pos(Address1, df2$Address2,3)] ]
merge(df1, df2, by='unique_id', all.x=TRUE) # see ?merge for more options

# Option 2: use the row index
df1[!is.na(Address1) | nchar(Address1>0),
    df2_pos := match_pos(Address1, df2$Address2,3) ] 
df1[!is.na(df2_pos), (c('Address2','Year2','UniqueID')):=df2[df2_pos,.(Address2,Year2,unique_id)] ][]