且构网


Fuzzy matching two strings

Updated: 2022-04-30 23:19:23

In my experience, the cosine distance is a good choice for this kind of job:

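library(stringdist)

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")
# Cosine distance on single characters (q = 1) from each V1 term to every V2 term
result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 1))
rownames(result) <- V2
result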
                  pen document folder      warn
copy folder 0.6797437       0.2132042 0.8613250
warning     0.6150998       0.7817821 0.1666667
pens        0.1339746       0.6726732 0.7500000
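
To make the numbers above less of a black box, here is a minimal base R sketch of how a cosine distance over q-grams can be computed by hand. The helper names `qgram_counts` and `cosine_qgram_dist` are hypothetical and not part of stringdist; the sketch only illustrates the idea (count the overlapping substrings of length q in each string, then take 1 minus the cosine similarity of the two count vectors). Returning `NA` for strings shorter than q is an assumption of this sketch, and stringdist may handle that case differently.

```r
# Count the overlapping q-grams (substrings of length q) of one string.
# Hypothetical helper, not part of the stringdist package.
qgram_counts <- function(s, q = 1) {
  n <- nchar(s)
  if (n < q) return(table(character(0)))
  starts <- seq_len(n - q + 1)
  table(substring(s, starts, starts + q - 1))
}

# Cosine distance = 1 - cosine similarity of the two q-gram count vectors.
cosine_qgram_dist <- function(a, b, q = 1) {
  ca <- qgram_counts(a, q)
  cb <- qgram_counts(b, q)
  # Assumption of this sketch: no q-grams at all means "undefined".
  if (length(ca) == 0 || length(cb) == 0) return(NA_real_)
  shared <- intersect(names(ca), names(cb))
  dot <- sum(as.numeric(ca[shared]) * as.numeric(cb[shared]))
  1 - dot / (sqrt(sum(ca^2)) * sqrt(sum(cb^2)))
}

cosine_qgram_dist("pen", "pens", q = 1)
cosine_qgram_dist("warn", "warning", q = 1)
```

For "pen" and "pens" the count vectors share three single-character grams, so the similarity is 3 / (sqrt(3) * sqrt(4)) and the distance is about 0.134, which agrees with the corresponding cell in the table above.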

You have to define a cutoff below which a distance counts as close enough; the lower the distance, the better the match. You can also experiment with the q parameter, which sets the length of the character sequences (q-grams) that are compared with each other. For example:

result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 3))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 1.0000000       0.5377498 1.0000000
warning     1.0000000       1.0000000 0.3675445
pens        0.2928932       1.0000000 1.0000000
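
The cutoff step described above could be sketched as follows. This is only an illustration: the threshold value `0.3` and the variable names `cutoff`, `best_idx`, `best_dist`, and `matches` are assumptions made for this example, and a suitable threshold depends on your data and on the q you choose.

```r
library(stringdist)

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

# Rows are V2 candidates, columns are V1 terms.
result <- sapply(V1, function(x) stringdist(x, V2, method = "cosine", q = 1))
rownames(result) <- V2

cutoff <- 0.3  # hypothetical threshold; tune it on your own data

# Row index of the closest V2 entry for each V1 term, and its distance.
best_idx  <- apply(result, 2, which.min)
best_dist <- result[cbind(best_idx, seq_along(V1))]

# Keep the closest candidate only if it is within the cutoff, else NA.
matches <- data.frame(
  term     = V1,
  match    = ifelse(best_dist <= cutoff, V2[best_idx], NA_character_),
  distance = best_dist
)
matches
```

With q = 1 all three terms find a candidate under this threshold. The stringdist package also provides `amatch()`, which accepts a `maxDist` argument and may be a more direct route to the same result.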