Updated: 2022-04-30 23:19:23
In my experience, the cosine distance is a good choice for this kind of job:
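library(stringdist)

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")
# One column per V1 term, one row per V2 candidate; q = 1 compares single letters
result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 1))
rownames(result) <- V2
result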
                  pen document folder      warn
copy folder 0.6797437       0.2132042 0.8613250
warning     0.6150998       0.7817821 0.1666667
pens        0.1339746       0.6726732 0.7500000
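To see where a value such as 0.1339746 (pen vs. pens) comes from, here is a rough base-R sketch that recomputes it by hand. It assumes the cosine distance is 1 minus the cosine similarity of the two strings' q-gram count vectors, which is how I understand the stringdist documentation. The helpers `qgrams_of` and `cosine_dist` are made up for illustration and are not part of the package.

```r
# Hypothetical helper: all overlapping q-grams of a string
# (assumes the string has at least q characters)
qgrams_of <- function(s, q) {
  starts <- seq_len(nchar(s) - q + 1)
  substring(s, starts, starts + q - 1)
}

# Hypothetical helper: 1 - cosine similarity of the q-gram count vectors
cosine_dist <- function(a, b, q = 1) {
  ga <- qgrams_of(a, q)
  gb <- qgrams_of(b, q)
  grams <- union(ga, gb)
  # Count every q-gram over the shared vocabulary, so both vectors line up
  va <- as.numeric(table(factor(ga, levels = grams)))
  vb <- as.numeric(table(factor(gb, levels = grams)))
  1 - sum(va * vb) / sqrt(sum(va^2) * sum(vb^2))
}

# "pen" and "pens" share p, e, n: 1 - 3 / sqrt(3 * 4)
cosine_dist("pen", "pens", q = 1)  # 0.1339746
```

The two strings share three of their four distinct letters, so the distance is small. Strings with no q-gram in common get a distance of exactly 1.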
You have to define a cutoff below which a distance counts as close enough; the lower the distance, the better the match. You can also experiment with the q parameter, which sets how many consecutive letters make up each combination (q-gram) that is compared. For example:
result <- sapply(V1, function(x) stringdist(x, V2, method = 'cosine', q = 3))
rownames(result) <- V2
result
                  pen document folder      warn
copy folder 1.0000000       0.5377498 1.0000000
warning     1.0000000       1.0000000 0.3675445
pens        0.2928932       1.0000000 1.0000000
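The cutoff idea could be sketched like this: for each term in V1, take the closest candidate in V2 and keep it only if its distance is at or below the threshold. The value 0.4 is an arbitrary assumption for illustration, not a recommendation, and the names `cutoff`, `best_idx`, `best_dist` and `matches` are mine. You would have to tune the threshold on your own data.

```r
library(stringdist)

V1 <- c("pen", "document folder", "warn")
V2 <- c("copy folder", "warning", "pens")

# Same distance matrix as above: rows are V2 candidates, columns are V1 terms
result <- sapply(V1, function(x) stringdist(x, V2, method = "cosine", q = 1))
rownames(result) <- V2

cutoff <- 0.4  # hypothetical threshold, chosen only for this example

# For every V1 term (column), find the nearest V2 candidate and its distance
best_idx  <- apply(result, 2, which.min)
best_dist <- apply(result, 2, min)

matches <- data.frame(
  term     = V1,
  # Keep the candidate only if it is close enough, otherwise report no match
  match    = unname(ifelse(best_dist <= cutoff, V2[best_idx], NA_character_)),
  distance = unname(best_dist),
  stringsAsFactors = FALSE
)
matches
```

With the q = 1 distances shown earlier, all three terms stay under this threshold: pen pairs with pens, document folder with copy folder, and warn with warning. A term whose best candidate lies above the cutoff would get NA instead.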