How to do fuzzy string matching on a bigger-than-memory dictionary in an ordered key-value store?

Updated: 2022-06-18 23:40:00

simhash captures similarity between (small) strings, but it does not really solve the problem of querying for the most similar strings in a bigger-than-RAM dataset. As far as I understand, the original paper recommends indexing some permutations of the hash, which requires a lot of memory and does not take advantage of the ordered nature of an OKVS.

I think I found a hash that captures similarity inside the prefix of the hash:

In [1]: import fuzz                                                                                      

In [2]: hello = fuzz.bbkh("hello")                                                                       

In [3]: helo = fuzz.bbkh("helo")                                                                         

In [4]: hellooo = fuzz.bbkh("hellooo")                                                                   

In [5]: salut = fuzz.bbkh("salut")                                                                       

In [6]: len(fuzz.lcp(hello.hex(), helo.hex())) # Longest Common Prefix                                                      
Out[6]: 213

In [7]: len(fuzz.lcp(hello.hex(), hellooo.hex()))                                                        
Out[7]: 12

In [8]: len(fuzz.lcp(hello.hex(), salut.hex()))                                                          
Out[8]: 0

After a small test over Wikidata labels, it seems to give good results:

$ time python fuzz.py query 10 france
* most similar according to bbk fuzzbuzz
** france    0
** farrance      -2
** freande   -2
** defrance      -2

real    0m0.054s

$ time python fuzz.py query 10 frnace
* most similar according to bbk fuzzbuzz
** farnace   -1
** france    -2
** fernacre      -2

real    0m0.060s

$ time python fuzz.py query 10 beglium
* most similar according to bbk fuzzbuzz
** belgium   -2

real    0m0.047s


$ time python fuzz.py query 10 belgium
* most similar according to bbk fuzzbuzz
** belgium   0
** ajbelgium     -2

real    0m0.059s

$ time python fuzz.py query 10 begium
* most similar according to bbk fuzzbuzz
** belgium   -1
** beijum    -2

real    0m0.047s

Here is an implementation:

from itertools import product
from string import ascii_lowercase

HASH_SIZE = 2**10
BBKH_LENGTH = int(HASH_SIZE * 2 / 8)

# Alphabet plus "$", the start/end padding character used by bbkh.
chars = ascii_lowercase + "$"
# All possible bigrams, sorted so that each bigram maps to a stable bit position.
ONE_HOT_ENCODER = sorted([''.join(x) for x in product(chars, chars)])


def ngram(string, n):
    """Return the list of contiguous n-grams of string."""
    return [string[i:i+n] for i in range(len(string)-n+1)]


def integer2booleans(integer):
    """Expand integer into a list of HASH_SIZE booleans, most significant bit first."""
    return [x == '1' for x in bin(integer)[2:].zfill(HASH_SIZE)]


def chunks(l, n):
    """Yield successive n-sized chunks from l."""
    for i in range(0, len(l), n):
        yield l[i:i + n]


def merkletree(booleans):
    """Fold booleans into a binary OR-tree stored as a flat list, leaves last.

    Each parent is the logical OR of its two children, so the leading bits of
    the result summarize the whole bit set: similar inputs produce trees that
    agree on a long prefix.
    """
    assert len(booleans) == HASH_SIZE
    length = (2 * len(booleans) - 1)
    out = [False] * length
    index = length - 1
    booleans = list(reversed(booleans))
    while len(booleans) > 1:
        # Write the current level at the tail of the flat array.
        for boolean in booleans:
            out[index] = boolean
            index -= 1
        # Build the next level up: one parent per pair of children.
        new = []
        for (right, left) in chunks(booleans, 2):
            value = right or left
            new.append(value)
        booleans = new
    return out


def bbkh(string):
    """Hash string so that similar strings share a long common prefix."""
    integer = 0
    string = "$" + string + "$"
    # One-hot encode the bigrams of the padded string into a HASH_SIZE-bit integer.
    for gram in ngram(string, 2):
        hotbit = ONE_HOT_ENCODER.index(gram)
        hotinteger = 1 << hotbit
        integer = integer | hotinteger
    booleans = integer2booleans(integer)
    # The OR-tree pushes coarse-grained information to the front, so similarity
    # between strings shows up as a shared prefix of the serialized hash.
    tree = merkletree(booleans)
    fuzz = ''.join('1' if x else '0' for x in tree)
    buzz = int(fuzz, 2)
    hash = buzz.to_bytes(BBKH_LENGTH, 'big')
    return hash


def lcp(a, b):
    """Longest Common Prefix between a and b"""
    out = []
    for x, y in zip(a, b):
        if x == y:
            out.append(x)
        else:
            break
    return ''.join(out)
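The question does not show the indexing and query code, so here is only a minimal sketch of one way bbkh could be used with an ordered keyspace. The okvs_insert and okvs_query helpers are hypothetical (they are not part of fuzz.py or typofix.py), and a sorted in-memory list stands in for the ordered key-value store:

import bisect

KEYS = []      # sorted bbkh hex digests, stands in for the OKVS key space
LABELS = {}    # digest -> original label


def okvs_insert(label):
    digest = bbkh(label).hex()
    bisect.insort(KEYS, digest)
    LABELS[digest] = label


def okvs_query(string, count=10):
    digest = bbkh(string).hex()
    # A real OKVS would seek to `digest` and scan neighboring keys in both
    # directions; bisect gives the same neighborhood in memory.
    position = bisect.bisect_left(KEYS, digest)
    candidates = KEYS[max(0, position - count):position + count]
    # Rank candidates by the length of the common hash prefix (longer = closer).
    candidates = sorted(candidates, key=lambda key: len(lcp(digest, key)), reverse=True)
    return [LABELS[key] for key in candidates[:count]]

For example, after okvs_insert("france"), okvs_insert("belgium"), and so on, okvs_query("frnace") returns the labels whose hashes sit closest to the query hash in the keyspace.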

Evaluation: using Wikipedia's list of common misspelled words (around 8k words), taking the top 10 nearest words yields 77% success, with each query taking around 20 ms. Taking the top 100 yields 94% success, with each query taking less than 200 ms. Most mistakes come from joined words (e.g. "abouta" instead of "about a") or words separated with a dash.
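A minimal sketch of that evaluation loop, assuming a list of (misspelling, intended) pairs and reusing the hypothetical okvs_query helper from the sketch above:

def evaluate(pairs, n=10):
    """Fraction of pairs where the intended word appears in the top-n candidates."""
    hits = 0
    for misspelling, intended in pairs:
        if intended in okvs_query(misspelling, count=n):
            hits += 1
    return hits / len(pairs)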

> Check out the code at https://github.com/amirouche/fuzzbuzz/blob/master/typofix.py

Note: computing the hash over a simhash instead of over the input string only works with a bag of lemmas or stems; indeed, it then finds similar documents.