且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

如何在 Ruby 中进行模糊子字符串匹配?

更新时间:2022-01-14 00:11:42

你可以试试 amatch.它可以作为 ruby​​ gem 使用,虽然我很长时间没有使用模糊逻辑,但它看起来有你需要的东西.amatch 的主页是:http://flori.github.com/amatch/.

You could try amatch. It's available as a ruby gem and, although I haven't worked with fuzzy logic for a long time, it looks to have what you need. The homepage for amatch is: http://flori.github.com/amatch/.

只是对这个想法感到无聊和混乱,一个完全未经优化和未经测试的解决方案黑客如下:

Just bored and messing around with the idea, a completely non-optimized and untested hack of a solution follows:

include 'amatch'

module FuzzyFinder
  def scanner( input )
    out = [] unless block_given?
    pos = 0
    input.scan(/(\w+)(\W*)/) do |word, white|
      startpos = pos
      pos = word.length + white.length
      if block_given?
        yield startpos, word
      else
        out << [startpos, word]
      end
    end
  end

  def find( text, doc )
    index = scanner(doc)
    sstr = text.gsub(/\W/,'')
    levenshtein = Amatch::Levensthtein.new(sstr)
    minlen = sstr.length
    maxndx = index.length
    possibles = []
    minscore = minlen*2
    index.each_with_index do |x, i|
      spos = x[0]
      str = x[1]
      si = i
      while (str.length < minlen)
        i += 1
        break unless i < maxndx
        str += index[i][1]
      end
      str = str.slice(0,minlen) if (str.length > minlen)
      score = levenshtein.search(str)
      if score < minscore
        possibles = [spos]
        minscore = score
      elsif score == minscore
        possibles << spos
      end
    end
    [minscore, possibles]
  end
end

显然有很多改进是可能的,而且可能是必要的!一些最重要的:

Obviously there are numerous improvements possible and probably necessary! A few off the top:

  1. 处理一次文档并存储结果,可能在数据库中.
  2. 确定字符串的可用长度对于初步检查,过程首先针对该初始子字符串在尝试匹配整个片段.
  3. 继上一个之后,预先计算的起始片段那个长度.