
Fuzzy string matching in Python

Updated: 2022-04-24 17:36:26

There are several levels of optimization possible here to turn this problem from O(n^2) into something with a lower time complexity.


  • Preprocessing: Sort your list in a first pass, creating an output map for each string; the key for the map can be the normalized string. Normalizations may include:

  • lowercase conversion,
  • whitespace and special-character removal,
  • transforming Unicode to its ASCII equivalents where possible (use unicodedata.normalize or the unidecode module).

This would result in "Andrew H Smith", "andrew h. smith", and "ándréw h. smith" all generating the same key "andrewhsmith", and would reduce your set of a million names to a smaller set of unique, similarly grouped names.

You can use this utility method to normalize your strings (it does not include the Unicode part, though):

import re

def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=[]):
    """ Processes string for similarity comparisons: cleans special characters and extra whitespace
        if normalized is True, and removes the substrings which are in ignore_list.
    Args:
        input_str (str) : input string to be processed
        normalized (bool) : if True, method removes special characters and extra whitespace
                            from the string, and converts it to lowercase
        ignore_list (list) : the substrings which need to be removed from the input string
    Returns:
       str : returns processed string
    """
    for ignore_str in ignore_list:
        # re.escape so the ignore strings are treated literally, not as regex patterns
        input_str = re.sub(re.escape(ignore_str), "", input_str, flags=re.IGNORECASE)

    if normalized is True:
        input_str = input_str.strip().lower()
        # clean special chars and extra whitespace (raw string avoids an invalid-escape warning)
        input_str = re.sub(r"\W", "", input_str).strip()

    return input_str
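
Since the utility above leaves out the Unicode step, here is a minimal sketch of that part using unicodedata.normalize, as mentioned earlier (the unidecode module is a drop-in alternative); to_ascii is a hypothetical helper name:

import unicodedata

def to_ascii(input_str):
    """Transliterates a Unicode string to its closest ASCII equivalent.
    NFKD decomposition splits accented characters into a base character plus
    combining marks; encoding to ASCII with errors="ignore" then drops the
    marks, so "ándréw" becomes "andrew".
    """
    decomposed = unicodedata.normalize("NFKD", input_str)
    return decomposed.encode("ascii", "ignore").decode("ascii")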

  • Now similar strings will already lie in the same bucket if their normalized key is the same.
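
As a concrete sketch of that first pass, reusing the helpers above (the sample names list is made up for illustration):

from collections import defaultdict

# Hypothetical first pass: group the original names under their normalized
# key, so exact variants collapse into the same bucket automatically.
names = ["Andrew H Smith", "andrew h. smith", "ándréw h. smith", "John Doe"]

key_to_names = defaultdict(list)
for name in names:
    key = process_str_for_similarity_cmp(to_ascii(name), normalized=True)
    key_to_names[key].append(name)

# key_to_names["andrewhsmith"] now holds all three "Andrew" variants.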

For further comparison, you will need to compare only the keys, not the names, e.g. andrewhsmith and andrewhsmeeth, since similarity between names like these will need fuzzy string matching beyond the normalized comparison done above.

  • Bucketing: Do you really need to compare a 5-character key with a 9-character key to see whether they are a 95% match? No, you do not. So you can create buckets for matching your strings: e.g. 5-character keys are matched against 4-6 character keys, 6-character keys against 5-7 character keys, and so on. An n-1 to n+1 character limit for an n-character key is a reasonably good bucket for most practical matching.
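
A minimal sketch of such length bucketing, assuming the key_to_names map built earlier (candidate_keys is a hypothetical helper name):

from collections import defaultdict

# Index the unique normalized keys by length, so an n-character key is only
# ever compared against keys of length n-1, n, and n+1.
keys_by_length = defaultdict(list)
for key in key_to_names:
    keys_by_length[len(key)].append(key)

def candidate_keys(key):
    """Yields keys whose length is within +/-1 of the given key's length."""
    n = len(key)
    for length in (n - 1, n, n + 1):
        for other in keys_by_length.get(length, []):
            if other != key:
                yield other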

  • Beginning match: Most variations of names will have the same first character in the normalized format (e.g. Andrew H Smith, ándréw h. smith, and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith, and andrewhsmeeth respectively). They will usually not differ in the first character, so you can run matching for keys starting with a only against other keys which start with a and fall within the length buckets. This greatly reduces your matching time: there is no need to match the key andrewhsmith against bndrewhsmith, as such a name variation in the first letter will rarely exist.
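
Combining this with the length bucketing above, the index can simply be keyed on (first character, length); candidate_keys_refined is again a hypothetical name:

from collections import defaultdict

# Key the index on (first character, length) so that e.g. andrewhsmith is
# never even considered against bndrewhsmith. Assumes keys are non-empty.
keys_by_first_and_length = defaultdict(list)
for key in key_to_names:
    keys_by_first_and_length[(key[0], len(key))].append(key)

def candidate_keys_refined(key):
    """Yields keys sharing the first character, within +/-1 of the length."""
    n = len(key)
    for length in (n - 1, n, n + 1):
        for other in keys_by_first_and_length.get((key[0], length), []):
            if other != key:
                yield other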

Then you can use something along the lines of the method below (or the FuzzyWuzzy module) to find the string similarity percentage; you may exclude one of jaro_winkler or difflib to optimize for speed and result quality:

import difflib

import jellyfish

def find_string_similarity(first_str, second_str, normalized=False, ignore_list=[]):
    """ Calculates matching ratio between two strings
    Args:
        first_str (str) : First String
        second_str (str) : Second String
        normalized (bool) : if True, method removes special characters and extra whitespace
                            from the strings, then calculates the matching ratio
        ignore_list (list) : list of substrings which have to be substituted with "" in the strings
    Returns:
       float : matching ratio between 1.0 (most matching) and 0.0 (not matching),
               using difflib's SequenceMatcher and jellyfish's jaro_winkler algorithms with
               equal weightage to each
    Examples:
        >>> find_string_similarity("hello world","Hello,World!",normalized=True)
        1.0
        >>> find_string_similarity("entrepreneurship","entreprenaurship")
        0.95625
        >>> find_string_similarity("Taj-Mahal","The Taj Mahal",normalized=True,ignore_list=["the","of"])
        1.0
    """
    first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
    second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
    # The original wrapped the strings in unicode(), a Python 2 leftover; on Python 3
    # they can be passed directly. Note: newer jellyfish releases rename
    # jaro_winkler to jaro_winkler_similarity.
    match_ratio = (difflib.SequenceMatcher(None, first_str, second_str).ratio()
                   + jellyfish.jaro_winkler(first_str, second_str)) / 2.0
    return match_ratio
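
To tie the steps together, a hypothetical driver loop over the buckets (reusing key_to_names and candidate_keys_refined from the sketches above; the 0.9 threshold is an assumed cutoff you would tune for your data) might look like:

# Score only pairs that share a bucket, keeping those above the threshold.
SIMILARITY_THRESHOLD = 0.9  # assumed cutoff, tune for your data

matches = []
for key in key_to_names:
    for other in candidate_keys_refined(key):
        if key < other:  # each unordered pair is scored once
            ratio = find_string_similarity(key, other)
            if ratio >= SIMILARITY_THRESHOLD:
                matches.append((key, other, ratio))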