Updated: 2022-04-24 17:36:26
There are several levels of optimization possible here to turn this problem from O(n^2) into something with a lower time complexity.
Preprocessing: Sort your list in the first pass, creating an output map for each string; the key for the map can be the normalized string. Normalizations may include:

- lowercasing
- removing whitespace and special characters
- converting accented unicode characters to their ASCII equivalents
This would result in "Andrew H Smith", "andrew h. smith", and "ándréw h. smith" all generating the same key "andrewhsmith", and would reduce your set of millions of names to a smaller set of unique/similar grouped names.
You can use this utility method to normalize your strings (it does not include the unicode part, though):
import re

def process_str_for_similarity_cmp(input_str, normalized=False, ignore_list=[]):
    """Processes a string for similarity comparisons; cleans special characters
    and extra whitespace if normalized is True, and removes the substrings
    which are in ignore_list.

    Args:
        input_str (str): input string to be processed
        normalized (bool): if True, removes special characters and extra
            whitespace from the string, and converts it to lowercase
        ignore_list (list): the substrings which need to be removed from the
            input string

    Returns:
        str: the processed string
    """
    for ignore_str in ignore_list:
        input_str = re.sub(r'{0}'.format(ignore_str), "", input_str, flags=re.IGNORECASE)
    if normalized is True:
        input_str = input_str.strip().lower()
        # clean special chars and extra whitespace
        input_str = re.sub(r"\W", "", input_str).strip()
    return input_str
Now similar strings will already lie in the same bucket if their normalized key is the same.
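The bucketing of names under a shared normalized key can be sketched as follows. Here `normalize_key` is a hypothetical, minimal stand-in for the utility method above (it omits the unicode handling), and the sample names are illustrative:

```python
import re
from collections import defaultdict

def normalize_key(name):
    # minimal normalization: lowercase, then strip non-word characters
    # (the unicode/accent handling mentioned above is omitted here)
    return re.sub(r"\W", "", name.strip().lower())

names = ["Andrew H Smith", "andrew h. smith", "Andrew H. Smeeth"]

buckets = defaultdict(list)  # normalized key -> original names
for name in names:
    buckets[normalize_key(name)].append(name)

# "Andrew H Smith" and "andrew h. smith" end up under the key "andrewhsmith"
```

Each bucket holds the original spellings of one candidate group, so exact duplicates after normalization are already resolved before any fuzzy matching runs.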
For further comparison you will need to compare only the keys, not the names, e.g. andrewhsmith and andrewhsmeeth, since this kind of name similarity will need fuzzy string matching on top of the normalized comparison done above.
Bucketing: Do you really need to compare a 5-character key with a 9-character key to see if that is a 95% match? No, you do not. So you can create buckets for matching your strings, e.g. 5-character keys are matched against 4-6-character keys, 6-character keys against 5-7-character keys, and so on. An n-1 to n+1 character limit for an n-character key is a reasonably good bucket for most practical matching.
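The length-bucketing idea can be sketched like this; the key list is illustrative, and grouping by length means each key is only paired with keys whose length differs by at most one:

```python
from itertools import combinations

keys = ["andrewhsmith", "andrewhsmeeth", "bob", "bobb", "jonathanmiller"]

# group keys by length
by_length = {}
for k in keys:
    by_length.setdefault(len(k), []).append(k)

candidate_pairs = []
for n, group in by_length.items():
    # pairs within the same length bucket
    candidate_pairs.extend(combinations(group, 2))
    # pairs against the n+1 bucket (the n-1 case is covered when that
    # bucket's own iteration looks one length upward)
    for k in group:
        for other in by_length.get(n + 1, []):
            candidate_pairs.append((k, other))
```

This keeps (andrewhsmith, andrewhsmeeth) as a candidate pair while never scheduling a comparison between, say, bob and jonathanmiller.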
Beginning match: Most variations of a name will have the same first character in the normalized format (e.g. Andrew H Smith, ándréw h. smith, and Andrew H. Smeeth generate the keys andrewhsmith, andrewhsmith, and andrewhsmeeth respectively). They will usually not differ in the first character, so you can run matching for keys starting with a only against other keys which start with a and fall within the length buckets. This would greatly reduce your matching time. There is no need to match a key andrewhsmith against bndrewhsmith, as a name variation that changes the first letter will rarely exist.
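Combining the first-character filter with the length buckets might look like the sketch below; `candidates` is a hypothetical helper and the key list is illustrative:

```python
# group keys by first character; only keys sharing the first letter
# and falling within the n-1..n+1 length window are worth comparing
keys = ["andrewhsmith", "andrewhsmeeth", "bndrewhsmith", "annhall"]

by_first_char = {}
for k in keys:
    by_first_char.setdefault(k[0], []).append(k)

def candidates(key):
    # hypothetical helper: the other keys worth a fuzzy comparison with `key`
    return [k for k in by_first_char.get(key[0], [])
            if k != key and abs(len(k) - len(key)) <= 1]

candidates("andrewhsmith")  # -> ["andrewhsmeeth"]; "bndrewhsmith" is skipped
```

bndrewhsmith never reaches the fuzzy matcher because it sits in a different first-character group, and annhall is excluded by the length window even though it starts with a.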
Then you can use something along the lines of this method (or the FuzzyWuzzy module) to find the string similarity percentage; you may exclude one of jaro_winkler or difflib to optimize your speed and result quality:
import difflib

import jellyfish

def find_string_similarity(first_str, second_str, normalized=False, ignore_list=[]):
    """Calculates the matching ratio between two strings.

    Args:
        first_str (str): first string
        second_str (str): second string
        normalized (bool): if True, removes special characters and extra
            whitespace from the strings before calculating the ratio
        ignore_list (list): substrings which are substituted with "" in both
            strings before comparison

    Returns:
        float: a matching ratio between 1.0 (most matching) and 0.0 (not
            matching), using difflib's SequenceMatcher and jellyfish's
            jaro_winkler algorithms with equal weightage to each

    Examples:
        >>> find_string_similarity("hello world", "Hello,World!", normalized=True)
        1.0
        >>> find_string_similarity("entrepreneurship", "entreprenaurship")
        0.95625
        >>> find_string_similarity("Taj-Mahal", "The Taj Mahal", normalized=True, ignore_list=["the", "of"])
        1.0
    """
    first_str = process_str_for_similarity_cmp(first_str, normalized=normalized, ignore_list=ignore_list)
    second_str = process_str_for_similarity_cmp(second_str, normalized=normalized, ignore_list=ignore_list)
    match_ratio = (difflib.SequenceMatcher(None, first_str, second_str).ratio()
                   + jellyfish.jaro_winkler(first_str, second_str)) / 2.0
    return match_ratio
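Putting it all together, the final pass runs a similarity check only over the candidate pairs that survived bucketing. A minimal sketch, using only difflib's SequenceMatcher so the snippet stays dependency-free (the jaro_winkler half above can be averaged in the same way); the 0.8 threshold and the pair list are illustrative assumptions, not values from the answer:

```python
import difflib

# candidate pairs left after the length and first-character bucketing
pairs = [("andrewhsmith", "andrewhsmeeth"), ("bob", "bobb")]

threshold = 0.8  # an assumed cut-off; tune it for your data

# keep only the pairs whose similarity ratio clears the threshold
matches = [(a, b) for a, b in pairs
           if difflib.SequenceMatcher(None, a, b).ratio() >= threshold]
```

Because bucketing has already discarded the hopeless comparisons, this last quadratic-looking step only runs inside small buckets rather than across all million names.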