Checking for a fuzzy/approximate substring existing in a longer string, in Python?

Using algorithms like Levenshtein distance (a Levenshtein library or difflib), it is easy to find approximate matches, e.g.:

>>> import difflib
>>> difflib.SequenceMatcher(None,"amazing","amaging").ratio()
0.8571428571428571

Fuzzy matches can then be detected by choosing a threshold as needed.
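
For example, a minimal sketch of such a threshold test (the 0.8 cutoff and the helper name are arbitrary choices for illustration):

import difflib

def is_fuzzy_match(a, b, threshold=0.8):
    # treat two strings as a fuzzy match when their similarity ratio
    # reaches the chosen threshold
    return difflib.SequenceMatcher(None, a, b).ratio() >= threshold

is_fuzzy_match("amazing", "amaging")  # True (ratio ~0.857, as shown above)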

Current requirement: find fuzzy substrings in a bigger string, based on a threshold.

eg.

large_string = "thelargemanhatanproject is a great project in themanhattincity"
query_string = "manhattan"
#result = "manhatan","manhattin" and their indexes in large_string

One brute-force solution is to generate all substrings of length N-1 to N+1 (or some other matching length), where N is the length of query_string, run Levenshtein against each one, and check the result against the threshold, roughly as sketched below.
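
To make that concrete, here is a rough sketch of the brute-force approach using difflib (the function name, the 0.8 threshold, and the window-length range are illustrative assumptions, not a recommended implementation):

import difflib

def fuzzy_substrings(query, text, threshold=0.8):
    # score every window of length N-1 .. N+1 against the query
    n = len(query)
    results = []
    for size in range(max(1, n - 1), n + 2):
        for start in range(len(text) - size + 1):
            window = text[start:start + size]
            ratio = difflib.SequenceMatcher(None, query, window).ratio()
            if ratio >= threshold:
                results.append((start, window, ratio))
    return results

fuzzy_substrings("manhattan", large_string)
# expected to include windows around "manhatan" and "manhattin" with their indexes;
# overlapping windows will also score highly and may need deduplication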

Is there a better solution available in Python, preferably a module included in Python 2.7, or an externally available module?

---------------------UPDATE AND SOLUTION ----------------

The Python regex module works pretty well, though it is a little slower than the built-in re module for fuzzy substring cases, which is an expected outcome given the extra work involved. The desired output is good, and the magnitude of the allowed fuzziness is easy to control.

>>> import regex
>>> input = "Monalisa was painted by Leonrdo da Vinchi"
>>> regex.search(r'\b(leonardo){e<3}\s+(da)\s+(vinci){e<2}\b',input,flags=regex.IGNORECASE)
<regex.Match object; span=(23, 41), match=' Leonrdo da Vinchi', fuzzy_counts=(0, 2, 1)>
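
Applied to the original example, something along these lines should surface both near matches (a sketch: the {e<=2} error budget and the BESTMATCH flag, which asks the module to prefer the best fuzzy match at each position, are choices made here for illustration):

import regex

large_string = "thelargemanhatanproject is a great project in themanhattincity"

# allow up to two errors (insertions/deletions/substitutions) per occurrence;
# fuzzy_counts is a (substitutions, insertions, deletions) tuple
hits = [(m.group(), m.span(), m.fuzzy_counts)
        for m in regex.finditer(r'(manhattan){e<=2}', large_string,
                                flags=regex.BESTMATCH)]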

The new regex library that's soon supposed to replace re includes fuzzy matching.

https://pypi.python.org/pypi/regex/

The fuzzy matching syntax looks fairly expressive, but this pattern would give you a match with one or fewer insertions/substitutions/deletions:

import regex
regex.match('(amazing){e<=1}', 'amaging')
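
The error budget can also be constrained per error type instead of one overall count; in the regex module's fuzzy syntax, i, d and s stand for insertions, deletions and substitutions (a small sketch rather than an exhaustive reference):

import regex

regex.match('(amazing){s<=1}', 'amaging')             # allow one substitution only
regex.match('(amazing){i<=1,d<=1,s<=1}', 'amaging')   # separate limits per error type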