且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

高效使用python计算汉明距离

更新时间:2023-02-27 09:28:13

距离 python 中的包提供了汉明距离计算器:

The distance package in python provides a hamming distance calculator:

import distance

distance.levenshtein("lenvestein", "levenshtein")
distance.hamming("hamming", "hamning")

还有一个 levenshtein 包,它提供了 levenshtein 距离计算.最后 difflib 可以提供一些简单的字符串比较.

There is also a levenshtein package which provides levenshtein distance calculations. Finally difflib can provide some simple string comparisons.

这个老问题上有更多信息和示例代码.

您现有的代码很慢,因为您在最内部的循环中重新计算文件哈希,这意味着每个文件都被多次哈希.如果你先计算散列,那么这个过程会更有效率:

Your existing code is slow because you recalculate the file hash in the most inner loop, which means every file gets hashed many times. If you calculate the hash first then the process will be much more efficient:

files = ...
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
file_comparisons = [
    (hamming(first[0], second[0]), first, second)
    for second in files
    for first in files
    if first[1] != second[1]
]

这个过程基本上涉及 O(N^2) 比较,因此以适合于 map reduce 问题的方式分发它涉及获取完整的字符串集并将它们划分为 B 块,其中 B^2 = M(B = 字符串块数,M = 工人数).因此,如果您有 16 个字符串和 4 个工人,您会将字符串列表分成两个块(因此块大小为 8).一个划分工作的例子如下:

This process fundamentally involves O(N^2) comparisons so to distribute this in a way suitable for a map reduce problem involves taking the complete set of strings and dividing them into B blocks where B^2 = M (B = number of string blocks, M = number of workers). So if you had 16 strings and 4 workers you would split the list of strings into two blocks (so a block size of 8). An example of dividing the work follows:

all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)