更新时间:2023-02-27 09:28:13
距离 python 中的包提供了汉明距离计算器:
The distance package in python provides a hamming distance calculator:
import distance
distance.levenshtein("lenvestein", "levenshtein")
distance.hamming("hamming", "hamning")
还有一个 levenshtein 包,它提供了 levenshtein 距离计算.最后 difflib 可以提供一些简单的字符串比较.
There is also a levenshtein package which provides levenshtein distance calculations. Finally difflib can provide some simple string comparisons.
您现有的代码很慢,因为您在最内部的循环中重新计算文件哈希,这意味着每个文件都被多次哈希.如果你先计算散列,那么这个过程会更有效率:
Your existing code is slow because you recalculate the file hash in the most inner loop, which means every file gets hashed many times. If you calculate the hash first then the process will be much more efficient:
files = ...
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
file_comparisons = [
(hamming(first[0], second[0]), first, second)
for second in files
for first in files
if first[1] != second[1]
]
这个过程基本上涉及 O(N^2)
比较,因此以适合于 map reduce 问题的方式分发它涉及获取完整的字符串集并将它们划分为 B
块,其中 B^2 = M
(B = 字符串块数,M = 工人数).因此,如果您有 16 个字符串和 4 个工人,您会将字符串列表分成两个块(因此块大小为 8).一个划分工作的例子如下:
This process fundamentally involves O(N^2)
comparisons so to distribute this in a way suitable for a map reduce problem involves taking the complete set of strings and dividing them into B
blocks where B^2 = M
(B = number of string blocks, M = number of workers). So if you had 16 strings and 4 workers you would split the list of strings into two blocks (so a block size of 8). An example of dividing the work follows:
all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)