# Efficiently computing Hamming distance in Python

The `distance` package in Python provides a Hamming distance calculator:

```python
import distance

distance.levenshtein("lenvestein", "levenshtein")  # edit distance between the two strings
distance.hamming("hamming", "hamning")             # number of positions where they differ
```
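If you would rather avoid a third-party dependency, Hamming distance over equal-length strings is short enough to write directly; this is a minimal sketch, not the `distance` package's implementation:

```python
def hamming(s1: str, s2: str) -> int:
    """Count positions at which s1 and s2 differ (equal lengths required)."""
    if len(s1) != len(s2):
        raise ValueError("Hamming distance is only defined for equal-length strings")
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

print(hamming("hamming", "hamning"))  # → 1
```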

There is also a `python-Levenshtein` package which provides Levenshtein distance calculations. Finally, `difflib` can provide some simple string comparisons.
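For completeness, here is what a `difflib` comparison looks like; note that `SequenceMatcher` reports a similarity ratio in `[0, 1]` rather than an edit distance:

```python
import difflib

# ratio() returns similarity, not distance: 1.0 means identical strings
ratio = difflib.SequenceMatcher(None, "hamming", "hamning").ratio()
print(ratio)
```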

Your existing code is slow because you recalculate the file hash in the innermost loop, which means every file gets hashed many times. If you calculate the hashes first, the process will be much more efficient:

```python
files = ...
# Hash each file exactly once, up front, instead of inside the comparison loops
files_and_hashes = [(f, pHash.imagehash(f)) for f in files]
file_comparisons = [
    (hamming(first[1], second[1]), first, second)  # compare the precomputed hashes
    for second in files_and_hashes
    for first in files_and_hashes
    if first[0] != second[0]  # skip comparing a file against itself
]
```
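Since Hamming distance is symmetric, `itertools.combinations` lets you compare each unordered pair exactly once instead of twice. A runnable sketch, using a stand-in hash function in place of `pHash.imagehash` (which is an assumption here, just to keep the example self-contained):

```python
import itertools

def hamming(h1: str, h2: str) -> int:
    return sum(c1 != c2 for c1, c2 in zip(h1, h2))

def fake_hash(name: str) -> str:
    # Hypothetical stand-in for pHash.imagehash: maps a file to a fixed-length string
    return name.ljust(8, "0")[:8]

files = ["img_a", "img_b", "img_c"]
files_and_hashes = [(f, fake_hash(f)) for f in files]

# combinations yields each unordered pair once, so every hash is computed
# once and every pair is compared once
comparisons = [
    (hamming(h1, h2), f1, f2)
    for (f1, h1), (f2, h2) in itertools.combinations(files_and_hashes, 2)
]
print(len(comparisons))  # → 3 pairs from 3 files
```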

This process fundamentally involves `O(N^2)` comparisons, so to distribute it as a map-reduce problem you take the complete set of strings and divide them into `B` blocks, where `B^2 = M` (`B` = number of string blocks, `M` = number of workers). So if you had 16 strings and 4 workers, you would split the list of strings into two blocks of 8 strings each. An example of dividing the work follows:

```python
all_strings = [...]
first_8 = all_strings[:8]
last_8 = all_strings[8:]
# Each worker compares every string in its first block against every string
# in its second block; together the four tasks cover all N^2 pairs
compare_all(machine_1, first_8, first_8)
compare_all(machine_2, first_8, last_8)
compare_all(machine_3, last_8, first_8)
compare_all(machine_4, last_8, last_8)
```
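The blocking scheme above can be sketched end to end in a single process; `compare_all` is a hypothetical helper here, and in a real map-reduce setup each task in `tasks` would go to a separate worker:

```python
import itertools

def hamming(s1: str, s2: str) -> int:
    return sum(c1 != c2 for c1, c2 in zip(s1, s2))

def compare_all(block_a, block_b):
    # One worker's share: compare every string in block_a to every string in block_b
    return [(hamming(a, b), a, b) for a in block_a for b in block_b]

all_strings = ["%04d" % i for i in range(16)]  # 16 toy equal-length "strings"
B = 2                                          # B blocks -> B**2 == 4 worker tasks
size = len(all_strings) // B
blocks = [all_strings[i * size:(i + 1) * size] for i in range(B)]

# Every ordered pair of blocks becomes one task; run sequentially here
tasks = list(itertools.product(blocks, repeat=2))
results = [compare_all(a, b) for a, b in tasks]
print(len(tasks), sum(len(r) for r in results))  # → 4 tasks, 256 comparisons total
```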