What is the fastest way to compute cosine similarity in Python, given sparse matrix data?

Updated: 2022-01-11 09:04:53

You can compute pairwise cosine similarity between the rows of a sparse matrix directly with sklearn. As of version 0.17, cosine_similarity can also return sparse output:

import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# Dense output (the default): a NumPy array of shape (n_rows, n_rows)
similarities = cosine_similarity(A_sparse)
print('pairwise dense output:\n {}\n'.format(similarities))

# Sparse output: pass dense_output=False to get a SciPy sparse matrix back
similarities_sparse = cosine_similarity(A_sparse, dense_output=False)
print('pairwise sparse output:\n {}\n'.format(similarities_sparse))

Result:

pairwise dense output:
[[ 1.          0.40824829  0.40824829]
[ 0.40824829  1.          0.33333333]
[ 0.40824829  0.33333333  1.        ]]

pairwise sparse output:
(0, 1)  0.408248290464
(0, 2)  0.408248290464
(0, 0)  1.0
(1, 0)  0.408248290464
(1, 2)  0.333333333333
(1, 1)  1.0
(2, 1)  0.333333333333
(2, 0)  0.408248290464
(2, 2)  1.0
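
To see where these numbers come from, the same result can be reproduced by hand: scale each row to unit L2 length, then multiply the matrix by its own transpose. The following is only a sketch for checking the output above, not a claim about how sklearn is implemented internally; it assumes sklearn.preprocessing.normalize is available, and the names A_normalized, manual and reference are illustrative.

```python
import numpy as np
from scipy import sparse
from sklearn.preprocessing import normalize
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# Scale every row to unit L2 norm; the result is still a sparse matrix.
A_normalized = normalize(A_sparse, norm='l2', axis=1)

# Dot products between unit-length rows are exactly their cosine similarities.
manual = A_normalized @ A_normalized.T

# Compare against sklearn's own sparse result.
reference = cosine_similarity(A_sparse, dense_output=False)
print(np.allclose(manual.toarray(), reference.toarray()))
```

Because both the normalization and the product stay sparse, this form can be handy when you want to reuse the normalized matrix for several comparisons.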

If you want column-wise cosine similarities, simply transpose your input matrix beforehand:

A_sparse.transpose()
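
Note that transpose() returns a new matrix and leaves A_sparse unchanged, so the transposed matrix has to be passed on to cosine_similarity. A minimal end-to-end sketch, reusing the example matrix from above (the name column_similarities is illustrative):

```python
import numpy as np
from scipy import sparse
from sklearn.metrics.pairwise import cosine_similarity

A = np.array([[0, 1, 0, 0, 1], [0, 0, 1, 1, 1], [1, 1, 0, 1, 0]])
A_sparse = sparse.csr_matrix(A)

# Rows of the transposed matrix are the columns of A, so the pairwise
# row similarities of A.T are the pairwise column similarities of A.
column_similarities = cosine_similarity(A_sparse.transpose(), dense_output=False)

# One row and one column per column of A.
print(column_similarities.shape)
```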