Computing a distance matrix of sparse vectors with pyspark

Updated: 2021-10-05 03:52:03

The problem was a configuration error that caused my data to be split into 1000 partitions. The solution was simply to tell Spark explicitly how many partitions it should create (e.g. 10):

rdd = sc.parallelize(sparse_vectors, 10)

Moreover, I extended the list of sparse vectors with an enumeration; this way I could filter out pairs that are not part of the upper triangular matrix:

sparse_vectors = [(i, csr_to_sparse_vector(row)) for i, row in enumerate(authors)]
rdd = sc.parallelize(sparse_vectors, 10)
rdd2 = rdd.cartesian(rdd).filter(lambda x: x[0][0] < x[1][0])
rdd2.map(lambda x: jacc_sim(x)).filter(lambda x: x is not None).saveAsTextFile('hdfs:///user/username/similarities')
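
The snippet above assumes a helper csr_to_sparse_vector that turns one row of the authors matrix into a pyspark.mllib SparseVector. It is not shown in the original answer; a minimal sketch, assuming authors is a SciPy CSR matrix, could look like this:

from pyspark.mllib.linalg import SparseVector

def csr_to_sparse_vector(row):
    # row is a 1 x n scipy.sparse.csr_matrix; keep only its non-zero entries
    return SparseVector(row.shape[1], row.indices, row.data)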

The corresponding similarity function is:

def jacc_sim(pair):
    id_0 = pair[0][0]
    vec_0 = pair[0][1]
    id_1 = pair[1][0]
    vec_1 = pair[1][1]
    # dot product of the two sparse vectors
    dot_product = vec_0.dot(vec_1)
    try:
        # normalize by the combined number of non-zero entries
        sim = dot_product / (vec_0.numNonzeros() + vec_1.numNonzeros())
        if sim > 0:
            return (id_0, id_1, sim)
    except ZeroDivisionError:
        pass
    return None
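
As a quick sanity check (not from the original post), jacc_sim can be called locally on a pair of small SparseVectors to see the (id, id, similarity) tuples it emits:

from pyspark.mllib.linalg import SparseVector

v0 = SparseVector(5, [0, 2], [1.0, 1.0])
v1 = SparseVector(5, [2, 4], [1.0, 1.0])
print(jacc_sim(((0, v0), (1, v1))))  # one shared index out of four non-zeros -> (0, 1, 0.25)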

This worked very well for me, and I hope someone else will find it useful as well!