Spark RDD: How to compute statistics most efficiently?

Updated: 2023-11-18 21:38:22

You can try reduceByKey. It's pretty straightforward if we only want to compute the min():

rdd.reduceByKey(lambda x,y: min(x,y)).collect()
#Out[84]: [('key3', 2.0), ('key2', 3.0), ('key1', 1.0)]
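
Since min is itself a two-argument function, you can also pass the builtin directly instead of wrapping it in a lambda (and max works the same way):

rdd.reduceByKey(min).collect()
# same result as above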

To calculate the mean, you first need to create (value, 1) tuples, which are used to compute both the sum and the count in the reduceByKey operation. Finally, dividing the sum by the count gives the mean:

meanRDD = (rdd
           .mapValues(lambda x: (x, 1))
           .reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
           .mapValues(lambda x: x[0]/x[1]))

meanRDD.collect()
#Out[85]: [('key3', 5.5), ('key2', 5.0), ('key1', 3.3333333333333335)]
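
If you prefer to skip the separate mapValues step, an equivalent sketch using aggregateByKey (assuming the same rdd defined at the bottom) folds each value into a (sum, count) accumulator directly:

meanRDD2 = (rdd
            .aggregateByKey((0.0, 0),
                            lambda acc, v: (acc[0] + v, acc[1] + 1),   # fold one value into (sum, count)
                            lambda a, b: (a[0] + b[0], a[1] + b[1]))   # merge partial (sum, count) pairs
            .mapValues(lambda x: x[0] / x[1]))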

For the variance, you can use the formula (sumOfSquares/count) - (sum/count)^2, which translates into the following:

varRDD = (rdd
          .mapValues(lambda x: (1, x, x*x))
          .reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1], x[2]+y[2]))
          .mapValues(lambda x: (x[2]/x[0] - (x[1]/x[0])**2)))

varRDD.collect()
#Out[106]: [('key3', 12.25), ('key2', 4.0), ('key1', 2.8888888888888875)]
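
One caveat: the sumOfSquares/count - (sum/count)^2 formula can lose precision when the variance is small relative to the square of the mean, since it subtracts two nearly equal numbers. As an alternative sketch, PySpark's StatCounter can be combined per key with combineByKey to get count, mean, min, max, and variance in a single pass:

from pyspark.statcounter import StatCounter

statsRDD = rdd.combineByKey(
    lambda v: StatCounter([v]),    # seed a StatCounter with the key's first value
    lambda acc, v: acc.merge(v),   # fold another value into the running statistics
    lambda a, b: a.mergeStats(b))  # merge partial statistics across partitions

statsRDD.mapValues(lambda s: (s.min(), s.mean(), s.variance())).collect()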

I used values of type double instead of int in the dummy data to accurately illustrate computing the average and variance:

rdd = sc.parallelize([("key1", 1.0),
                      ("key3", 9.0),
                      ("key2", 3.0),
                      ("key1", 4.0),
                      ("key1", 5.0),
                      ("key3", 2.0),
                      ("key2", 7.0)])