更新时间:2023-11-18 21:38:22
你可以试试reduceByKey
.如果我们只想计算 min()
:
You can try reduceByKey
. It's pretty straightforward if we only want to compute the min()
:
rdd.reduceByKey(lambda x,y: min(x,y)).collect()
#Out[84]: [('key3', 2.0), ('key2', 3.0), ('key1', 1.0)]
要计算 mean
,您首先需要创建 (value, 1)
元组,我们用它来计算 sum
和 count
在 reduceByKey
操作中.最后,我们将它们相互除以得到 mean
:
To calculate the mean
, you'll first need to create (value, 1)
tuples which we use to calculate both the sum
and count
in the reduceByKey
operation. Lastly we divide them by each other to arrive at the mean
:
meanRDD = (rdd
.mapValues(lambda x: (x, 1))
.reduceByKey(lambda x, y: (x[0]+y[0], x[1]+y[1]))
.mapValues(lambda x: x[0]/x[1]))
meanRDD.collect()
#Out[85]: [('key3', 5.5), ('key2', 5.0), ('key1', 3.3333333333333335)]
对于方差
,可以使用公式(sumOfSquares/count) - (sum/count)^2
,我们按以下方式翻译:
For the variance
, you can use the formula (sumOfSquares/count) - (sum/count)^2
,
which we translate in the following way:
varRDD = (rdd
.mapValues(lambda x: (1, x, x*x))
.reduceByKey(lambda x,y: (x[0]+y[0], x[1]+y[1], x[2]+y[2]))
.mapValues(lambda x: (x[2]/x[0] - (x[1]/x[0])**2)))
varRDD.collect()
#Out[106]: [('key3', 12.25), ('key2', 4.0), ('key1', 2.8888888888888875)]
我在虚拟数据中使用了 double
类型的值而不是 int
来准确说明计算平均值和方差:
I used values of type double
instead of int
in the dummy data to accurately illustrate computing the average and variance:
rdd = sc.parallelize([("key1", 1.0),
("key3", 9.0),
("key2", 3.0),
("key1", 4.0),
("key1", 5.0),
("key3", 2.0),
("key2", 7.0)])