Why doesn't filter preserve partitioning?


You are of course right; the quote is simply incorrect. filter does preserve partitioning (for the reason you've already described), and it is trivial to confirm:

// Build a pair RDD keyed by x % 3 and hash-partition it into 11 partitions
val rdd = sc.range(0, 10).map(x => (x % 3, None)).partitionBy(
  new org.apache.spark.HashPartitioner(11)
)

rdd.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)

val filteredRDD = rdd.filter(_._1 == 2) // keys here are 0, 1 and 2, so filter on one that exists
filteredRDD.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)

rdd.partitioner == filteredRDD.partitioner
// Boolean = true
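
Because the Partitioner survives the filter, later key-based operations can exploit it. A minimal sketch using lookup (a standard PairRDDFunctions method): when an RDD has a known partitioner, lookup only scans the single partition the key hashes to, rather than the whole RDD.

filteredRDD.lookup(2)
// Returns the None values stored under key 2 (from x = 2, 5 and 8);
// only the one partition that key 2 hashes to is scanned.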

This contrasts with operations like map, which don't preserve partitioning (Partitioner):

rdd.map(identity _).partitioner
// Option[org.apache.spark.Partitioner] = None
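
When only the values change, mapValues and flatMapValues do keep the partitioner, because they cannot touch the keys; mapPartitions likewise takes a preservesPartitioning flag for the case where you guarantee the keys are left intact. A quick sketch with the same rdd:

rdd.mapValues(identity).partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)

rdd.mapPartitions(iter => iter, preservesPartitioning = true).partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)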

Datasets are a bit more subtle, as filters are normally pushed down to the data source, but overall the behavior is similar.
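
A minimal sketch of that push-down, assuming a spark-shell session and a hypothetical Parquet file at /tmp/pairs.parquet (the exact plan text varies across Spark versions):

import org.apache.spark.sql.functions.col

val ds = spark.read.parquet("/tmp/pairs.parquet")  // hypothetical input
ds.filter(col("key") === 2).explain()
// The physical plan typically lists the predicate under PushedFilters,
// e.g. PushedFilters: [IsNotNull(key), EqualTo(key,2)], meaning the
// Parquet reader can skip row groups that cannot contain key = 2.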