Updated: 2023-11-18 16:33:22
You are of course right. The quote is just incorrect. filter does preserve partitioning (for the reason you've already described), and it is trivial to confirm that:
val rdd = sc.range(0, 10).map(x => (x % 3, None)).partitionBy(
new org.apache.spark.HashPartitioner(11)
)
rdd.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
val filteredRDD = rdd.filter(_._1 == 1)  // keys are x % 3, i.e. 0, 1 or 2
filteredRDD.partitioner
// Option[org.apache.spark.Partitioner] = Some(org.apache.spark.HashPartitioner@b)
rdd.partitioner == filteredRDD.partitioner
// Boolean = true
This is in contrast to operations like map, which don't preserve partitioning (Partitioner):
rdd.map(identity _).partitioner
// Option[org.apache.spark.Partitioner] = None
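A related point worth noting: mapValues and flatMapValues do preserve the partitioner, because they are guaranteed not to touch the keys, whereas a plain map loses it even when it happens to leave the keys unchanged. A standalone sketch (the local master, app name, and partition counts here are arbitrary choices for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext, HashPartitioner}

// Standalone setup mirroring the RDD used above.
val sc = new SparkContext(
  new SparkConf().setMaster("local[2]").setAppName("partitioner-demo"))
val rdd = sc.range(0, 10).map(x => (x % 3, None)).partitionBy(
  new HashPartitioner(11))

// mapValues cannot change keys, so Spark keeps the existing partitioner.
val mapped = rdd.mapValues(_ => 42)

// A plain map discards the partitioner even when the keys are left
// untouched, because Spark cannot prove the keys are unchanged.
val remapped = rdd.map { case (k, v) => (k, v) }

println(mapped.partitioner)   // preserved
println(remapped.partitioner) // discarded
```

This is why mapValues is preferred over map on pair RDDs whenever the keys are not modified: any subsequent key-based operation (join, reduceByKey, etc.) can then skip a shuffle.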
Datasets are a bit more subtle, as filters are normally pushed down, but overall the behavior is similar.
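One way to see the pushdown is to inspect the physical plan. A minimal sketch, assuming a local Spark session; the temp path, column names, and sample data are made up for illustration:

```scala
import org.apache.spark.sql.SparkSession

// Standalone session; master and app name are arbitrary.
val spark = SparkSession.builder()
  .master("local[2]").appName("pushdown-demo").getOrCreate()
import spark.implicits._

// Write a tiny Parquet file, then read it back and filter.
val path = java.nio.file.Files.createTempDirectory("pushdown-demo")
  .resolve("data").toString
Seq((1L, "a"), (2L, "b"), (3L, "c")).toDF("id", "value").write.parquet(path)

val filtered = spark.read.parquet(path).filter($"id" > 1)

// The physical plan records the predicate pushed into the Parquet scan,
// e.g. a "PushedFilters: [...]" entry on the FileScan node, rather than
// evaluating the filter purely as a separate step over all rows.
val plan = filtered.queryExecution.executedPlan.toString
filtered.explain()

spark.stop()
```

Because the filter runs inside the scan, row groups whose statistics rule out the predicate can be skipped entirely, which is the sense in which Dataset filters are "more subtle" than the RDD case.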