Spark pushes down the filter even if the column is not in the Dataframe

Updated: 2023-11-18 16:42:10

The documentation on the Dataset/Dataframe describes the reason for what you are observing quite well:

"数据集是懒惰的",即只有在调用操作时才会触发计算.在内部,数据集代表一个逻辑计划,描述生成数据所需的计算.当调用一个操作时,Spark 的查询优化器会优化逻辑计划并生成一个物理计划,以便并行和高效地执行分布式方式."

"Datasets are "lazy", i.e. computations are only triggered when an action is invoked. Internally, a Dataset represents a logical plan that describes the computation required to produce the data. When an action is invoked, Spark's query optimizer optimizes the logical plan and generates a physical plan for efficient execution in a parallel and distributed manner. "

The important part is the last sentence. When you apply select and filter statements, they are just added to a logical plan, which Spark only parses when an action is applied. When parsing this full logical plan, the Catalyst Optimizer looks at the whole plan, and one of its optimization rules is to push down filters, which is what you see in your example.
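
To make this visible, here is a minimal sketch; the Parquet path and the column names ("id", "name", "age") are assumptions made up for illustration. The transformations only build a logical plan, and explain() then prints the physical plan Catalyst produced, where the filter on the dropped column typically shows up as a pushed filter.

```scala
// A minimal, self-contained sketch. The Parquet path and the column names
// ("id", "name", "age") are assumptions made up for illustration.
import org.apache.spark.sql.SparkSession

object FilterPushdownSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("filter-pushdown-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // These transformations only build up a logical plan; no data is read yet.
    val df = spark.read.parquet("data/events.parquet")
      .filter($"age" > 21)     // filter on a column ...
      .select("id", "name")    // ... that the final projection then drops

    // explain() triggers planning (not a full job) and prints the plans.
    // With a Parquet source the condition typically shows up under
    // PushedFilters in the physical plan, even though "age" is not
    // part of the output columns.
    df.explain(true)

    spark.stop()
  }
}
```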

I think this is a great feature. Even though you are not interested in seeing this particular field in your final Dataframe, Spark understands that you are not interested in some of the original data.

That is the main benefit of the Spark SQL engine as opposed to RDDs. It understands what you are trying to do without having to be told how to do it.
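
As a rough illustration of that contrast, under the same assumptions as the sketch above: the Dataframe version exposes the predicate as an expression Catalyst can analyze and push down, while the RDD version hides it inside a Scala closure that Spark cannot look into.

```scala
// A rough sketch of the same query written both ways. The file, column
// names and column types are the same hypothetical ones as above.
import org.apache.spark.sql.{Row, SparkSession}

object SqlVsRddSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("sql-vs-rdd-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Declarative: Catalyst sees the predicate and can push it into the scan.
    val viaSql = spark.read.parquet("data/events.parquet")
      .filter($"age" > 21)
      .select("id", "name")
    viaSql.explain() // the physical plan shows the pushed filter

    // Imperative: the filter is an opaque Scala function, so Spark has to
    // read and deserialize every row before applying it.
    val viaRdd = spark.read.parquet("data/events.parquet")
      .rdd
      .filter((row: Row) => row.getAs[Int]("age") > 21)
      .map(row => (row.getAs[Long]("id"), row.getAs[String]("name")))
    println(viaRdd.toDebugString) // lineage only; no pushdown information

    spark.stop()
  }
}
```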