且构网

Sharing all things programmer development...

Why does SparkSession execute twice for one action?

Updated: 2023-02-14 17:32:46

This happens because you don't provide a schema to the DataFrameReader. As a result, Spark has to eagerly scan the dataset to infer the output schema.

Since mappedRdd is not cached, it is evaluated twice:

  • once for schema inference
  • once when you call data.show
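
One way to observe the two passes is to count how often the map function runs. The following is only a sketch under assumptions: the sample JSON strings, the local SparkSession settings, the object name and the mapCalls accumulator are all hypothetical stand-ins and are not taken from the original question.

```scala
import org.apache.spark.sql.SparkSession

object DoubleEvaluationDemo {
  // Returns the number of map invocations counted after the read and after show().
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("double-evaluation-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical counter: updates made inside a transformation are applied
    // again every time the RDD is recomputed.
    val mapCalls = sc.longAccumulator("mapCalls")

    // Stand-in for the mappedRdd of the question: JSON strings produced by a map.
    val mappedRdd = sc
      .parallelize(Seq("""{"id": 1}""", """{"id": 2}"""), numSlices = 1)
      .map { line => mapCalls.add(1); line }

    // No schema is supplied, so Spark scans mappedRdd here to infer one.
    val data = spark.read.json(mappedRdd)
    val afterRead = mapCalls.sum

    // The action evaluates the uncached mappedRdd a second time.
    data.show()
    val afterShow = mapCalls.sum

    (afterRead, afterShow)
  }

  def main(args: Array[String]): Unit = {
    val (afterRead, afterShow) = run()
    println(s"map calls after read: $afterRead, after show: $afterShow")
  }
}
```

The counter should already be non-zero right after the read and grow again after show. Calling mappedRdd.cache() before the read is another way to avoid repeating the map work, at the cost of keeping the data in memory.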

To prevent this, provide a schema to the reader (Scala syntax):

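// Placeholder: replace ??? with the StructType that describes your JSON records.
val schema: org.apache.spark.sql.types.StructType = ???
// With an explicit schema the reader skips the inference pass over mappedRdd.
spark.read.schema(schema).json(mappedRdd)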
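
To make the placeholder concrete, here is a minimal sketch of what a filled-in schema might look like. The field names (id, name), the sample records, the object name and the mapCalls accumulator are assumptions made for illustration; the real schema depends on your data.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.types.{LongType, StringType, StructField, StructType}

object ExplicitSchemaDemo {
  // Returns the number of map invocations counted after the read and after show().
  def run(): (Long, Long) = {
    val spark = SparkSession.builder()
      .master("local[1]")
      .appName("explicit-schema-demo")
      .getOrCreate()
    val sc = spark.sparkContext

    // Hypothetical counter used only to show when mappedRdd is evaluated.
    val mapCalls = sc.longAccumulator("mapCalls")

    // Stand-in for the mappedRdd of the question.
    val mappedRdd = sc
      .parallelize(Seq("""{"id": 1, "name": "a"}""", """{"id": 2, "name": "b"}"""), numSlices = 1)
      .map { line => mapCalls.add(1); line }

    // Assumed schema for the sample records above.
    val schema = StructType(Seq(
      StructField("id", LongType, nullable = true),
      StructField("name", StringType, nullable = true)
    ))

    // The schema is given up front, so no job runs at read time.
    val data = spark.read.schema(schema).json(mappedRdd)
    val afterRead = mapCalls.sum

    // Only the action evaluates mappedRdd.
    data.show()
    val afterShow = mapCalls.sum

    (afterRead, afterShow)
  }

  def main(args: Array[String]): Unit = {
    val (afterRead, afterShow) = run()
    println(s"map calls after read: $afterRead, after show: $afterShow")
  }
}
```

In this sketch the counter stays at zero until show is called, which is the behavior the explicit schema is meant to achieve. Note that recent Spark versions mark json(RDD[String]) as deprecated in favor of the Dataset[String] overload; the RDD form is kept here to match the original snippet.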