Updated: 2023-02-14 17:32:46
This happens because you don't provide a schema to the DataFrameReader. As a result, Spark has to eagerly scan the data set to infer the output schema.
Since mappedRdd is not cached, it will be evaluated twice: once for the schema-inference scan, and again when you call an action such as:
data.show
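If you cannot supply a schema up front, caching the RDD avoids recomputing it for both passes. This is a minimal sketch, not part of the original answer; it assumes an active SparkSession named spark and an existing mappedRdd of JSON strings:

```scala
import org.apache.spark.storage.StorageLevel

// Persist so the inference scan and the later action reuse the
// same computed partitions instead of re-evaluating the lineage.
val cachedRdd = mappedRdd.persist(StorageLevel.MEMORY_AND_DISK)

val data = spark.read.json(cachedRdd) // schema inference reads the cache
data.show()                           // the action hits the cache as well
```

The trade-off is memory/disk usage for the cached partitions, whereas providing an explicit schema (below) skips the extra pass entirely.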
If you want to prevent this, you should provide a schema for the reader (Scala syntax):
val schema: org.apache.spark.sql.types.StructType = ???
spark.read.schema(schema).json(mappedRdd)
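For a concrete illustration of what the `???` placeholder might look like, here is a hedged sketch of building a StructType by hand. The field names are assumptions for illustration only; substitute the actual structure of your JSON records:

```scala
import org.apache.spark.sql.types.{StructType, StructField, StringType, LongType}

// Hypothetical schema -- "id" and "name" are example fields, not from
// the original question. Match these to your real JSON structure.
val schema: StructType = StructType(Seq(
  StructField("id", LongType, nullable = false),
  StructField("name", StringType, nullable = true)
))

// With an explicit schema, Spark skips the eager inference scan,
// so mappedRdd is only evaluated when an action runs.
val data = spark.read.schema(schema).json(mappedRdd)
data.show()
```

Records that do not match the declared schema yield null fields (or are dropped/failed depending on the reader's `mode` option), so the schema should cover every field you intend to query.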