Updated: 2023-11-18 18:45:16
This is related to Spark's Tungsten project, which optimizes memory and CPU usage. Calling repartition on a column applies hash partitioning, which triggers a shuffle. By default spark.sql.shuffle.partitions is set to 200, so the result ends up with 200 partitions. You can verify this by calling explain on your DataFrame before and after repartitioning:
myDF.explain()                                // physical plan before repartitioning
val repartitionedDF = myDF.repartition($"x")  // hash-partitions rows by column x
repartitionedDF.explain()                     // plan now shows Exchange hashpartitioning(x, 200)
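To see why every row with the same x lands in the same partition, the routing can be sketched in plain Scala. This is a conceptual sketch, not Spark's actual implementation: Spark SQL uses a Murmur3-based hash, but the final step is the same idea of a non-negative modulo against the partition count (names like HashPartitionSketch and nonNegativeMod here are illustrative).

```scala
object HashPartitionSketch {
  // Keep the result in [0, mod) even when the hash is negative,
  // mirroring how a hash partitioner maps a key to a partition id.
  def nonNegativeMod(x: Int, mod: Int): Int = {
    val raw = x % mod
    if (raw < 0) raw + mod else raw
  }

  def main(args: Array[String]): Unit = {
    val numPartitions = 200 // default value of spark.sql.shuffle.partitions
    val keys = Seq("a", "b", "a", "c")
    keys.foreach { k =>
      println(s"$k -> partition ${nonNegativeMod(k.hashCode, numPartitions)}")
    }
    // Identical keys always hash to the identical partition id,
    // which is what makes the post-shuffle layout deterministic per key.
  }
}
```

If 200 partitions is too many or too few for your data size, you can change it with spark.conf.set("spark.sql.shuffle.partitions", n) before the shuffle runs.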