spark.sql.files.maxPartitionBytes does not limit the maximum size of written partitions

Updated: 2023-10-20 15:27:16


The setting spark.sql.files.maxPartitionBytes does indeed affect the maximum size of the partitions when reading data on the Spark cluster. If the final output files are too large, I suggest decreasing this setting's value; Spark should then create more files, because the input data will be distributed across more partitions. However, this will not hold if your query contains a shuffle, because the data will then always be repartitioned into the number of partitions given by the spark.sql.shuffle.partitions setting.
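To make the two settings concrete, here is a minimal PySpark sketch; the app name, 64 MB value, column names, and paths are all illustrative assumptions, not part of the original answer:

```python
from pyspark.sql import SparkSession

# Smaller read-partition size => input split across more partitions
# => more (and smaller) output files, as long as no shuffle intervenes.
spark = (
    SparkSession.builder
    .appName("partition-size-demo")                                   # hypothetical app name
    .config("spark.sql.files.maxPartitionBytes", 64 * 1024 * 1024)    # default is 128 MB
    .config("spark.sql.shuffle.partitions", 200)                      # partition count after a shuffle
    .getOrCreate()
)

# Reading honors maxPartitionBytes: each input split here is at most ~64 MB.
df = spark.read.parquet("/data/input")  # hypothetical input path

# A narrow transformation keeps the read-time partitioning,
# so the number of output files tracks the input splits.
df.filter(df["value"] > 0).write.mode("overwrite").parquet("/data/output")

# A wide operation (groupBy, join, ...) triggers a shuffle, after which the data
# lives in spark.sql.shuffle.partitions partitions, regardless of maxPartitionBytes.
df.groupBy("key").count().write.mode("overwrite").parquet("/data/aggregated")
```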


Also, the final size of your files will depend on the file format and compression you use. So if you output the data to, for example, Parquet, the files will be much smaller than when outputting to CSV or JSON.
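Continuing with the df from the sketch above, a quick way to compare formats; the codecs and output paths are again just assumptions:

```python
# Parquet is columnar and compressed by default, so it is typically
# much smaller on disk than row-based text formats like CSV or JSON.
df.write.mode("overwrite").option("compression", "snappy").parquet("/data/out_parquet")
df.write.mode("overwrite").option("compression", "gzip").csv("/data/out_csv")
df.write.mode("overwrite").json("/data/out_json")
```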