Spark DataFrame: how to efficiently split a DataFrame into groups based on the same column value

Updated: 2023-11-18 23:18:22

As noted in my comments, one potentially easy approach to this problem would be to use:

df.write.partitionBy("hour").saveAsTable("myparquet")

As noted, the folder structure would be myparquet/hour=1, myparquet/hour=2, ..., myparquet/hour=24 as opposed to myparquet/1, myparquet/2, ..., myparquet/24.
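The Hive-style layout described above can be sketched without a Spark cluster. The following pure-Python sketch (the rows and the myparquet name are illustrative, not from the original question) builds the same "hour=<value>" directory structure that df.write.partitionBy("hour").saveAsTable("myparquet") produces:

```python
import tempfile
from pathlib import Path

def hive_partition_dirs(rows, key):
    """Return the Hive-style partition directory names partitionBy would create."""
    return sorted({f"{key}={row[key]}" for row in rows})

# Illustrative rows; partitionBy creates one sub-directory per distinct hour.
rows = [{"hour": 1, "event": "a"},
        {"hour": 2, "event": "b"},
        {"hour": 24, "event": "c"}]

base = Path(tempfile.mkdtemp()) / "myparquet"
for d in hive_partition_dirs(rows, "hour"):
    (base / d).mkdir(parents=True, exist_ok=True)

print(sorted(p.name for p in base.iterdir()))
# → ['hour=1', 'hour=2', 'hour=24']
```

The key=value naming is what lets Spark recover the partition column when reading the table back.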

To change the folder structure, you could

  1. Potentially use the Hive configuration setting hcat.dynamic.partitioning.custom.pattern within an explicit HiveContext; more information at HCatalog DynamicPartitions.
  2. Another approach would be to change the file system directly after you have executed the df.write.partitionBy.saveAsTable(...) command, with something like for f in *; do mv $f ${f/${f:0:5}/}; done, which strips the five-character hour= prefix from each folder name.
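The rename in step 2 can be sketched in pure Python as well (a minimal sketch; the scratch directory and hour values are illustrative). It does the same thing as the shell substitution ${f/${f:0:5}/}: drop the first five characters ("hour=") of each folder name:

```python
import tempfile
from pathlib import Path

# Build a few Hive-style partition folders in a scratch directory.
base = Path(tempfile.mkdtemp()) / "myparquet"
for h in (1, 2, 24):
    (base / f"hour={h}").mkdir(parents=True)

# Strip the 5-character "hour=" prefix, like `mv $f ${f/${f:0:5}/}`.
for d in base.iterdir():
    d.rename(d.with_name(d.name[5:]))  # "hour=24" -> "24"

print(sorted(p.name for p in base.iterdir()))
# → ['1', '2', '24']
```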

It is important to note that by changing the naming pattern for the folders, when you run spark.read.parquet(...) on that folder, Spark will not automatically discover the dynamic partitions, since the partition-key (i.e. hour) information is missing from the paths.
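One way around the lost partition key is to re-attach it yourself when reading the renamed folders back. This is a sketch under the assumptions above (folders renamed to plain "1", "2", ..., 24 hourly partitions); the PySpark lines in the comment assume a SparkSession named spark:

```python
from pathlib import Path

def hour_from_path(path):
    """Recover the hour partition value from a renamed folder path."""
    name = Path(path).name
    # With Hive-style names this would be name.split("=")[1];
    # after the rename, the folder name *is* the value.
    return int(name)

# In PySpark the same idea could look like (sketch, not from the original answer):
#   from pyspark.sql.functions import lit
#   for hour in range(1, 25):
#       df = spark.read.parquet(f"myparquet/{hour}").withColumn("hour", lit(hour))

print(hour_from_path("myparquet/24"))  # → 24
```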