
How to partition a Spark DataFrame based on a column value?

Updated: 2023-11-18 23:05:52

Here is what you can do:

import org.apache.spark.sql.functions._
import spark.implicits._  // needed for $"..." and toDF (predefined in spark-shell)

// create a DataFrame with demo data
val df = spark.sparkContext.parallelize(Seq(
  (1, "Fname1", "Lname1", "Belarus"),
  (2, "Fname2", "Lname2", "Belgium"),
  (3, "Fname3", "Lname3", "Austria"),
  (4, "Fname4", "Lname4", "Australia")
)).toDF("id", "fname", "lname", "country")

// create a new column holding the first letter of the country column
val result = df.withColumn("countryFirst", split($"country", "")(0))

// save the data, partitioned by the first letter of country
// ("csv" is the built-in source since Spark 2.0; "com.databricks.spark.csv"
// was the external package name for earlier versions)
result.write.partitionBy("countryFirst").format("csv").save("outputpath")
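As a side note, here is a minimal sketch of reading the partitioned output back, assuming the same `outputpath` as above: Spark recovers `countryFirst` from the `countryFirst=*` directory names, so filters on it can prune partitions instead of scanning every file.

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("read-partitioned").getOrCreate()
import spark.implicits._

// Spark discovers the countryFirst=* subdirectories and exposes
// countryFirst as a regular column of the loaded DataFrame.
val readBack = spark.read.format("csv").load("outputpath")

// Filtering on the partition column prunes whole directories.
readBack.filter($"countryFirst" === "B").show()
```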

Edited: You can also use substring, which can improve performance, as suggested by Raphel:

substring(Column str, int pos, int len): the substring starts at pos and is of length len when str is String type, or returns the slice of the byte array that starts at pos (in bytes) and is of length len when str is Binary type.

val result = df.withColumn("firstCountry", substring($"country",1,1))

and then use partitionBy with write.
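Putting that edit together, a minimal sketch of the substring variant end to end, using the same `df`, column, and output path as the answer above:

```scala
import org.apache.spark.sql.functions._

// SQL-style substring is 1-based: substring($"country", 1, 1)
// takes the first character of country.
val result = df.withColumn("firstCountry", substring($"country", 1, 1))

// Partition the CSV output by the derived first-letter column.
result.write
  .partitionBy("firstCountry")
  .format("csv")
  .save("outputpath")
```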

Hope this solves your problem!