Updated: 2023-11-18 23:05:52
Here is what you can do:
import org.apache.spark.sql.functions._
import spark.implicits._  // needed for toDF and the $ column syntax

// Create a DataFrame with demo data
val df = spark.sparkContext.parallelize(Seq(
  (1, "Fname1", "Lname1", "Belarus"),
  (2, "Fname2", "Lname2", "Belgium"),
  (3, "Fname3", "Lname3", "Austria"),
  (4, "Fname4", "Lname4", "Australia")
)).toDF("id", "fname", "lname", "country")

// Create a new column holding the first letter of the country
val result = df.withColumn("countryFirst", split($"country", "")(0))

// Save the data, partitioned by the first letter of the country
// ("csv" is the built-in source since Spark 2.0, replacing "com.databricks.spark.csv")
result.write.partitionBy("countryFirst").format("csv").save("outputpath")
Edited: You can also use substring, which can improve performance, as suggested by Raphel:
substring(Column str, int pos, int len)
Substring starts at pos and is of length len when str is String type, or returns the slice of the byte array that starts at pos (in bytes) and is of length len when str is Binary type. Note that pos is 1-based in Spark SQL.
val result = df.withColumn("firstCountry", substring($"country",1,1))
and then use partitionBy with write.
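Putting those two steps together, a minimal sketch might look like the following (it reuses the demo DataFrame `df` from above; the column name `firstCountry` and the path `outputpath` are just placeholders carried over from the snippet):

```scala
import org.apache.spark.sql.functions._

// Derive the partition column with substring (1-based position, length 1)
val result = df.withColumn("firstCountry", substring($"country", 1, 1))

// Write the data, with one output directory per first letter,
// e.g. outputpath/firstCountry=A/, outputpath/firstCountry=B/, ...
result.write.partitionBy("firstCountry").format("csv").save("outputpath")
```

Note that the partition column itself is not written into the CSV files; it is encoded in the directory names and recovered when the data is read back with `spark.read`.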
Hope this solves your problem!