且构网

Spark Scala DataFrame: merging multiple columns into a single column

Updated: 2023-02-02 22:07:01

Your expected output doesn't seem to match your stated requirement of producing a list of name-value structured objects. If I understand the requirement correctly, consider using foldLeft to convert each wanted column, one at a time, into a StructType name-value column, and then group those columns into a single ArrayType column:

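import org.apache.spark.sql.functions._
// toDF and the $"..." syntax come from spark.implicits._; spark-shell imports it
// automatically, elsewhere import it from an existing SparkSession named spark
import spark.implicits._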

val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone"),
  (3, "horse", "gun"),
  (4, "horse", "some")
).toDF("id", "animal", "talk")

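// Every column except the key column "id" is merged
val cols = df.columns.filter(_ != "id")

// Replace each column in turn with a struct(name, value),
// then collect the resulting struct columns into one array column
val resultDF = cols.
  foldLeft(df)( (accDF, c) =>
    accDF.withColumn(c, struct(lit(c).as("name"), col(c).as("value")))
  ).
  select($"id", array(cols.map(col): _*).as("merged"))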

resultDF.show(false)
// +---+-----------------------------+
// |id |merged                       |
// +---+-----------------------------+
// |1  |[[animal,bat], [talk,done]]  |
// |2  |[[animal,mouse], [talk,mone]]|
// |3  |[[animal,horse], [talk,gun]] |
// |4  |[[animal,horse], [talk,some]]|
// +---+-----------------------------+

resultDF.printSchema
// root
//  |-- id: integer (nullable = false)
//  |-- merged: array (nullable = false)
//  |    |-- element: struct (containsNull = false)
//  |    |    |-- name: string (nullable = false)
//  |    |    |-- value: string (nullable = true)
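
As a variation on the foldLeft approach above, the same struct expressions can probably be built up front and passed to a single select, which avoids overwriting the original columns along the way. This is only a sketch, not part of the original answer: the local SparkSession, the app name, and the names pairs and mergedDF are assumptions made for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{array, col, lit, struct}

// Local session for illustration only (assumption); in spark-shell `spark` already exists.
val spark = SparkSession.builder()
  .master("local[*]")
  .appName("merge-columns-sketch")
  .getOrCreate()
import spark.implicits._

val df = Seq(
  (1, "bat", "done"),
  (2, "mouse", "mone")
).toDF("id", "animal", "talk")

// Columns to merge: everything except the key column
val cols = df.columns.filter(_ != "id")

// One struct(name, value) expression per column; df itself is left unchanged
val pairs = cols.map(c => struct(lit(c).as("name"), col(c).as("value")))

// A single select wraps all structs into one array column
val mergedDF = df.select(col("id"), array(pairs: _*).as("merged"))

mergedDF.printSchema()
```

Both versions rely on the merged columns sharing one value type, since array expects its elements to have the same type. If the columns differ (say, an integer next to a string), casting each value first, for example col(c).cast("string"), is one likely way around it.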