且构网

How to get data out of a wrapped array in Apache Spark/Scala

Updated: 2023-11-18 22:34:10

Assuming the data is an array of strings like this:

val df = Seq(Seq("1", "5DC7F285-052B-4739-8DC3-62827014A4CD", "1", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "GAVIN", "STLAWRENCE", "M", "9"),
    Seq("2", "17C0D0ED-0B12-477B-8A23-1ED2C49AB8AF", "2", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LEVI", "ST LAWRENCE", "M", "9"),
    Seq("3", "53E20DA8-8384-4EC1-A9C4-071EC2ADA701", "3", "1425450997", "714909", "1425450997", "714909", "{}", "2013", "LOGAN", "NEW YORK", "M", "44"))
  .toDF("array")

You can either use a UDF that returns a case class or call withColumn several times. The latter should be more efficient, since it relies only on built-in column expressions that Spark can optimize, and can be done like this:

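import org.apache.spark.sql.types.IntegerType

// Name is not defined in the original answer; this definition is assumed from the column names
case class Name(year: Int, first_name: String, county: String, sex: String, count: Int)

// Pick elements 8 to 12 of the array by index, casting the numeric ones to Int
val df2 = df.withColumn("year", $"array"(8).cast(IntegerType))
  .withColumn("first_name", $"array"(9))
  .withColumn("county", $"array"(10))
  .withColumn("sex", $"array"(11))
  .withColumn("count", $"array"(12).cast(IntegerType))
  .drop($"array")
  .as[Name]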

This gives you a Dataset[Name]:

+----+----------+-----------+---+-----+
|year|first_name|county     |sex|count|
+----+----------+-----------+---+-----+
|2013|GAVIN     |STLAWRENCE |M  |9    |
|2013|LEVI      |ST LAWRENCE|M  |9    |
|2013|LOGAN     |NEW YORK   |M  |44   |
+----+----------+-----------+---+-----+
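
For comparison, the UDF alternative mentioned above could look roughly like the sketch below. It is only an illustration under assumptions: the `Name` case class, the `UdfSketch` object, the `toName` helper and the local `SparkSession` setup are not part of the original answer, and the sketch assumes every array has at least 13 elements with numeric strings at positions 8 and 12.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

// Assumed target type; field names mirror the columns used in the answer.
case class Name(year: Int, first_name: String, county: String, sex: String, count: Int)

object UdfSketch {
  // Hypothetical helper: maps positions 8..12 of one array to a Name.
  // Throws if the array is too short or the numeric fields do not parse.
  def toName(a: Seq[String]): Name =
    Name(a(8).toInt, a(9), a(10), a(11), a(12).toInt)

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq(
      Seq("1", "5DC7F285-052B-4739-8DC3-62827014A4CD", "1", "1425450997", "714909",
        "1425450997", "714909", "{}", "2013", "GAVIN", "STLAWRENCE", "M", "9")
    ).toDF("array")

    // Wrap the helper as a UDF; Spark represents the returned case class as a struct column.
    val toNameUdf = udf(toName _)

    // Expand the struct into top-level columns, then view the rows as Name.
    val ds = df.select(toNameUdf($"array").as("n")).select("n.*").as[Name]
    ds.show(false)

    spark.stop()
  }
}
```

Unlike the withColumn version, this runs arbitrary Scala code per row, which Spark cannot optimize, so it is mainly worth considering when the extraction logic is too involved for built-in column expressions.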

Hope this helps!