且构网

Scala Spark DataFrame explode is slow, so an alternative approach: create columns and rows from arrays in a column

Updated: 2023-11-18 20:59:52

Here is another simple example:

val ds = sc.parallelize(Seq((0, "Lorem ipsum dolor", 1.0, Array("prp1", "prp2", "prp3"))))

An alternative way to explode the array is to use flatMap, emitting one tuple per array element:

ds.flatMap { t =>
  t._4.map(prp => (t._1, t._2, t._3, prp))
}.collect.foreach(println)

Result:

(0,Lorem ipsum dolor,1.0,prp1)
(0,Lorem ipsum dolor,1.0,prp2)
(0,Lorem ipsum dolor,1.0,prp3)
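
The same flatMap pattern can be checked on a plain Scala collection, without a Spark context. The following is only a minimal sketch: the names `FlatMapSketch`, `Record`, and `explodeProps` are hypothetical and not part of the original answer, and the `Option(...)` guard for a null array is an added assumption about how missing arrays should be handled.

```scala
object FlatMapSketch {
  // Same tuple shape as the example above: (id, text, score, properties)
  type Record = (Int, String, Double, Array[String])

  // One output tuple per array element; a null or empty array yields no tuples
  def explodeProps(rows: Seq[Record]): Seq[(Int, String, Double, String)] =
    rows.flatMap { t =>
      Option(t._4).getOrElse(Array.empty[String]).toSeq
        .map(prp => (t._1, t._2, t._3, prp))
    }

  def main(args: Array[String]): Unit = {
    val rows: Seq[Record] = Seq(
      (0, "Lorem ipsum dolor", 1.0, Array("prp1", "prp2", "prp3")),
      (1, "no properties", 2.0, Array.empty[String]) // contributes no output rows
    )
    explodeProps(rows).foreach(println)
  }
}
```

Note that a record whose array is empty disappears from the output entirely. If such rows must be kept, the function would need to emit a placeholder tuple for them instead.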

I tried this with your dataset, but I am not sure it is the optimal way of doing it.

df1.show(false)

+---+---+------------------------------------------------+
|A  |B  |C                                               |
+---+---+------------------------------------------------+
|a  |1  |[[a, b, c, 0], [a1, b1, c1, 1], [a2, b2, c2, 2]]|
|b  |2  |[[a, b, c, 0]]                                  |
+---+---+------------------------------------------------+


import org.apache.spark.sql.Row

// One output tuple per struct in array column C, with the struct's fields flattened
df1.rdd.flatMap { t: Row =>
  t.getSeq[Row](2).map { r =>
    (t.getString(0), t.getString(1), r.getString(0), r.getString(1), r.getString(2), r.getInt(3))
  }
}.collect.foreach(println)

Result:

(a,1,a,b,c,0)
(a,1,a1,b1,c1,1)
(a,1,a2,b2,c2,2)
(b,2,a,b,c,0)  
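
The flattening logic above can also be written as a pure function over case classes, which avoids positional `Row` getters and can be tested without Spark. This is a sketch under stated assumptions: the case classes `Nested`, `Parent`, and `FlatRow` and their field names are hypothetical, since the original only shows the column names A, B, and C and not the names of the struct fields.

```scala
// Hypothetical case classes mirroring the columns shown by df1.show(false)
final case class Nested(c1: String, c2: String, c3: String, n: Int)
final case class Parent(a: String, b: String, c: Seq[Nested])
final case class FlatRow(a: String, b: String, c1: String, c2: String, c3: String, n: Int)

object FlattenSketch {
  // One FlatRow per nested element, repeating the parent's a and b values
  def flatten(p: Parent): Seq[FlatRow] =
    p.c.map(i => FlatRow(p.a, p.b, i.c1, i.c2, i.c3, i.n))

  def main(args: Array[String]): Unit = {
    val parents = Seq(
      Parent("a", "1", Seq(Nested("a", "b", "c", 0), Nested("a1", "b1", "c1", 1), Nested("a2", "b2", "c2", 2))),
      Parent("b", "2", Seq(Nested("a", "b", "c", 0)))
    )
    parents.flatMap(p => flatten(p)).foreach(println)
  }
}
```

Assuming a typed `Dataset[Parent]` and `import spark.implicits._` in scope, passing `p => FlattenSketch.flatten(p)` to the Dataset's `flatMap` should give the equivalent result in Spark. That step was not run against the original `df1`, whose column and struct field names would have to match the case classes.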

Hope this helps!