且构网

Operating on DataFrames inside Spark UDFs

Updated: 2023-11-18 23:31:16

You can't use Dataset operations inside a UDF. A UDF can only operate on existing columns and produce a single result column; it cannot filter a Dataset or perform aggregations, though it can be used inside a filter. A UDAF, by contrast, can aggregate values.
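As a minimal sketch of this distinction (the column names, data, and `isAdult` function are illustrative assumptions, not from the original answer), a UDF transforms column values row by row and can be used inside `filter`, but its body must not touch the Dataset itself:

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.udf

val spark = SparkSession.builder().master("local[*]").appName("udf-demo").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 34), ("bob", 17)).toDF("name", "age")

// A UDF sees only the values passed to it and returns one result per row.
val isAdult = udf((age: Int) => age >= 18)

// Legal: applying the UDF inside filter.
val adults = df.filter(isAdult($"age"))

// Not legal: calling df.count(), df.filter(...), or any other Dataset
// operation from inside the UDF body would fail at runtime.
```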

Instead, you can use .as[SomeCaseClass] to turn the DataFrame into a Dataset and use ordinary, strongly typed functions inside filter, map, and reduce.
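A hedged sketch of the typed alternative (the `Person` case class and sample data are assumptions for illustration):

```scala
import org.apache.spark.sql.SparkSession

case class Person(name: String, age: Int)

val spark = SparkSession.builder().master("local[*]").appName("typed-demo").getOrCreate()
import spark.implicits._

val df = Seq(("alice", 34), ("bob", 17)).toDF("name", "age")

// Convert the untyped DataFrame into a strongly typed Dataset[Person].
val people = df.as[Person]

// Ordinary Scala functions now work inside filter / map / reduce.
val totalAdultAge = people
  .filter(p => p.age >= 18)
  .map(_.age)
  .reduce(_ + _)
```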

If you want to join your bigDF with every small DataFrame in a smallDFs list, you can do:

import org.apache.spark.sql.functions._

val bigDF = // some processing
val smallDFs = Seq(someSmallDF1, someSmallDF2)

// Fold over the small DataFrames, joining each one onto the accumulated
// result with a broadcast hint on the small side.
val joined = smallDFs.foldLeft(bigDF)((acc, df) => acc.join(broadcast(df), "join_column"))

broadcast is a function that adds a broadcast hint to a small DataFrame, so that the join with the small DataFrame uses the more efficient Broadcast Hash Join instead of a Sort Merge Join.
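To check that the hint actually took effect, you can inspect the physical plan of the joined result from the snippet above; a broadcast join appears as BroadcastHashJoin:

```scala
// The physical plan should contain "BroadcastHashJoin"
// rather than "SortMergeJoin".
joined.explain()

// Spark also auto-broadcasts tables below this size threshold
// (default 10 MB); broadcast() forces the hint regardless of size.
spark.conf.get("spark.sql.autoBroadcastJoinThreshold")
```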