
Handling a Hive lookup table in Spark, and Spark broadcast variables

Updated: 2023-11-18 14:40:34

This should be a better option than the two you mentioned.

Since the two datasets share a common key (empid), you can do an inner join:

dataset2.join(dataset1, Seq("empid"), "inner").show()

You can also use the broadcast function/hint, as shown below. This tells the framework that the small DataFrame (here dataset1) should be broadcast to every executor.

import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
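
For reference, here is a minimal, self-contained sketch of the broadcast join above. It assumes Spark 2.x or later running in local mode. Only the empid key comes from the answer; the other column names, the sample rows, and the helper name joinWithLookup are invented for illustration. In a real job dataset1 would be read from the Hive lookup table rather than built inline.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.broadcast

object BroadcastJoinSketch {
  // Inner-join a large DataFrame with a small lookup DataFrame on "empid",
  // hinting that the lookup side should be broadcast to every executor.
  // (joinWithLookup is a hypothetical helper, not part of the original answer.)
  def joinWithLookup(large: DataFrame, lookup: DataFrame): DataFrame =
    large.join(broadcast(lookup), Seq("empid"), "inner")

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-join-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()
    import spark.implicits._

    // Hypothetical sample data standing in for the lookup table and the large table.
    val dataset1 = Seq((1, "HR"), (2, "Engineering")).toDF("empid", "dept")
    val dataset2 = Seq((1, "Alice"), (2, "Bob"), (3, "Carol")).toDF("empid", "name")

    val joined = joinWithLookup(dataset2, dataset1)
    joined.show()    // empid 3 has no match in dataset1, so the inner join drops it
    joined.explain() // the physical plan should contain a BroadcastHashJoin node

    spark.stop()
  }
}
```

Checking the output of explain() is a quick way to confirm that the hint was honored before running the join on full-size data.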

See also the following for more details:

  • DataFrame join optimization - Broadcast Hash Join (explains how broadcast joins work)

  • What is the maximum size for a broadcast object in Spark?
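
Related to the size question above: Spark SQL can also broadcast the smaller side of a join automatically when its estimated size is below the spark.sql.autoBroadcastJoinThreshold setting. The sketch below shows one way to inspect and adjust that setting at runtime. It assumes a local SparkSession, and the 50 MB value is an arbitrary example, not a recommendation.

```scala
import org.apache.spark.sql.SparkSession

object BroadcastThresholdSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("broadcast-threshold-sketch")
      .master("local[*]") // assumption: local mode, for illustration only
      .getOrCreate()

    // Print the threshold currently in effect for automatic broadcast joins.
    println(spark.conf.get("spark.sql.autoBroadcastJoinThreshold"))

    // Raise the threshold to 50 MB (hypothetical value, expressed in bytes).
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", 50L * 1024 * 1024)

    // Setting -1 turns automatic broadcasting off; an explicit broadcast()
    // hint in the query is still honored.
    spark.conf.set("spark.sql.autoBroadcastJoinThreshold", -1L)

    spark.stop()
  }
}
```

Whatever the threshold, the broadcast table has to fit in memory on the driver and on each executor, so the hint is best kept for lookup tables that are genuinely small.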