Updated: 2023-11-18 14:40:34
This should be a better option than the two options mentioned above.
Since both datasets share a common key, you can do an inner join.
dataset2.join(dataset1, Seq("empid"), "inner").show()
You can also use the broadcast function/hint like this. It tells the framework that the small DataFrame (here dataset1) should be broadcast to every executor.
import org.apache.spark.sql.functions.broadcast
dataset2.join(broadcast(dataset1), Seq("empid"), "inner").show()
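To make the idea behind the hint a bit more concrete, here is a rough sketch in plain Scala collections of what a broadcast hash join does conceptually: build a lookup table from the small side once, then probe it for each row of the large side. This is not Spark's implementation, and the `Dept` and `Emp` row types, their fields other than `empid`, and the sample values are made up for the illustration.

```scala
// Conceptual sketch only: plain Scala collections, not Spark internals.
// Dept and Emp are hypothetical row types invented for this example.
final case class Dept(empid: Int, dept: String) // small side, like dataset1
final case class Emp(empid: Int, name: String)  // large side, like dataset2

object BroadcastJoinSketch {
  // Inner join on empid: hash the small side once, then probe it
  // for every row of the large side.
  def innerJoin(large: Seq[Emp], small: Seq[Dept]): Seq[(Int, String, String)] = {
    // groupBy keeps all rows when a key repeats on the small side
    val lookup: Map[Int, Seq[Dept]] = small.groupBy(_.empid)
    for {
      e <- large
      d <- lookup.getOrElse(e.empid, Seq.empty) // no match => row is dropped
    } yield (e.empid, e.name, d.dept)
  }

  def main(args: Array[String]): Unit = {
    val dataset1 = Seq(Dept(1, "eng"), Dept(2, "ops"))
    val dataset2 = Seq(Emp(1, "ann"), Emp(2, "bob"), Emp(3, "cy"))
    innerJoin(dataset2, dataset1).foreach(println)
  }
}
```

In this sketch the row with empid 3 has no match in the small side, so it is dropped, as in an inner join. In Spark the lookup table would be shipped to each executor, so the large side does not have to be shuffled.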
Also see these for more details:
DataFrame join optimization - Broadcast Hash Join: how broadcast joins work.
What is the maximum size for a broadcast object in Spark?