更新时间:2023-11-28 19:00:34
你可以使用join
:
val usersToKeep = sc.parallelize(
listOfUsersToKeep.map(Tuple1(_))).toDF("userID_")
val finalDataFrame = usersToKeep
.join(initialDataFrame, $"userID" === $"userID_")
.drop("userID_")
或广播变量和 UDF:
or a broadcast variable and an UDF:
import org.apache.spark.sql.functions.udf
val usersToKeepBD = sc.broadcast(listOfUsersToKeep.toSet)
val checkUser = udf((id: Long) => usersToKeepBD.value.contains(id))
val finalDataFrame = initialDataFrame.where(checkUser($"userID"))
也应该可以广播数据帧:
It should be also possible to broadcast a DataFrame:
import org.apache.spark.sql.functions.broadcast
initialDataFrame.join(broadcast(usersToKeep), $"userID" === $"userID_")