更新时间:2023-08-28 16:32:58
Well there are some bugs here (the first issue looks like related to to the same problem as SPARK-6231) and JIRA looks like a good idea, but SUBTRACT
/ EXCEPT
is no the right choice for partial matches.
相反,从 Spark 2.0 开始,您可以使用反连接:
Instead, as of Spark 2.0, you can use anti-join:
df1.join(df1_with_df2, ["name"], "leftanti").show()
在1.6中,您可以使用标准外部联接执行几乎相同的操作:
In 1.6 you can do pretty much the same thing with standard outer join:
import pyspark.sql.functions as F
ref = df1_with_df2.select("name").alias("ref")
(df1
.join(ref, ref.name == df1.name, "leftouter")
.filter(F.isnull("ref.name"))
.drop(F.col("ref.name")))