Difference between na().drop() and filter(col.isNotNull) (Apache Spark)



Is there any difference in semantics between df.na().drop() and df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull() && !df.col("onlyColumnInOneColumnDataFrame").isNaN()), where df is an Apache Spark DataFrame?

Or should I consider it a bug if the first one does NOT return null afterwards (not a String "null", but simply a null value) in the column onlyColumnInOneColumnDataFrame while the second one does?

EDIT: added !isNaN() as well. onlyColumnInOneColumnDataFrame is the only column in the given DataFrame. Let's say its type is Integer.

With df.na.drop() you drop every row that contains a null or NaN value in any column.
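A minimal Scala sketch of that behavior (the two-column DataFrame and the second column named "other" are assumptions added purely to make the difference visible; the question's DataFrame has a single Integer column):

```scala
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().master("local[*]").appName("na-drop-demo").getOrCreate()
import spark.implicits._

// Hypothetical two-column DataFrame; the "other" column is invented
// here so that nulls/NaNs outside the target column exist at all.
val df = Seq(
  (Some(1), Some(10.0)),
  (None,    Some(20.0)),       // null in onlyColumnInOneColumnDataFrame
  (Some(3), None),             // null only in "other"
  (Some(4), Some(Double.NaN))  // NaN only in "other"
).toDF("onlyColumnInOneColumnDataFrame", "other")

// na.drop() checks every column: only the first row survives.
df.na.drop().show()
```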

With df.filter(df.col("onlyColumnInOneColumnDataFrame").isNotNull()) you drop only those rows that have null in the column onlyColumnInOneColumnDataFrame; nulls in other columns are ignored. Note that isNotNull() alone does not remove NaN values (NaN is not null), which is why the question's edit adds !isNaN().
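Reusing the df from the sketch above, the single-column filter keeps the rows whose nulls or NaNs live in the other column:

```scala
import org.apache.spark.sql.functions.col

// Only onlyColumnInOneColumnDataFrame is inspected; rows 3 and 4
// survive even though they contain null/NaN in "other".
df.filter(col("onlyColumnInOneColumnDataFrame").isNotNull).show()
```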

If you want to achieve the same thing with na.drop, that would be df.na.drop(["onlyColumnInOneColumnDataFrame"]).
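In the Scala API the column subset is passed as a Seq rather than a Python-style list. Reusing the same df, this matches the filter above (and since the column's type is Integer, it cannot hold NaN, so isNotNull alone already covers it):

```scala
// Restrict na.drop to the one column: identical to the filter above.
df.na.drop(Seq("onlyColumnInOneColumnDataFrame")).show()
```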