
How to exclude multiple columns from a Spark DataFrame in Python

Updated: 2023-11-18 23:01:22

Simply with select:

df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])
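The comprehension itself is plain Python: in Spark, `df.columns` is just a list of column-name strings, so the filtering step can be checked without a running cluster. A minimal sketch, using hypothetical column names in place of a real DataFrame's schema:

```python
# df.columns in Spark is simply a Python list of column-name strings;
# the names below are hypothetical stand-ins for a real schema.
columns = ['GpuName', 'GPU1_TwoPartHwID', 'Price', 'Vendor']
excluded = {'GpuName', 'GPU1_TwoPartHwID'}

# Using a set for the exclusion list gives O(1) membership checks.
kept = [c for c in columns if c not in excluded]
print(kept)  # ['Price', 'Vendor']
```

The resulting list is what gets passed to `select`, producing a DataFrame with only the remaining columns.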

Or, if you really want to use drop, then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
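Here `reduce` applies `DataFrame.drop` repeatedly, feeding each intermediate result back in as the first argument of the next call. The same threading can be sketched with a plain dict standing in for the DataFrame (a hypothetical illustration, not Spark itself):

```python
from functools import reduce

# Stand-in for DataFrame.drop: a dict plays the role of the DataFrame
# (hypothetical illustration, not Spark itself).
def drop(d, col):
    return {k: v for k, v in d.items() if k != col}

row = {'GpuName': 'X', 'GPU1_TwoPartHwID': 'Y', 'Price': 100}

# reduce chains the calls: drop(drop(row, 'GpuName'), 'GPU1_TwoPartHwID')
result = reduce(drop, ['GpuName', 'GPU1_TwoPartHwID'], row)
print(result)  # {'Price': 100}
```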

Note (execution time difference):

There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.

There is a difference, however, when we analyze the driver-side code:

  • The first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded.
  • The first method generates a logical plan that is equivalent to the physical plan. In the second case it is rewritten.
  • Finally, comprehensions are significantly faster in Python than methods like map or reduce.
  • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
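The driver-side difference can be sketched with a toy stand-in class that counts calls instead of talking to a JVM (a hypothetical illustration; `FakeDF` is not part of Spark):

```python
from functools import reduce

calls = []  # records each simulated driver-side JVM call

class FakeDF:
    """Toy stand-in for a Spark DataFrame (hypothetical, illustration only)."""
    def __init__(self, columns):
        self.columns = list(columns)

    def select(self, cols):
        calls.append('select')   # one call, whole column list at once
        return FakeDF(cols)

    def drop(self, *cols):
        calls.append('drop')     # one call per drop invocation
        return FakeDF([c for c in self.columns if c not in cols])

df = FakeDF(['GpuName', 'GPU1_TwoPartHwID', 'Price'])

# Method 1: a single select call carrying the full column list.
df.select([c for c in df.columns if c not in {'GpuName', 'GPU1_TwoPartHwID'}])

# Method 2: reduce invokes drop once per excluded column (two calls here).
reduce(FakeDF.drop, ['GpuName', 'GPU1_TwoPartHwID'], df)

# Spark 2.x+ style: drop accepts multiple columns in one call.
df.drop('GpuName', 'GPU1_TwoPartHwID')

print(calls)  # ['select', 'drop', 'drop', 'drop']
```

With the Spark 2.x+ variadic `drop`, both excluded columns go out in a single call, so the `reduce` workaround is only needed on older versions.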