
How to exclude multiple columns from a Spark DataFrame in Python

Updated: 2023-11-18 23:01:22

Simply with select:

df.select([c for c in df.columns if c not in {'GpuName','GPU1_TwoPartHwID'}])
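The comprehension itself is plain Python: in Spark, `df.columns` is just a list of column-name strings, so the filtering step can be checked without a running cluster. A minimal sketch, using hypothetical column names in place of a real DataFrame's schema:

```python
# df.columns in Spark is simply a Python list of column-name strings;
# the names below are hypothetical stand-ins for a real schema.
columns = ['GpuName', 'GPU1_TwoPartHwID', 'Price', 'Vendor']
excluded = {'GpuName', 'GPU1_TwoPartHwID'}

# Using a set for the exclusion list gives O(1) membership checks.
kept = [c for c in columns if c not in excluded]
print(kept)  # ['Price', 'Vendor']
```

The resulting list is what gets passed to `select`, producing a DataFrame with only the remaining columns.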

Or, if you really want to use drop, then reduce should do the trick:

from functools import reduce
from pyspark.sql import DataFrame

reduce(DataFrame.drop, ['GpuName','GPU1_TwoPartHwID'], df)
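Here `reduce` applies `DataFrame.drop` repeatedly, feeding each intermediate result back in as the first argument of the next call. The same threading can be sketched with a plain dict standing in for the DataFrame (a hypothetical illustration, not Spark itself):

```python
from functools import reduce

# Stand-in for DataFrame.drop: a dict plays the role of the DataFrame
# (hypothetical illustration, not Spark itself).
def drop(d, col):
    return {k: v for k, v in d.items() if k != col}

row = {'GpuName': 'X', 'GPU1_TwoPartHwID': 'Y', 'Price': 100}

# reduce chains the calls: drop(drop(row, 'GpuName'), 'GPU1_TwoPartHwID')
result = reduce(drop, ['GpuName', 'GPU1_TwoPartHwID'], row)
print(result)  # {'Price': 100}
```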

Note (execution time difference):

There should be no difference when it comes to data processing time. While these methods generate different logical plans, the physical plans are exactly the same.

There is a difference, however, when we analyze the driver-side code:

  • The first method makes only a single JVM call, while the second one has to call the JVM for each column that has to be excluded.
  • The first method generates a logical plan that is equivalent to the physical plan. In the second case it is rewritten.
  • Finally, comprehensions are significantly faster in Python than methods like map or reduce.
  • Spark 2.x+ supports multiple columns in drop. See SPARK-11884 (Drop multiple columns in the DataFrame API) and SPARK-12204 (Implement drop method for DataFrame in SparkR) for details.
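The driver-side difference can be sketched with a toy stand-in class that counts calls instead of talking to a JVM (a hypothetical illustration; `FakeDF` is not part of Spark):

```python
from functools import reduce

calls = []  # records each simulated driver-side JVM call

class FakeDF:
    """Toy stand-in for a Spark DataFrame (hypothetical, illustration only)."""
    def __init__(self, columns):
        self.columns = list(columns)

    def select(self, cols):
        calls.append('select')   # one call, whole column list at once
        return FakeDF(cols)

    def drop(self, *cols):
        calls.append('drop')     # one call per drop invocation
        return FakeDF([c for c in self.columns if c not in cols])

df = FakeDF(['GpuName', 'GPU1_TwoPartHwID', 'Price'])

# Method 1: a single select call carrying the full column list.
df.select([c for c in df.columns if c not in {'GpuName', 'GPU1_TwoPartHwID'}])

# Method 2: reduce invokes drop once per excluded column (two calls here).
reduce(FakeDF.drop, ['GpuName', 'GPU1_TwoPartHwID'], df)

# Spark 2.x+ style: drop accepts multiple columns in one call.
df.drop('GpuName', 'GPU1_TwoPartHwID')

print(calls)  # ['select', 'drop', 'drop', 'drop']
```

With the Spark 2.x+ variadic `drop`, both excluded columns go out in a single call, so the `reduce` workaround is only needed on older versions.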