且构网

Sharing all things programming and software development...

PySpark: drop columns that have the same value in all rows

Last updated: 2022-11-14 10:27:58

You can apply the countDistinct() aggregation function to every column to get the number of distinct values per column. A column whose count is 1 holds only one value across all rows, so it can be dropped.

from pyspark.sql.functions import col, countDistinct

# count the distinct values in each column with a single aggregation
col_counts = df.agg(*(countDistinct(col(c)).alias(c) for c in df.columns)).collect()[0].asDict()

# collect the columns that hold exactly one distinct value
cols_to_drop = [c for c in df.columns if col_counts[c] == 1]

# drop those columns and display the result
df.drop(*cols_to_drop).show()
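
One thing to keep in mind is that countDistinct() ignores nulls, so a column holding one value plus some nulls also reports a count of 1, while an all-null column reports 0. If that matters for your data, a null-aware variant might look like the sketch below. The helper names `constant_columns` and `drop_constant_columns` are hypothetical, and treating an all-null column as "constant" is an assumption about the desired behaviour, not something the answer above specifies. The sketch also assumes column names without dots or other characters that would need backtick quoting.

```python
def constant_columns(stats):
    """Return the names of columns that hold one identical value in every row.

    `stats` maps column name -> (distinct_non_null_count, null_count).
    A column is treated as constant when it has exactly one distinct
    non-null value and no nulls, or when every row is null (assumption).
    """
    return [
        name
        for name, (distinct, nulls) in stats.items()
        if (distinct == 1 and nulls == 0) or (distinct == 0 and nulls > 0)
    ]


def drop_constant_columns(df):
    """Drop constant columns from a PySpark DataFrame (hypothetical helper)."""
    # Imported lazily so the pure-Python helper above can be tested without Spark.
    from pyspark.sql import functions as F

    if not df.columns:
        return df

    aggs = []
    for c in df.columns:
        # distinct non-null values in the column
        aggs.append(F.countDistinct(F.col(c)).alias(f"{c}__distinct"))
        # number of null rows in the column
        aggs.append(F.count(F.when(F.col(c).isNull(), 1)).alias(f"{c}__nulls"))

    # a single pass over the data computes both statistics for every column
    row = df.agg(*aggs).first().asDict()
    stats = {c: (row[f"{c}__distinct"], row[f"{c}__nulls"]) for c in df.columns}

    to_drop = constant_columns(stats)
    return df.drop(*to_drop) if to_drop else df
```

With this in place, `drop_constant_columns(df).show()` would replace the last step above. Keeping the decision logic in a plain function makes the null-handling rule easy to change or unit-test without starting a Spark session.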