且构网

Select or drop duplicate columns from a Spark DataFrame

Updated: 2023-11-18 23:26:58

The only way I found after hours of research is to rename the columns, then create another DataFrame that uses the new names as its header.

For example, if you have:

>>> import pyspark
>>> from pyspark.sql import SQLContext
>>>
>>> sc = pyspark.SparkContext()
>>> sqlContext = SQLContext(sc)
>>> # Build a DataFrame whose first and third columns share the name 'a'
>>> df = sqlContext.createDataFrame([(1, 2, 3), (4, 5, 6)], ['a', 'b', 'a'])
>>> df
DataFrame[a: bigint, b: bigint, a: bigint]
>>> df.columns
['a', 'b', 'a']
>>> # toDF takes the new names as separate positional arguments
>>> df2 = df.toDF('a', 'b', 'c')
>>> df2.columns
['a', 'b', 'c']

You can get the list of columns with df.columns, then use a loop to rename any duplicates and build the new column list. Don't forget to pass *new_col_list rather than new_col_list to toDF; otherwise toDF receives a single argument instead of one name per column and throws an invalid count error.
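
The renaming loop itself is not shown in the answer, so here is a minimal sketch of one way it could look. The helper name rename_duplicates and the numeric "_1", "_2" suffix scheme are assumptions made for illustration; any scheme that yields unique names would do. The function is plain Python and only builds the new name list.

```python
def rename_duplicates(columns):
    """Return a copy of `columns` in which every name is unique.

    Hypothetical helper (not part of PySpark): a repeated name gets
    a numeric suffix such as 'a_1', 'a_2', and so on.
    """
    used = set()
    new_columns = []
    for name in columns:
        candidate = name
        suffix = 0
        # Keep bumping the suffix until the candidate name is unused.
        while candidate in used:
            suffix += 1
            candidate = f"{name}_{suffix}"
        used.add(candidate)
        new_columns.append(candidate)
    return new_columns


# Same column names as df.columns in the example above.
new_col_list = rename_duplicates(['a', 'b', 'a'])
print(new_col_list)
```

With the DataFrame from the example, the result would then be applied as df2 = df.toDF(*new_col_list), unpacking the list so that each name is passed as its own argument.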