Spark:数据帧中zipwithindex的等效项

更新时间：2023-11-18 22:12:58

请查看 https://issues.apache.org/jira/browse/SPARK-23074 对于数据帧中的这种直接功能奇偶校验.. 如果您有兴趣在 Spark 的某个时候看到这一点，请对该 jira 点赞.

Please check https://issues.apache.org/jira/browse/SPARK-23074 for this direct functionality parity in dataframes .. upvote that jira if you're interested to see this at some point in Spark.

这是 PySpark 中的一种解决方法:

Here's a workaround though in PySpark:

def dfZipWithIndex (df, offset=1, colName="rowId"):
    '''
        Enumerates dataframe rows is native order, like rdd.ZipWithIndex(), but on a dataframe 
        and preserves a schema

        :param df: source dataframe
        :param offset: adjustment to zipWithIndex()'s index
        :param colName: name of the index column
    '''

    new_schema = StructType(
                    [StructField(colName,LongType(),True)]        # new added field in front
                    + df.schema.fields                            # previous schema
                )

    zipped_rdd = df.rdd.zipWithIndex()

    new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))

    return spark.createDataFrame(new_rdd, new_schema)

这也可以在包.

That's also available in abalon package.

上一篇 : ：如何使用Spark Scala加入3个RDD下一篇 : 如何使用Spark Scala加入3个RDD

Spark:数据帧中zipwithindex的等效项

相关阅读

推荐文章