且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

Spark:数据帧中zipwithindex的等效项

更新时间:2023-11-18 22:12:58

请查看 https://issues.apache.org/jira/browse/SPARK-23074 对于数据帧中的这种直接功能奇偶校验.. 如果您有兴趣在 Spark 的某个时候看到这一点,请对该 jira 点赞.

Please check https://issues.apache.org/jira/browse/SPARK-23074 for this direct functionality parity in dataframes .. upvote that jira if you're interested to see this at some point in Spark.

这是 PySpark 中的一种解决方法:

Here's a workaround though in PySpark:

def dfZipWithIndex (df, offset=1, colName="rowId"):
    '''
        Enumerates dataframe rows is native order, like rdd.ZipWithIndex(), but on a dataframe 
        and preserves a schema

        :param df: source dataframe
        :param offset: adjustment to zipWithIndex()'s index
        :param colName: name of the index column
    '''

    new_schema = StructType(
                    [StructField(colName,LongType(),True)]        # new added field in front
                    + df.schema.fields                            # previous schema
                )

    zipped_rdd = df.rdd.zipWithIndex()

    new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))

    return spark.createDataFrame(new_rdd, new_schema)

这也可以在 包.

That's also available in abalon package.