更新时间:2023-11-18 22:12:58
请查看 https://issues.apache.org/jira/browse/SPARK-23074 对于数据帧中的这种直接功能奇偶校验.. 如果您有兴趣在 Spark 的某个时候看到这一点,请对该 jira 点赞.
Please check https://issues.apache.org/jira/browse/SPARK-23074 for this direct functionality parity in dataframes .. upvote that jira if you're interested to see this at some point in Spark.
这是 PySpark 中的一种解决方法:
Here's a workaround though in PySpark:
def dfZipWithIndex (df, offset=1, colName="rowId"):
'''
Enumerates dataframe rows is native order, like rdd.ZipWithIndex(), but on a dataframe
and preserves a schema
:param df: source dataframe
:param offset: adjustment to zipWithIndex()'s index
:param colName: name of the index column
'''
new_schema = StructType(
[StructField(colName,LongType(),True)] # new added field in front
+ df.schema.fields # previous schema
)
zipped_rdd = df.rdd.zipWithIndex()
new_rdd = zipped_rdd.map(lambda args: ([args[1] + offset] + list(args[0])))
return spark.createDataFrame(new_rdd, new_schema)
这也可以在 包.
That's also available in abalon package.