How do I create a copy of a DataFrame in PySpark?

Updated: 2023-11-17 14:41:52

As explained in the answer to the other question, you can make a deep copy of the initial schema. We can then modify that copy and use it to initialize the new DataFrame _X:

from pyspark.sql import SparkSession
from pyspark.sql.types import LongType
import copy

# Get (or create) the active SparkSession; the snippet assumes one is available
spark = SparkSession.builder.getOrCreate()

# Original DataFrame with two Long columns, 'a' and 'b'
X = spark.createDataFrame([[1, 2], [3, 4]], ['a', 'b'])

# Deep-copy the schema so X.schema itself is left untouched
_schema = copy.deepcopy(X.schema)
_schema.add('id_col', LongType(), False)  # add() modifies the copied schema in place

# Append a row index via zipWithIndex and rebuild a DataFrame with the new schema
_X = X.rdd.zipWithIndex().map(lambda l: list(l[0]) + [l[1]]).toDF(_schema)

Now let's check the schemas:

print('Schema of X: ' + str(X.schema))
print('Schema of _X: ' + str(_X.schema))

Output:

Schema of X: StructType(List(StructField(a,LongType,true),StructField(b,LongType,true)))
Schema of _X: StructType(List(StructField(a,LongType,true),
                  StructField(b,LongType,true),StructField(id_col,LongType,false)))
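
To look at the data as well as the schema, here is a quick sketch that reuses _X from above; with this two-row example, zipWithIndex assigns the row indices 0 and 1:

_X.show()
# Expected: the original columns plus id_col holding the row index,
# i.e. the rows (1, 2, 0) and (3, 4, 1) for this example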

Note that to copy a DataFrame you can simply use _X = X: DataFrames are immutable, so whenever you add a new column with e.g. withColumn, the original object is not altered in place; a new DataFrame is returned. Hope this helps!
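
As a quick illustration of that last point, here is a minimal sketch reusing the X DataFrame created above (the column name 'c' and the variable Y are just illustrative): withColumn leaves the original DataFrame untouched and returns a new one.

import pyspark.sql.functions as F

_X = X                           # just another reference to the same immutable DataFrame
Y = X.withColumn('c', F.lit(0))  # returns a new DataFrame with an extra column

print(X.columns)   # ['a', 'b']        -- X is unchanged
print(Y.columns)   # ['a', 'b', 'c']   -- the new column only exists on Y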