Updated: 2022-12-09 08:12:24
The easiest way is to create a pandas DataFrame and convert it to a Spark DataFrame:
import pandas as pd

col_dict = {'col1': [1, 2, 3],
            'col2': [4, 5, 6]}

pandas_df = pd.DataFrame(col_dict)
df = sqlCtx.createDataFrame(pandas_df)  # sqlCtx is a SQLContext (use a SparkSession in Spark 2+)
df.show()
#+----+----+
#|col1|col2|
#+----+----+
#|   1|   4|
#|   2|   5|
#|   3|   6|
#+----+----+
If pandas is not available, you'll just have to manipulate your data into a form that works for the createDataFrame() function. Quoting myself from a previous answer:
I find it's useful to think of the argument to createDataFrame() as a list of tuples where each entry in the list corresponds to a row in the DataFrame and each element of the tuple corresponds to a column.
colnames, data = zip(*col_dict.items())
print(colnames)
#('col1', 'col2')
print(data)
#([1, 2, 3], [4, 5, 6])

(Dicts preserve insertion order in Python 3.7+; in older versions the key order is arbitrary, so you might see the columns come out in a different order.)

Now we need to modify data so that it's a list of tuples, where each element contains the data for the corresponding column. Luckily, this is easy using zip:

data = list(zip(*data))  # zip() returns an iterator in Python 3, so materialize it into a list
print(data)
#[(1, 4), (2, 5), (3, 6)]

Now call createDataFrame():

df = sqlCtx.createDataFrame(data, colnames)
df.show()
#+----+----+
#|col1|col2|
#+----+----+
#|   1|   4|
#|   2|   5|
#|   3|   6|
#+----+----+
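Putting the two steps together, the whole dict-of-columns to list-of-rows transformation fits in one small helper. This is a sketch (the name dict_to_rows is my own), and it assumes Python 3.7+ so that the dict's insertion order determines the column order:

```python
def dict_to_rows(col_dict):
    """Split a dict of columns into (colnames, rows) suitable for createDataFrame()."""
    colnames, columns = zip(*col_dict.items())
    # Transpose the columns into rows: one tuple per row.
    rows = list(zip(*columns))
    return list(colnames), rows

colnames, rows = dict_to_rows({'col1': [1, 2, 3], 'col2': [4, 5, 6]})
print(colnames)  # ['col1', 'col2']
print(rows)      # [(1, 4), (2, 5), (3, 6)]
# df = sqlCtx.createDataFrame(rows, colnames)
```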