Convert a Spark DataFrame column to a Python list

Updated: 2022-06-04 04:32:34

Here is why your approach does not work. First, you are trying to get an integer from a Row object. The output of collect() is a list of Row objects, like this:

>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)

If you access the field by name, like this:

>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1

You will get the mvv value. If you want every value of the column in a list, you can do something like this:

>>> mvv_array = [int(row.mvv) for row in mvv_list]
>>> mvv_array
Out: [1, 2, 3, 4]

But if you try the same for the other column, you get:

>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'

This happens because Row is a tuple subclass and count is a built-in tuple method, so row.count resolves to that method rather than to the column of the same name. One workaround is to rename the count column to _count:

>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]
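The name clash can be reproduced without Spark. The sketch below uses FakeRow, a hypothetical and much simplified stand-in for pyspark.sql.Row (it is not the real class). It only assumes what matters here: the row is a tuple subclass, and field names are resolved through __getattr__, which Python calls only after normal attribute lookup has failed.

```python
class FakeRow(tuple):
    """Hypothetical, simplified stand-in for pyspark.sql.Row."""

    def __new__(cls, **fields):
        # Store the values in the tuple itself and remember the field names.
        row = super().__new__(cls, fields.values())
        row._names = list(fields)
        return row

    def __getattr__(self, name):
        # Reached only when normal lookup (instance, class, tuple) finds nothing.
        if name.startswith("_"):
            raise AttributeError(name)
        try:
            return self[self._names.index(name)]
        except ValueError:
            raise AttributeError(name) from None


row = FakeRow(mvv=1, count=5)

mvv_value = row.mvv     # no tuple attribute called "mvv", so the field is returned
count_attr = row.count  # tuple.count exists, so the field is never consulted

print(mvv_value)
print(callable(count_attr))
```

Because tuple.count is found first, count_attr is a bound method, and passing it to int() fails in the way shown above.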

But this workaround is not needed, because you can access the column using dictionary syntax:

>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]

And it will finally work!