Updated: 2022-06-04 04:32:34
Here is why your approach does not work. First, you are trying to get an integer from a Row object. The output of your collect looks like this:
>>> mvv_list = mvv_count_df.select('mvv').collect()
>>> mvv_list[0]
Out: Row(mvv=1)
If you access the field by name, like this:
>>> firstvalue = mvv_list[0].mvv
>>> firstvalue
Out: 1
You get the mvv value. If you want every value in the column as a list, you can use a list comprehension over the collected rows:
>>> mvv_array = [int(row.mvv) for row in mvv_count_df.collect()]
>>> mvv_array
Out: [1,2,3,4]
But if you try the same for the other column, you get:
>>> mvv_count = [int(row.count) for row in mvv_count_df.collect()]
Out: TypeError: int() argument must be a string or a number, not 'builtin_function_or_method'
This happens because count is a built-in method that Row inherits from tuple, and the column has the same name. As a result, row.count returns the method instead of the column value. One workaround is to rename the count column to _count:
>>> renamed_df = mvv_count_df.selectExpr("mvv as mvv", "count as _count")
>>> mvv_count = [int(row._count) for row in renamed_df.collect()]
But this workaround is not needed, because you can access the column using dictionary syntax:
>>> mvv_array = [int(row['mvv']) for row in mvv_count_df.collect()]
>>> mvv_count = [int(row['count']) for row in mvv_count_df.collect()]
And it will finally work!
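To see why attribute access picks up the method while dictionary syntax returns the column, here is a minimal sketch in plain Python. FakeRow is a hypothetical stand-in written for this illustration, not PySpark's actual Row implementation. It only assumes what the error above implies: the row is a tuple subclass that resolves field names through __getattr__, which Python calls only after normal attribute lookup has failed.

```python
# Hypothetical stand-in for a PySpark Row (an assumption for illustration,
# not the real implementation): a tuple subclass with named fields.
class FakeRow(tuple):
    def __new__(cls, **fields):
        row = super().__new__(cls, fields.values())
        row._names = list(fields)  # remember the field order
        return row

    def __getattr__(self, name):
        # Python calls this only when normal attribute lookup fails,
        # so names that tuple already defines never reach this method.
        names = self.__dict__.get("_names", [])
        try:
            return tuple.__getitem__(self, names.index(name))
        except ValueError:
            raise AttributeError(name)

    def __getitem__(self, key):
        # Dictionary-style access looks up the field by name directly
        # and bypasses attribute lookup entirely.
        if isinstance(key, str):
            return tuple.__getitem__(self, self._names.index(key))
        return tuple.__getitem__(self, key)


row = FakeRow(mvv=1, count=5)

print(row.mvv)              # field value, resolved through __getattr__
print(callable(row.count))  # tuple.count wins, so this is a method
print(row["count"])         # field value, resolved by name
```

The same clash would apply to any column named after another tuple method, such as index, so dictionary syntax is the safer habit when column names are not under your control.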