Updated: 2022-12-03 10:11:55
In PySpark + SQL:
logDF.registerTempTable("logDF")
mostPopularWebPageDF = sqlContext.sql("""
    select webPage, cntWebPage from (
        select webPage, count(*) as cntWebPage,
               max(count(*)) over () as maxcnt
        from logDF
        group by webPage) as tmp
    where tmp.cntWebPage = tmp.maxcnt""")
Maybe I can make it cleaner, but it works. I will try to optimize it.
My result:
webPage cntWebPage
google.com 2
For the dataset:
webPage usersid
google.com 1
google.com 3
bing.com 10
Explanation: the plain counting is done via grouping + the count(*) function. The maximum over all these counts is computed with a window function, so for the dataset above the intermediate DataFrame (without dropping the maxCount column) is:
webPage count maxCount
google.com 2 2
bing.com 1 2
Then we select the rows whose count equals maxCount.
I have deleted the DSL version - it did not support window over () and the ordering was changing the result. Sorry for this bug. The SQL version is correct.