Updated: 2022-12-03 10:11:55
In PySpark + SQL:
logDF.registerTempTable("logDF")
mostPopularWebPageDF = sqlContext.sql("""
    select webPage, cntWebPage from (
        select webPage, count(*) as cntWebPage,
               max(count(*)) over () as maxcnt
        from logDF
        group by webPage) as tmp
    where tmp.cntWebPage = tmp.maxcnt""")
Maybe I can make it cleaner, but it works. I will try to optimize it.
My result:
webPage cntWebPage
google.com 2
For the dataset:
webPage usersid
google.com 1
google.com 3
bing.com 10
Explanation: the plain counting is done via grouping + the count(*) function. The maximum over all these counts is computed with a window function, so for the dataset above the intermediate DataFrame (without dropping the maxCount column) is:
webPage count maxCount
google.com 2 2
bing.com 1 2
Then we select the rows whose count equals maxCount.
I have deleted the DSL version - it did not support window over () and the ordering was changing the result. Sorry for this bug. The SQL version is correct.