且构网 - sharing stories from programming and development

Spark: Remove Duplicate Rows from a DataFrame

Updated: 2023-11-18 18:45:10

You can use a window function in Spark SQL to achieve this.

df.registerTempTable("x")
sqlContext.sql("SELECT a, b, c, d FROM (SELECT *, ROW_NUMBER() OVER (PARTITION BY a ORDER BY b DESC) rn FROM x) y WHERE rn = 1").collect
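The query keeps, for each value of `a`, the single row with the largest `b` (the row where `ROW_NUMBER()` equals 1 within its partition). The same semantics can be sketched in plain Python, without Spark, to make the behavior concrete; the sample rows below are hypothetical:

```python
# Hypothetical sample data: two rows share a = 1, so one is a "duplicate".
rows = [
    {"a": 1, "b": 5, "c": "x", "d": "p"},
    {"a": 1, "b": 9, "c": "y", "d": "q"},  # kept: largest b within a = 1
    {"a": 2, "b": 3, "c": "z", "d": "r"},  # kept: only row for a = 2
]

def dedupe_keep_max_b(rows):
    # Mirrors PARTITION BY a ORDER BY b DESC ... WHERE rn = 1:
    # for each partition key a, retain only the row with the highest b.
    best = {}
    for row in rows:
        if row["a"] not in best or row["b"] > best[row["a"]]["b"]:
            best[row["a"]] = row
    return sorted(best.values(), key=lambda r: r["a"])

print(dedupe_keep_max_b(rows))
```

This is illustration only; in Spark the window function runs distributed across partitions rather than in a single dictionary.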

This will achieve what you need. Read more about window function support at https://databricks.com/blog/2015/07/15/introducing-window-functions-in-spark-sql.html