How to count occurrences of each distinct value for every column in a dataframe?

Updated: 2022-12-10 09:19:35

countDistinct is probably the first choice:

import org.apache.spark.sql.functions.countDistinct

df.agg(countDistinct("some_column"))
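The question asks about every column, whereas the call above handles a single one. A minimal sketch of applying the same aggregate to all columns follows. It assumes Spark 2.x or later (for SparkSession), a local session, and column names without dots or other special characters; the sample data and the names `letter`, `number`, `exprs` and `distinctCounts` are made up for illustration.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, countDistinct}

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("distinct-counts").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq(("a", 1), ("a", 2), ("b", 2)).toDF("letter", "number")

// Build one countDistinct expression per column, keeping the column name as alias
val exprs = df.columns.map(c => countDistinct(col(c)).alias(c))

// agg takes a first expression plus varargs, hence head and tail
val distinctCounts = df.agg(exprs.head, exprs.tail: _*)
distinctCounts.show()
```

The result is a single row with one distinct count per column. Swapping countDistinct for approx_count_distinct in the map should give the approximate variant in the same shape.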

If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):

import org.apache.spark.sql.functions.approx_count_distinct

df.agg(approx_count_distinct("some_column"))

To get the values and their counts:

df.groupBy("some_column").count()
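To get value counts for every column rather than one, the groupBy can be repeated per column. This is a rough sketch under the same assumptions as before (Spark 2.x or later, a local session, simple column names); the sample data and the name `valueCounts` are hypothetical. Each column triggers its own job, so this may be slow on wide tables.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("value-counts").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq(("a", 1), ("a", 2), ("b", 2)).toDF("letter", "number")

// One (value, count) DataFrame per column, keyed by column name
val valueCounts = df.columns.map(c => c -> df.groupBy(col(c)).count()).toMap

// Print each column's value counts
valueCounts.foreach { case (name, counts) =>
  println(s"Column: $name")
  counts.show()
}
```

Keeping the results as separate DataFrames avoids forcing columns of different types into one schema.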

In SQL (spark-sql):

SELECT COUNT(DISTINCT some_column) FROM df

and

SELECT approx_count_distinct(some_column) FROM df
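These queries assume a table or view named df is visible to the SQL engine. When starting from a DataFrame in Scala, one way to get there is a temporary view. The sketch below assumes Spark 2.x or later; the sample data and the alias `n` are illustrative only.

```scala
import org.apache.spark.sql.SparkSession

// Assumption: a local session; in spark-shell getOrCreate() returns the existing one
val spark = SparkSession.builder().master("local[*]").appName("distinct-sql").getOrCreate()
import spark.implicits._

// Hypothetical sample data
val df = Seq("a", "a", "b").toDF("some_column")

// Register the DataFrame under the name "df" so that FROM df resolves
df.createOrReplaceTempView("df")

val exact = spark.sql("SELECT COUNT(DISTINCT some_column) AS n FROM df")
val approx = spark.sql("SELECT approx_count_distinct(some_column) AS n FROM df")

exact.show()
approx.show()
```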
