Updated: 2022-12-10 09:19:35
countDistinct is probably the first choice:
import org.apache.spark.sql.functions.countDistinct
df.agg(countDistinct("some_column"))
If speed is more important than accuracy, you may consider approx_count_distinct (approxCountDistinct in Spark 1.x):
import org.apache.spark.sql.functions.approx_count_distinct
df.agg(approx_count_distinct("some_column"))
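To compare the two aggregates side by side, here is a minimal sketch. It assumes Spark 2.1 or later, a local SparkSession, and a tiny made-up dataset; the object name `DistinctCountSketch` and the helper names `exact` and `approximate` are hypothetical and exist only for illustration. The second argument to approx_count_distinct is the maximum relative standard deviation allowed for the estimate.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{approx_count_distinct, countDistinct}

// Hypothetical helper object, for illustration only.
object DistinctCountSketch {
  // Exact number of distinct non-null values in `column`.
  def exact(df: DataFrame, column: String): Long =
    df.agg(countDistinct(column)).first().getLong(0)

  // Approximate number of distinct values; `rsd` is the maximum
  // relative standard deviation allowed for the estimate.
  def approximate(df: DataFrame, column: String, rsd: Double = 0.05): Long =
    df.agg(approx_count_distinct(column, rsd)).first().getLong(0)

  def main(args: Array[String]): Unit = {
    // Assumption: a local session is enough for this toy example.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("distinct-count-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with three distinct values.
    val df = Seq("a", "b", "a", "c", "b").toDF("some_column")

    val exactCount = exact(df, "some_column")
    val approxCount = approximate(df, "some_column")
    println("exact = " + exactCount)
    println("approx = " + approxCount)

    spark.stop()
  }
}
```

On data this small the two numbers will usually agree; the approximate version is meant for large, high-cardinality columns, where it avoids the cost of an exact distinct.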
To get the values and their counts:
df.groupBy("some_column").count()
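The per-value counts are often easier to read when sorted. The following is a sketch under the same assumptions (local SparkSession, made-up data); `ValueCountsSketch` and `valueCounts` are hypothetical names. It relies on the fact that groupBy(...).count() names its result column `count`.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.desc

// Hypothetical helper object, for illustration only.
object ValueCountsSketch {
  // One row per distinct value with its frequency, most frequent first.
  def valueCounts(df: DataFrame, column: String): DataFrame =
    df.groupBy(column).count().orderBy(desc("count"))

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("value-counts-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data.
    val df = Seq("a", "b", "a", "c", "a").toDF("some_column")

    // Prints each distinct value alongside its count.
    valueCounts(df, "some_column").show()

    spark.stop()
  }
}
```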
In SQL (spark-sql):
SELECT COUNT(DISTINCT some_column) FROM df
and
SELECT approx_count_distinct(some_column) FROM df
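The queries above need a table or view named df to exist. When working from the DataFrame API rather than the spark-sql shell, one way to get there is to register the DataFrame as a temporary view first. This is a sketch assuming Spark 2.0 or later and made-up data; `SqlDistinctSketch` and `distinctViaSql` are hypothetical names.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}

// Hypothetical helper object, for illustration only.
object SqlDistinctSketch {
  // Registers `df` under the view name "df" and counts distinct values via SQL.
  def distinctViaSql(spark: SparkSession, df: DataFrame): Long = {
    df.createOrReplaceTempView("df")
    spark.sql("SELECT COUNT(DISTINCT some_column) FROM df").first().getLong(0)
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("sql-distinct-sketch")
      .getOrCreate()
    import spark.implicits._

    // Made-up sample data with three distinct values.
    val df = Seq("a", "b", "a", "c", "b").toDF("some_column")

    val n = distinctViaSql(spark, df)
    println("distinct = " + n)

    spark.stop()
  }
}
```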