Spark DataFrame: adding a new column with random data

Updated: 2022-06-13 09:08:49

You are using Python's built-in random module. A call to it returns one concrete Python value, so what you pass on is a constant (whatever value was returned), not something that is re-evaluated for each row.
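To make the "constant" point concrete, here is a minimal plain-Python sketch. It does not use Spark at all: the list of dicts is only a hypothetical stand-in for DataFrame rows, and the seed and variable names are assumptions chosen for the illustration.

```python
import random

random.seed(0)  # fixed seed so the demo is repeatable (arbitrary choice)

# Hypothetical stand-in for the rows of a DataFrame
rows = [{"id": i} for i in range(5)]

# The mistake: random.random() is called once, before any row is processed,
# so every row receives the same number.
constant = random.random()
with_constant = [{**row, "val": constant} for row in rows]

# What was intended: the random draw is part of the per-row computation.
with_per_row = [{**row, "val": random.random()} for row in rows]

print(len({row["val"] for row in with_constant}))  # 1 distinct value
print(len({row["val"] for row in with_per_row}))   # more than one distinct value
```

A column expression plays the role of the second variant: it describes a computation to run for each row rather than a value computed up front.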

As the error message shows, withColumn expects a Column that represents an expression, not a plain Python value.

To do this:

from pyspark.sql.functions import rand, when
# rand() is a Column expression evaluated per row; when/otherwise maps it to 1 or 0
df1 = df.withColumn('isVal', when(rand() > 0.5, 1).otherwise(0))

Here rand() produces values uniformly distributed in [0.0, 1.0), so isVal comes out as 0 or 1 with roughly equal probability. See the functions documentation for more options (http://spark.apache.org/docs/latest/api/python/pyspark.sql.html#module-pyspark.sql.functions)
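The effect of thresholding a uniform draw at 0.5 can be checked with a small plain-Python simulation. This is only a sketch under stated assumptions: it uses the standard library's random generator as a stand-in for Spark's rand(), and the seed, sample size and variable names are hypothetical.

```python
import random

rng = random.Random(42)  # seeded generator; the seed is an arbitrary choice

n = 100_000
# rng.random() is uniform on [0.0, 1.0), used here as a stand-in for rand()
draws = [rng.random() for _ in range(n)]

# Same shape as when(rand() > 0.5, 1).otherwise(0)
is_val = [1 if u > 0.5 else 0 for u in draws]

share_of_ones = sum(is_val) / n
print(share_of_ones)  # close to 0.5
```

Replacing 0.5 with another threshold p shifts the share of ones to roughly 1 - p, which is one simple way to get an unbalanced flag column.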