How to add a new column with random values to an existing DataFrame in Scala

Updated: 2023-11-18 18:40:16

Spark >= 2.3

It is possible to disable some optimizations using the asNondeterministic method:

import org.apache.spark.sql.expressions.UserDefinedFunction

val f: UserDefinedFunction = ???
val fNonDeterministic: UserDefinedFunction = f.asNondeterministic

Please make sure you understand the guarantees before using this option.
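
As an illustration of the snippet above, here is a minimal sketch of how a nondeterministic UDF might be used to add a random column. The object name, the helper addRandomColumn, the column names "id" and "random", the upper bound 100, and the local SparkSession are all assumptions made for the example; none of them come from the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.udf

object NondeterministicUdfSketch {
  // Hypothetical nullary UDF drawing an integer in [0, 100).
  // asNondeterministic() tells the optimizer not to assume that
  // repeated calls return the same value.
  private val randomInt = udf(() => scala.util.Random.nextInt(100)).asNondeterministic()

  // Appends a column named "random" (name chosen for illustration).
  def addRandomColumn(df: DataFrame): DataFrame =
    df.withColumn("random", randomInt())

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("nondeterministic-udf-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("id")
    addRandomColumn(df).show()

    spark.stop()
  }
}
```

Because the result is not cached, the values are likely to differ each time the DataFrame is evaluated.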

Spark < 2.3

A function passed to udf should be deterministic (with the possible exception of SPARK-20586), and nullary function calls can be replaced by constants. If you want to generate random numbers, use one of the built-in functions:

rand - Generate a random column with independent and identically distributed (i.i.d.) samples from U[0.0, 1.0].
randn - Generate a column with independent and identically distributed (i.i.d.) samples from the standard normal distribution.

and transform the output to obtain the required distribution, for example:

(rand * Integer.MAX_VALUE).cast("bigint").cast("string")
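
To show the expression above in context, the following sketch attaches rand, randn, and the transformed value to a DataFrame with withColumn. The object name, the helper addRandomColumns, the column names, and the local SparkSession are assumptions made for the example, not part of the original answer.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{rand, randn}

object BuiltinRandomSketch {
  // Adds three illustrative columns (names are hypothetical):
  //   uniform   - i.i.d. samples from U[0.0, 1.0]
  //   normal    - i.i.d. samples from the standard normal distribution
  //   random_id - a uniform sample scaled to the Int range, rendered as a string
  def addRandomColumns(df: DataFrame): DataFrame =
    df
      .withColumn("uniform", rand())
      .withColumn("normal", randn())
      .withColumn("random_id", (rand() * Integer.MAX_VALUE).cast("bigint").cast("string"))

  def main(args: Array[String]): Unit = {
    // Local session, for illustration only.
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("builtin-random-sketch")
      .getOrCreate()
    import spark.implicits._

    val df = Seq("a", "b", "c").toDF("id")
    addRandomColumns(df).show()

    spark.stop()
  }
}
```

Both rand and randn also accept a seed argument, for example rand(42L), which can help when more repeatable output is wanted.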
