Spark UDF with a DataFrame

I don't believe you can pass in the DateTimeFormatter as an argument to the UDF. You can only pass in a Column. One solution would be to do:

import org.joda.time.format.DateTimeFormat

val return_date = udf((str: String, format: String) => {
  // Parse the string with the given pattern; return a Timestamp so Spark can encode it
  new java.sql.Timestamp(DateTimeFormat.forPattern(format).parseDateTime(str).getMillis)
})

And then:

val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), lit("yyyy/MM/dd"))
)
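
Note the lit("yyyy/MM/dd"): since UDF arguments must be Columns, the constant format string has to be wrapped in a literal Column.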

Honestly, though -- both this and your original algorithms have the same problem. They both parse yyyy/MM/dd using forPattern for every record. Better would be to create a singleton object wrapped around a Map[String,DateTimeFormatter], maybe like this (thoroughly untested, but you get the idea):

import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}
import scala.collection.concurrent.TrieMap

object DateFormatters {
  // Cache one formatter per pattern. TrieMap keeps the cache safe when
  // several executor task threads call getFormatter at the same time.
  private val formatters = TrieMap[String, DateTimeFormatter]()

  def getFormatter(format: String): DateTimeFormatter =
    formatters.getOrElseUpdate(format, DateTimeFormat.forPattern(format))
}

Then you would change the UDF to:

val return_date = udf((str: String, format: String) => {
  // Reuse the cached formatter instead of building one per row
  new java.sql.Timestamp(DateFormatters.getFormatter(format).parseDateTime(str).getMillis)
})

That way, DateTimeFormat.forPattern(...) is only called once per format per executor.
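
To see the pieces working together, here is a minimal end-to-end sketch. It assumes the SparkSession named spark that spark-shell provides, and uses a hypothetical two-row users DataFrame in place of the real table; like the object above, it is untested:

import org.apache.spark.sql.functions.{lit, udf}
import spark.implicits._

// Hypothetical sample data standing in for the real users table
val users = Seq("2023/11/18", "2023/01/02").toDF("ordering_date")

val user_with_dates_formatted = users.withColumn(
  "formatted_date",
  return_date(users("ordering_date"), lit("yyyy/MM/dd"))
)
user_with_dates_formatted.show()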

One thing to note about the singleton object solution is that you can't define the object in the spark-shell -- you have to pack it up in a JAR file and use the --jars option to spark-shell if you want to use the DateFormatters object in the shell.
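
For example, assuming the object has been compiled and packaged into a hypothetical date-formatters.jar, you would start the shell as:

spark-shell --jars date-formatters.jar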