Updated: 2023-11-18 23:22:34
I don't believe you can pass in the DateTimeFormatter as an argument to the UDF. You can only pass in a Column. One solution would be to do:
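import org.apache.spark.sql.functions.{lit, udf}
import org.joda.time.format.DateTimeFormat
val return_date = udf((str: String, format: String) => {
  // formatted(str) is kept from the original snippet; swap in parseDateTime/print as needed
  DateTimeFormat.forPattern(format).formatted(str)
})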
Then:
val user_with_dates_formatted = users.withColumn(
"formatted_date",
return_date(users("ordering_date"), lit("yyyy/MM/dd"))
)
Honestly, though, both this and your original algorithm have the same problem: they both parse yyyy/MM/dd using forPattern for every record. A better approach would be to create a singleton object wrapped around a Map[String,DateTimeFormatter], maybe like this (thoroughly untested, but you get the idea):
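import org.joda.time.format.{DateTimeFormat, DateTimeFormatter}
object DateFormatters {
  // Cache of formatters built so far, keyed by pattern string
  private var formatters = Map[String, DateTimeFormatter]()
  // synchronized because several tasks may run concurrently in one executor JVM
  def getFormatter(format: String): DateTimeFormatter = synchronized {
    formatters.getOrElse(format, {
      val formatter = DateTimeFormat.forPattern(format)
      formatters = formatters + (format -> formatter)
      formatter
    })
  }
}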
Then you change your UDF to:
val return_date = udf((str: String, format: String) => {
  // Look up the cached formatter rather than rebuilding it for every record
  DateFormatters.getFormatter(format).formatted(str)
})
That way, DateTimeFormat.forPattern(...) is only called once per format per executor.
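To make the "once per format" claim concrete, here is a minimal standalone sketch of the same caching idea that counts how often a formatter is actually built. It makes a few assumptions that are not in the answer above: it uses java.time's DateTimeFormatter.ofPattern instead of Joda-Time so that it runs without extra dependencies, and the names CachedFormatters, builds and CachedFormattersDemo are made up for illustration. A Scala object is initialized once per JVM, which is why in Spark each executor would end up with its own copy of the cache.

```scala
import java.time.LocalDate
import java.time.format.DateTimeFormatter
import java.util.concurrent.atomic.AtomicInteger
import scala.collection.mutable

// Hypothetical stand-in for the DateFormatters object, using java.time
// rather than Joda-Time (an assumption made to keep the sketch dependency-free).
object CachedFormatters {
  // Counts how many times a formatter is actually constructed.
  val builds = new AtomicInteger(0)

  // Formatters built so far, keyed by pattern string.
  private val cache = mutable.Map.empty[String, DateTimeFormatter]

  // synchronized so that concurrent callers cannot build the same pattern twice.
  def getFormatter(pattern: String): DateTimeFormatter = synchronized {
    cache.getOrElseUpdate(pattern, {
      builds.incrementAndGet()
      DateTimeFormatter.ofPattern(pattern)
    })
  }
}

object CachedFormattersDemo {
  def main(args: Array[String]): Unit = {
    val dates = Seq("2015/01/31", "2015/02/01", "2015/02/02")
    // Every record asks for the same pattern, so only the first call builds it.
    val parsed = dates.map { d =>
      LocalDate.parse(d, CachedFormatters.getFormatter("yyyy/MM/dd"))
    }
    println(parsed.mkString(", "))
    println(s"formatters built: ${CachedFormatters.builds.get}")
  }
}
```

Running the demo in a single JVM should report one formatter built for the three records; adding a second pattern would raise the count to two.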
One thing to note about the singleton object solution is that you can't define the object in the spark-shell. You have to pack it up in a JAR file and pass that JAR with the --jars option to spark-shell if you want to use the DateFormatters object in the shell.