
Spark DataSet filter performance

Updated: 2023-11-18 16:33:34

This is because of step 3. In the first two approaches, Spark doesn't need to deserialize the whole Java/Scala object; it just looks at the one column and moves on.
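As a rough illustration (the Record class, the subject field, and the input path below are made up for the example), a column-based predicate lets Catalyst see exactly which column is needed:

```scala
import org.apache.spark.sql.{Dataset, SparkSession}
import org.apache.spark.sql.functions.col

// Hypothetical record type; imagine 32 more fields alongside `subject`.
case class Record(subject: String)

object ColumnFilterSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("filter-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Assumed input path; any columnar source such as Parquet works here.
    val ds: Dataset[Record] = spark.read.parquet("/path/to/records").as[Record]

    // Column-based predicate: only the `subject` column is read and compared,
    // so the full Record object is never materialized for each row.
    val filtered = ds.filter(col("subject") === "math")
    filtered.show()

    spark.stop()
  }
}
```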

In the third, since you're using a lambda function, Spark can't tell that you only need the one field, so it pulls all 33 fields out of memory for each row just so you can check that one field.
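By contrast, a typed lambda predicate is opaque to the optimizer. Continuing the same hypothetical Record/ds setup from the sketch above, the slower pattern looks like this:

```scala
// Typed (lambda) predicate on the same hypothetical ds: the function body is a
// black box to Catalyst, so Spark deserializes a full Record (all 33 fields in
// the question's case) for every row before the predicate can run.
val filteredTyped: Dataset[Record] = ds.filter((r: Record) => r.subject == "math")
filteredTyped.show()
```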

I'm not sure why the fourth is so slow. It seems like it should work the same way as the first.
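One way to investigate the slow case is to compare the physical plans of the two styles; with the hypothetical setup above, the typed version typically shows an extra DeserializeToObject/MapElements-style step that the column-based version does not:

```scala
// The column-based filter stays on the columnar/codegen path, while the typed
// lambda filter usually adds a deserialize-to-object step before filtering.
ds.filter(col("subject") === "math").explain(true)
ds.filter((r: Record) => r.subject == "math").explain(true)
```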