What is the maximum number of columns we can have in a DataFrame (Spark Scala)?

Updated: 2023-02-06 18:49:42

Sparing you the details, the answer is yes: there is a limit on the number of columns in Apache Spark.
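
To make that concrete, here is a minimal Spark Scala sketch (mine, not part of the original discussion) that builds a deliberately wide, single-row DataFrame; the width of 1000 and the generated column names are arbitrary illustrations:

    import org.apache.spark.sql.SparkSession
    import org.apache.spark.sql.functions.{col, lit}

    val spark = SparkSession.builder()
      .appName("wide-df-sketch")
      .master("local[*]")
      .getOrCreate()

    // Build a single-row DataFrame with an arbitrary number of integer columns.
    // 1000 is just an illustration; the practical ceiling comes from planner and
    // JVM memory overhead long before any hard Int.MaxValue-style bound.
    val numCols = 1000
    val base = spark.range(1).toDF("id")
    val extraCols = (1 to numCols).map(i => lit(i).as(s"c$i"))
    val wide = base.select((col("id") +: extraCols): _*)

    println(s"Number of columns: ${wide.columns.length}")  // 1001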

Theoretically speaking, this limit depends on the platform and on the size of the elements in each column.

Don't forget that Java is limited by the size of the JVM, and an executor is also limited by that size (the maximum object size in the Java heap).
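
For context, those JVM heaps are sized through Spark's memory settings. A minimal sketch with placeholder values (not recommendations from this answer); note that spark.driver.memory is normally set via spark-submit or spark-defaults.conf instead, because the driver JVM is already running by the time application code reaches the builder:

    import org.apache.spark.sql.SparkSession

    // Placeholder heap sizes -- tune for your own cluster.
    val spark = SparkSession.builder()
      .appName("heap-config-sketch")
      .master("local[*]")
      .config("spark.executor.memory", "8g")        // executor JVM heap
      .config("spark.driver.maxResultSize", "2g")   // cap on data collected back to the driver
      .getOrCreate()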

I would refer back to "Why does Spark RDD partition has 2GB limit for HDFS?", which discusses the HDFS limitation on block/partition size.
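
As a rough illustration of staying under that block/partition size, a small sketch (my own assumption of a typical workaround, not from the answer): raise the partition count so each individual partition stays well below ~2GB; the figure of 400 is arbitrary:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("partition-size-sketch")
      .master("local[*]")
      .getOrCreate()

    // Toy stand-in for a large table; the real concern is partitions approaching ~2GB.
    val df = spark.range(0, 100000000L)

    // More partitions means smaller individual partitions/blocks. 400 is arbitrary.
    val repartitioned = df.repartition(400)
    println(repartitioned.rdd.getNumPartitions)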

So there are actually many restrictions to take into account.

This means that you can easily find a hard limit (Int.MaxValue, for example), but what is more important is that Spark scales well only with long and relatively thin data (as stated by pault).
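
To show what "long and relatively thin" means in practice, here is a minimal sketch (with made-up column names c1..c3) that unpivots a wide DataFrame into a long key/value layout using the built-in stack SQL function:

    import org.apache.spark.sql.SparkSession

    val spark = SparkSession.builder()
      .appName("long-vs-wide-sketch")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Wide shape: one row per id, one column per measurement.
    val wideDf = Seq((1, 10, 20, 30), (2, 40, 50, 60)).toDF("id", "c1", "c2", "c3")

    // Long ("thin") shape: one row per (id, key) pair -- the layout Spark scales best with.
    val longDf = wideDf.selectExpr(
      "id",
      "stack(3, 'c1', c1, 'c2', c2, 'c3', c3) as (key, value)"
    )
    longDf.show()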

Finally, you need to remember that fundamentally you cannot split a single record between executors/partitions, and there are a number of practical limitations (GC, disk I/O) that make very wide data impractical. Not to mention some known bugs.

Note: I mention @pault and @RameshMaharjan because this answer is actually the fruit of the discussion we had (and of course @zero323 for his comment on the other answer).