Updated: 2023-02-06 18:49:42
Sparing you the details, the answer is yes: there is a limit on the number of columns in Apache Spark.
Theoretically speaking, this limit depends on the platform and on the size of the elements in each column.
Don't forget that Java is limited by the size of the JVM, and an executor is also limited by that size - specifically, the maximum size of a single object on the JVM heap.
I would also refer back to Why does Spark RDD partition have a 2GB limit for HDFS?, which covers the HDFS limitation on block/partition size.
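To get a feel for how that partition-size cap interacts with column count, here is a rough back-of-envelope sketch. The names (`max_columns_per_row`) and the assumption of 8-byte numeric columns with no encoding overhead are hypothetical simplifications, not anything Spark computes for you:

```python
# Hypothetical back-of-envelope estimate: how many 8-byte columns fit in a
# partition that must stay under the historical ~2 GB byte-array limit.
# Real Spark storage adds serialization/encoding overhead, so treat this
# as an upper bound on the order of magnitude only.
TWO_GB = 2 * 1024**3        # ~2 GB partition/byte-array cap, in bytes
BYTES_PER_DOUBLE = 8        # one 64-bit value per column, no overhead

def max_columns_per_row(rows_per_partition: int) -> int:
    """Crude ceiling on column count so one partition fits under 2 GB."""
    return TWO_GB // (rows_per_partition * BYTES_PER_DOUBLE)

print(max_columns_per_row(1))          # a single enormous row
print(max_columns_per_row(1_000_000))  # a million rows per partition
```

The point of the sketch: the more rows you pack into a partition, the fewer columns each row can afford before you hit a hard byte limit, which is one reason wide data stops scaling.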
So there are actually many restrictions to take into account.
This means you can easily hit a hard limit (e.g. Int.MaxValue), but more importantly, Spark scales well only with long and relatively thin data (as stated by pault).
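The usual workaround for "long and thin beats wide" is to unpivot (melt) wide rows into `(id, column, value)` triples. This is a plain-Python sketch of the idea only - the `melt` helper is hypothetical and not a Spark API (in Spark you would reach for `stack` or an explode-based transform):

```python
# Hypothetical sketch: reshape one wide record (many columns) into the
# long-and-thin (id, column, value) layout that Spark handles best.
def melt(row_id, wide_row):
    """Turn a wide row, given as a dict of column -> value, into triples."""
    return [(row_id, col, val) for col, val in wide_row.items()]

wide = {"f1": 0.1, "f2": 0.2, "f3": 0.3}
print(melt("row-0", wide))
# [('row-0', 'f1', 0.1), ('row-0', 'f2', 0.2), ('row-0', 'f3', 0.3)]
```

Each triple is now a small record, so the dataset grows in rows rather than columns and partitions stay narrow.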
Finally, remember that fundamentally you cannot split a single record across executors/partitions, and there are a number of practical limitations (GC, disk IO) that make very wide data impractical. Not to mention some known bugs.
Note: I mention @pault and @RameshMaharjan because this answer is the fruit of a discussion we had (and of course @zero323 for his comment on the other answer).