
How do I create a Spark RDD from an iterator?

Updated: 2023-09-11 21:09:52

As somebody else said, you could do something with Spark Streaming, but in pure Spark you can't, because what you're asking for goes against Spark's model. Let me explain. To distribute and parallelize work, Spark has to divide it into chunks. When reading from HDFS, that 'chunking' is done for Spark by HDFS, since HDFS files are organized in blocks; Spark will generally generate one task per block. Now, an iterator only provides sequential access to your data, so it's impossible for Spark to organize it into chunks without first reading all of it into memory.
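To make that concrete, here is a minimal Scala sketch (assuming a local SparkContext named sc; the app name and partition count are arbitrary) of what this limitation means in practice: the only way to get an iterator's data into an RDD in pure Spark is to drain the iterator into an in-memory collection first, and only then let sc.parallelize() slice it into partitions.

```scala
import org.apache.spark.{SparkConf, SparkContext}

object IteratorToRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iterator-to-rdd").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // An iterator only supports sequential access; Spark cannot split it.
    val it: Iterator[Int] = Iterator.range(0, 1000)

    // Drain the iterator into memory so Spark has a sized collection to chunk.
    val buffered: List[Int] = it.toList

    // parallelize() slices the in-memory collection into partitions
    // (roughly one task per partition, analogous to one task per HDFS block).
    val rdd = sc.parallelize(buffered, numSlices = 4)
    println(rdd.sum()) // 499500.0

    sc.stop()
  }
}
```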

It may be possible to build an RDD that has a single iterable partition, but even then there is no way to know whether the implementation of the Iterable could be sent to workers. When you use sc.parallelize(), Spark creates partitions that implement Serializable so that each partition can be sent to a different worker. The iterable could be backed by a network connection or by a file in the local FS, so it cannot be sent to the workers unless its contents are buffered in memory.
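The serialization point can be illustrated with a short sketch (the file path /tmp/data.txt is hypothetical, and the SparkContext setup mirrors the one above): an iterator backed by an open local-file handle is not Serializable and cannot be shipped to a worker as-is, but once its contents are buffered into an in-memory List, sc.parallelize() can distribute them.

```scala
import scala.io.Source
import org.apache.spark.{SparkConf, SparkContext}

object IterableToRdd {
  def main(args: Array[String]): Unit = {
    val conf = new SparkConf().setAppName("iterable-to-rdd").setMaster("local[*]")
    val sc = new SparkContext(conf)

    // This iterator is backed by an open handle on a local file
    // (hypothetical path): it is not Serializable and cannot be
    // shipped to a worker as-is.
    val source = Source.fromFile("/tmp/data.txt")
    val lines: Iterator[String] = source.getLines()

    // Buffering into an in-memory List -- which is Serializable --
    // is what allows each partition to be sent to a different worker.
    val inMemory: List[String] = lines.toList
    source.close()

    val rdd = sc.parallelize(inMemory)
    println(rdd.count())

    sc.stop()
  }
}
```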