
Reading multiple files compressed in a tar.gz archive into Spark

Updated: 2021-10-31 21:44:28

A solution is given in "Read whole text files from a compression in Spark". Using the code sample provided there, I was able to create a DataFrame from the compressed archive like so:

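// extractFiles and decode are the helpers defined in the linked answer.
val jsonRDD = sc.binaryFiles("gzarchive/*").
               flatMapValues(x => extractFiles(x).toOption).
               mapValues(_.map(decode()))

// Flatten the per-archive arrays of JSON strings into a single RDD[String].
val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))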
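
The snippet above relies on two helpers, extractFiles and decode, that come from the linked answer and are not reproduced here. What follows is only a minimal sketch of what such helpers might look like, assuming Apache Commons Compress is on the classpath. The object name TarGzHelpers, the function name extractEntries and the bufferSize parameter are illustrative choices, not names from the original answer.

```scala
import java.io.{ByteArrayOutputStream, InputStream}
import java.nio.charset.{Charset, StandardCharsets}
import scala.util.Try
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream

object TarGzHelpers {
  // Reads every regular file in a gzip-compressed tar stream fully into memory.
  // Returns one byte array per file, wrapped in Try so corrupt archives can be skipped.
  def extractEntries(in: InputStream, bufferSize: Int = 8192): Try[Array[Array[Byte]]] = Try {
    val tar = new TarArchiveInputStream(new GzipCompressorInputStream(in))
    try {
      Iterator
        .continually(tar.getNextTarEntry) // null marks the end of the archive
        .takeWhile(_ != null)
        .filter(!_.isDirectory)           // directories carry no payload
        .map { _ =>
          // tar.read returns -1 once the current entry is exhausted
          val out = new ByteArrayOutputStream()
          val buffer = new Array[Byte](bufferSize)
          var n = tar.read(buffer)
          while (n != -1) { out.write(buffer, 0, n); n = tar.read(buffer) }
          out.toByteArray
        }
        .toArray
    } finally tar.close()
  }

  // Curried so that decode() can be passed directly to map.
  def decode(charset: Charset = StandardCharsets.UTF_8)(bytes: Array[Byte]): String =
    new String(bytes, charset)
}
```

To match the call shape used above, a thin wrapper such as def extractFiles(ps: PortableDataStream) = TarGzHelpers.extractEntries(ps.open()) would adapt this to the values that sc.binaryFiles produces. Note that every file of an archive is held in memory at once by a single task.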

This method works fine for relatively small tar archives, but is not suitable for large ones, since each archive is read and unpacked in memory by a single task.

A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles, which are splittable and can therefore be read and processed in parallel in Spark (unlike tar archives).

See: A Million Little Files, from Digital Digressions by Stuart Sierra.
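
As a rough illustration of the conversion step, here is a sketch that copies each file of a local .tar.gz into one SequenceFile record. It is not the program from the linked post. It assumes Hadoop and Apache Commons Compress are on the classpath, and the object name TarToSeq, the method convert and both path parameters are hypothetical.

```scala
import java.io.{ByteArrayOutputStream, FileInputStream}
import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.compress.compressors.gzip.GzipCompressorInputStream
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

object TarToSeq {
  // Hypothetical helper: writes each regular file in the archive as one record
  // (key = entry name, value = raw bytes). Returns the number of records written.
  def convert(tarGzPath: String, seqPath: String): Int = {
    val tar = new TarArchiveInputStream(
      new GzipCompressorInputStream(new FileInputStream(tarGzPath)))
    try {
      val writer = SequenceFile.createWriter(
        new Configuration(),
        SequenceFile.Writer.file(new Path(seqPath)),
        SequenceFile.Writer.keyClass(classOf[Text]),
        SequenceFile.Writer.valueClass(classOf[BytesWritable]))
      try {
        var count = 0
        val buffer = new Array[Byte](8192)
        var entry = tar.getNextTarEntry
        while (entry != null) {
          if (!entry.isDirectory) {
            // Only one file is buffered at a time, unlike the binaryFiles approach.
            val out = new ByteArrayOutputStream()
            var n = tar.read(buffer)
            while (n != -1) { out.write(buffer, 0, n); n = tar.read(buffer) }
            writer.append(new Text(entry.getName), new BytesWritable(out.toByteArray))
            count += 1
          }
          entry = tar.getNextTarEntry
        }
        count
      } finally writer.close()
    } finally tar.close()
  }
}
```

On the Spark side, the result could then be loaded with sc.sequenceFile(path, classOf[Text], classOf[BytesWritable]). Hadoop reuses Writable instances between records, so it is safer to copy each pair out right away, for example with k.toString and v.copyBytes(), before caching or collecting.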