Updated: 2021-10-31 21:44:28
A solution is given in Read whole text files from a compression in Spark.
Using the code sample provided there, I was able to create a DataFrame
from the compressed archive like so:
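// extractFiles and decode are the helpers defined in the linked answer.
val jsonRDD = sc.binaryFiles("gzarchive/*").
  flatMapValues(x => extractFiles(x).toOption).
  mapValues(_.map(decode()))
// Flatten the decoded file contents of all archives into one collection of JSON strings.
val df = sqlContext.read.json(jsonRDD.map(_._2).flatMap(x => x))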
This method works fine for relatively small tar archives, but is not suitable for large ones.
A better solution to the problem seems to be to convert the tar archives to Hadoop SequenceFiles,
which are splittable and hence can be read and processed in parallel in Spark (as opposed to tar archives).
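
The conversion could be sketched roughly as follows. This is only an illustration under assumptions, not a tested production tool: it assumes a gzip-compressed tar on the local file system, uses Apache Commons Compress and Commons IO to walk the archive, and stores each file as a (file name, raw bytes) record. The object name TarToSequenceFile, the method convert, and both path parameters are hypothetical names chosen for this example.

```scala
import java.io.{BufferedInputStream, FileInputStream}
import java.util.zip.GZIPInputStream

import org.apache.commons.compress.archivers.tar.TarArchiveInputStream
import org.apache.commons.io.IOUtils
import org.apache.hadoop.conf.Configuration
import org.apache.hadoop.fs.Path
import org.apache.hadoop.io.{BytesWritable, SequenceFile, Text}

// Hypothetical helper: copies every regular file in a .tar.gz archive
// into one SequenceFile record, keyed by the entry name.
object TarToSequenceFile {
  def convert(tarGzPath: String, seqPath: String): Int = {
    val conf = new Configuration()
    val writer = SequenceFile.createWriter(
      conf,
      SequenceFile.Writer.file(new Path(seqPath)),
      SequenceFile.Writer.keyClass(classOf[Text]),
      SequenceFile.Writer.valueClass(classOf[BytesWritable]))
    val tar = new TarArchiveInputStream(
      new GZIPInputStream(new BufferedInputStream(new FileInputStream(tarGzPath))))
    var count = 0
    try {
      var entry = tar.getNextTarEntry
      while (entry != null) {
        if (entry.isFile) {
          // Reading the tar stream stops at the end of the current entry.
          val bytes = IOUtils.toByteArray(tar)
          writer.append(new Text(entry.getName), new BytesWritable(bytes))
          count += 1
        }
        entry = tar.getNextTarEntry
      }
    } finally {
      tar.close()
      writer.close()
    }
    count // number of files written as records
  }
}
```

Once the SequenceFile has been written (and copied to HDFS if needed), it should be readable in Spark with something like sc.sequenceFile[String, Array[Byte]]("archive.seq"), after which the byte arrays can be decoded to strings and passed to sqlContext.read.json as before. Note that each file is held in memory as a single record, so very large individual files inside the archive would still need different handling.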