Filtering files by a specific pattern when reading a tar.gz archive in PySpark

Updated: 2022-12-08 13:02:28

Based on this post, you can read the .tar.gz file as a binary file with sc.binaryFiles, then use Python's tarfile module to extract the archive members and filter their file names with the regex def_[1-9]. The result is an RDD that you can convert into a DataFrame:

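import re
import tarfile
from io import BytesIO

# Keep only the regular files whose name matches 'def_[1-9].csv',
# whether they sit at the archive root or inside a folder
def extract_files(content):
    with tarfile.open(fileobj=BytesIO(content), mode="r:gz") as tar:
        return [tar.extractfile(m).read() for m in tar
                if m.isfile() and re.search(r"(^|/)def_[1-9]\.csv$", m.name)]

# Read the archive as a binary file (sc is the active SparkContext),
# then split each matching file into lines and each line into fields
rdd = sc.binaryFiles("/path/myfolder.tar.gz") \
        .mapValues(extract_files) \
        .flatMap(lambda row: [x.decode("utf-8").splitlines() for x in row[1]]) \
        .flatMap(lambda lines: [line.split(",") for line in lines if line])

# csv_cols is assumed to be a list of column names defined elsewhere
df = rdd.toDF(csv_cols)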
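
The member-filtering step can be tried locally, without Spark, to check that the regex selects the files you expect. The following is only a minimal sketch under stated assumptions: the archive is built in memory, and the member names, their contents, and the helpers build_sample_archive and extract_rows are hypothetical and are not part of the original answer. It also swaps the plain split(",") for Python's csv module, which handles quoted fields that contain commas.

```python
import csv
import io
import re
import tarfile

# Matches def_1.csv ... def_9.csv at the archive root or inside any folder
PATTERN = re.compile(r"(^|/)def_[1-9]\.csv$")


def build_sample_archive():
    """Build a small in-memory .tar.gz with made-up members for the demo."""
    members = {
        "myfolder/abc_1.csv": b"a,b\n1,2\n",
        "myfolder/def_1.csv": b"x,y\n3,4\n",
        "myfolder/def_2.csv": b"x,y\n5,6\n",
        "myfolder/def_10.csv": b"x,y\n7,8\n",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, data in members.items():
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def extract_rows(content):
    """Return the parsed CSV rows of every member whose name matches PATTERN."""
    rows = []
    with tarfile.open(fileobj=io.BytesIO(content), mode="r:gz") as tar:
        for member in tar:
            # Skip directories and links; keep only matching regular files
            if member.isfile() and PATTERN.search(member.name):
                text = tar.extractfile(member).read().decode("utf-8")
                rows.extend(csv.reader(io.StringIO(text)))
    return rows


rows = extract_rows(build_sample_archive())
print(rows)
```

Only def_1.csv and def_2.csv are selected here; abc_1.csv and def_10.csv are skipped. Header lines are kept as ordinary rows, so they would need to be dropped separately if the files have headers. In Spark, a function like this could probably be applied with sc.binaryFiles(path).flatMap(lambda kv: extract_rows(kv[1])), since binaryFiles yields (path, content) pairs.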
