且构网

Filtering files by a specific pattern when reading a tar.gz archive in PySpark

Updated: 2022-12-08 13:02:28

Based on this post, you can read the .tar.gz file with binaryFiles, then use Python's tarfile module to extract the archive members and keep only the files whose names match the regex def_[1-9]. The result is an RDD that you can convert into a DataFrame:

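import os
import re
import tarfile
from io import BytesIO

# Extract only the archive members whose file name matches the regex 'def_[1-9]\.csv'
def extract_files(data):
    # Match on the basename so members stored under a folder prefix are also found
    with tarfile.open(fileobj=BytesIO(data), mode="r:gz") as tar:
        return [tar.extractfile(m).read() for m in tar
                if m.isfile() and re.fullmatch(r"def_[1-9]\.csv", os.path.basename(m.name))]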

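# Read the archive as a binary file, split the matching CSV files into rows, then convert to a DataFrame
rdd = sc.binaryFiles("/path/myfolder.tar.gz") \
        .mapValues(extract_files) \
        .flatMap(lambda row: [x.decode("utf-8").splitlines() for x in row[1]]) \
        .flatMap(lambda lines: [e.split(",") for e in lines if e])

# csv_cols is the list of column names; RDD.toDF takes it as a single schema argument
df = rdd.toDF(csv_cols)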
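The filtering and splitting logic can be checked locally without a Spark cluster. The following is a minimal sketch using only the standard library: it builds a small .tar.gz in memory, keeps the members whose basename matches the pattern, and splits them into rows the same way the two flatMap steps do. The helper names (make_archive, extract_matching, to_rows) and the sample member names and contents are assumptions made for illustration; they do not come from the original answer.

```python
import io
import os
import re
import tarfile


def make_archive(files):
    """Build an in-memory .tar.gz from a {member_name: text} dict (hypothetical test helper)."""
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            payload = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(payload)
            tar.addfile(info, io.BytesIO(payload))
    return buf.getvalue()


def extract_matching(data, pattern=r"def_[1-9]\.csv"):
    """Return {member_name: bytes} for regular files whose basename fully matches the pattern."""
    with tarfile.open(fileobj=io.BytesIO(data), mode="r:gz") as tar:
        return {
            m.name: tar.extractfile(m).read()
            for m in tar
            if m.isfile() and re.fullmatch(pattern, os.path.basename(m.name))
        }


def to_rows(contents):
    """Split each decoded file into non-empty lines, then each line into comma-separated fields."""
    return [
        line.split(",")
        for content in contents
        for line in content.decode("utf-8").splitlines()
        if line
    ]


# Sample data (made up for this sketch): only def_1.csv should pass the filter
archive = make_archive({
    "myfolder/def_1.csv": "1,a\n2,b\n",
    "myfolder/def_10.csv": "9,z\n",
    "myfolder/abc_1.csv": "7,x\n",
})
matched = extract_matching(archive)
rows = to_rows(matched.values())
print(sorted(matched))
print(rows)
```

Matching on the basename with re.fullmatch is a design choice in this sketch: members inside a tarred folder usually carry the folder prefix in their name, so anchoring the pattern at the start of the full member name would reject them, and an unescaped dot would also accept names such as def_1xcsv.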