Updated: 2022-12-08 13:02:28
Based on this post, you can read the .tar.gz file with binaryFiles, then use Python's tarfile module to extract the archive members and keep only those whose file names match the regex def_[1-9]. The result is an RDD that you can convert into a DataFrame:
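import os
import re
import tarfile
from io import BytesIO

# Extract only the archive members whose file name matches 'def_[1-9].csv'
def extract_files(archive_bytes):
    with tarfile.open(fileobj=BytesIO(archive_bytes), mode="r:gz") as tar:
        # Match on the base name so members stored under a folder prefix are found too
        return [tar.extractfile(x).read() for x in tar
                if x.isfile() and re.fullmatch(r"def_[1-9]\.csv", os.path.basename(x.name))]

# Read the archive as a binary file, split each extracted file into lines,
# then split each non-empty line into fields
rdd = sc.binaryFiles("/path/myfolder.tar.gz") \
    .mapValues(extract_files) \
    .flatMap(lambda row: [x.decode("utf-8").splitlines() for x in row[1]]) \
    .flatMap(lambda lines: [e.split(",") for e in lines if e])

# csv_cols is the list of column names, assumed to be defined elsewhere;
# RDD.toDF takes the names as a single list, not as unpacked arguments
df = rdd.toDF(csv_cols)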
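
If you want to check the extraction and filtering step without a Spark cluster, a minimal local sketch along these lines may help. It builds a small .tar.gz in memory and runs the same tarfile and regex logic on it. The member names, file contents and the helper names make_sample_archive, extract_matching and parse_rows are all made up for illustration and are not part of the original answer; the sketch also assumes that members sit under a folder prefix and therefore matches on the base name.

```python
import csv
import io
import os
import re
import tarfile


def make_sample_archive():
    """Build a small .tar.gz in memory (hypothetical file names and contents)."""
    files = {
        "myfolder/def_1.csv": "a,b\n1,2\n",
        "myfolder/def_2.csv": "a,b\n3,4\n",
        "myfolder/abc_1.csv": "x,y\n9,9\n",
    }
    buf = io.BytesIO()
    with tarfile.open(fileobj=buf, mode="w:gz") as tar:
        for name, text in files.items():
            data = text.encode("utf-8")
            info = tarfile.TarInfo(name=name)
            info.size = len(data)
            tar.addfile(info, io.BytesIO(data))
    return buf.getvalue()


def extract_matching(archive_bytes, pattern=r"def_[1-9]\.csv"):
    """Return the decoded text of every regular file whose base name matches pattern."""
    with tarfile.open(fileobj=io.BytesIO(archive_bytes), mode="r:gz") as tar:
        return [
            tar.extractfile(member).read().decode("utf-8")
            for member in tar
            if member.isfile()
            and re.fullmatch(pattern, os.path.basename(member.name))
        ]


def parse_rows(text):
    """Split CSV text into rows of fields, skipping blank lines."""
    return [row for row in csv.reader(io.StringIO(text)) if row]


archive = make_sample_archive()
rows = [row for text in extract_matching(archive) for row in parse_rows(text)]
print(rows)  # [['a', 'b'], ['1', '2'], ['a', 'b'], ['3', '4']]
```

Note that the header line of each file comes through as an ordinary row here. If your CSV files have headers, you would probably want to drop the first row of each file before building the DataFrame.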