且构网


PySpark: load tar.gz files into a dataframe and filter by filename

Updated: 2022-05-23 08:53:33

Databricks does not support iterating over *.tar.gz archives directly. To process the files, they must first be extracted to a temporary location. Databricks supports bash cells (%sh), which can do the job:

%sh find $source -name '*.tar.gz' -exec tar -xvzf {} -C $destination \;
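If you would rather stay in Python instead of shelling out, the same extraction can be sketched with the standard-library tarfile module. This is a minimal sketch, not Databricks-specific API; the /dbfs mount paths below are hypothetical examples:

```python
import pathlib
import tarfile

def extract_all(source, destination, pattern='*.tar.gz'):
    """Extract every archive under source matching pattern into destination."""
    dest = pathlib.Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    # rglob walks subdirectories, mirroring the behaviour of `find`
    for archive in pathlib.Path(source).rglob(pattern):
        with tarfile.open(archive, 'r:gz') as tar:
            tar.extractall(dest)

# Hypothetical DBFS mount paths:
# extract_all('/dbfs/mnt/dl/raw/source/', '/dbfs/mnt/dl/raw/destination/')
```

Note that on Databricks the driver-local filesystem sees DBFS under the /dbfs prefix, so plain Python file APIs like this work against mounted storage.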

The command above extracts every file with the *.tar.gz extension in the source directory to the destination location. If the path is passed via dbutils.widgets, or defined statically in a %scala or %pyspark cell, it must be exported as an environment variable so the %sh cell can see it. This can be done in %pyspark:

import os

# Export the paths so the %sh cell can read them as $source and $destination
# (note: no spaces inside the variable name)
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
os.environ['destination'] = '/dbfs/mnt/dl/raw/destination/'

Assuming the extracted content is in *.csv files, use the following to load a file:

DF = (spark.read.format('csv')
      .options(header='true', inferSchema='true')
      .option('mode', 'DROPMALFORMED')
      .load('/mnt/dl/raw/source/sample.csv'))
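The title also mentions filtering by filename, which the snippet above does not show. One way, sketched below, is to select the matching paths with the standard fnmatch module and pass the resulting list to spark.read.load, which accepts a list of paths. The folder and the sales_*.csv pattern are hypothetical examples:

```python
import fnmatch
import os

def matching_csvs(folder, pattern):
    """Return full paths of files in folder whose names match the glob pattern."""
    return [os.path.join(folder, name)
            for name in sorted(os.listdir(folder))
            if fnmatch.fnmatch(name, pattern)]

# Hypothetical usage on the extracted files:
# paths = matching_csvs('/dbfs/mnt/dl/raw/destination/', 'sales_*.csv')
# DF = (spark.read.format('csv')
#       .options(header='true', inferSchema='true')
#       .load(paths))
```

Alternatively, after loading everything you can filter rows on pyspark.sql.functions.input_file_name(), which returns the source file of each row, but pre-filtering the path list avoids reading unwanted files at all.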