Updated: 2022-05-23 08:53:33
Databricks does not support iterating over *.tar.gz archives directly. To process the files, they have to be unzipped into a temporary location first. Databricks supports bash, which can do the job.
%sh find "$source" -name '*.tar.gz' -exec tar -xvzf {} -C "$destination" \;
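If you prefer to stay in a Python cell instead of %sh, the same extraction can be sketched with the standard-library tarfile module. This is an alternative to the bash one-liner, not part of the original article; the function name and paths are placeholders.

```python
import os
import tarfile

def extract_all_tarballs(source: str, destination: str) -> list:
    """Extract every *.tar.gz found under `source` into `destination`.

    Returns the list of archive paths that were processed.
    Hypothetical helper for illustration; on Databricks, `source` would be
    a /dbfs/... path such as '/dbfs/mnt/dl/raw/source/'.
    """
    processed = []
    for root, _dirs, files in os.walk(source):
        for name in files:
            if name.endswith('.tar.gz'):
                path = os.path.join(root, name)
                # 'r:gz' opens a gzip-compressed tar archive for reading
                with tarfile.open(path, 'r:gz') as tar:
                    tar.extractall(destination)
                processed.append(path)
    return processed
```

Like the find command, this walks the source tree recursively, so archives in subdirectories are picked up as well.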
The code above unzips every file with the extension *.tar.gz from the source to the destination location. If the path is passed via dbutils.widgets, or is static in %scala or %pyspark, it must be declared as an environment variable so the %sh cell can see it. This can be achieved in %pyspark:
import os
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
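Note that the %sh find command references two variables, $source and $destination, so both need to be exported. The sketch below shows this and verifies that a child shell inherits the values; the destination path is an assumption for illustration.

```python
import os
import subprocess

# Both variables referenced by the %sh cell must be exported.
# The destination path below is a placeholder assumption.
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
os.environ['destination'] = '/dbfs/mnt/dl/raw/destination/'

# Entries in os.environ are inherited by child processes, which is how a
# subsequent %sh cell can read $source and $destination.
result = subprocess.run(['sh', '-c', 'echo "$source"'],
                        capture_output=True, text=True)
print(result.stdout.strip())  # the exported source path
```

Environment variable names are case-sensitive in the shell, so the key used in os.environ must match the `$source`/`$destination` spelling used in the %sh cell exactly, with no stray spaces.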
Assuming the extracted content is in *.csv files, use the following method to load a file:
df = (spark.read.format('csv')
      .options(header='true', inferSchema='true')
      .option('mode', 'DROPMALFORMED')
      .load('/mnt/dl/raw/source/sample.csv'))
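The mode="DROPMALFORMED" option tells Spark to silently discard rows that do not match the inferred schema. The snippet below is a plain-Python illustration of that behavior (it does not use Spark): rows whose field count differs from the header are skipped.

```python
import csv
import io

# Inline sample data for illustration; the second row is malformed
# because it is missing the 'score' field.
raw = """id,name,score
1,alice,90
2,bob
3,carol,85
"""

reader = csv.reader(io.StringIO(raw))
header = next(reader)
# Keep only rows whose field count matches the header,
# analogous to Spark's DROPMALFORMED mode.
rows = [row for row in reader if len(row) == len(header)]
print(rows)
```

If you need to inspect the bad rows instead of dropping them, Spark's mode="PERMISSIVE" (the default) keeps them and nulls out the unparsable fields.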