且构网


PySpark: load tar.gz files into a dataframe and filter by filename

Updated: 2022-05-23 08:53:33

Databricks does not support iterating over *.tar.gz archives directly. To process the files, they must first be extracted to a temporary location. Databricks supports bash cells (%sh), which can do the job:

%sh find $source -name '*.tar.gz' -exec tar -xvzf {} -C $destination \;
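If you would rather stay in Python instead of shelling out, the same extraction can be sketched with the standard-library tarfile module. This is a minimal sketch, not Databricks-specific API; the /dbfs mount paths below are hypothetical examples:

```python
import pathlib
import tarfile

def extract_all(source, destination, pattern='*.tar.gz'):
    """Extract every archive under source matching pattern into destination."""
    dest = pathlib.Path(destination)
    dest.mkdir(parents=True, exist_ok=True)
    # rglob walks subdirectories, mirroring the behaviour of `find`
    for archive in pathlib.Path(source).rglob(pattern):
        with tarfile.open(archive, 'r:gz') as tar:
            tar.extractall(dest)

# Hypothetical DBFS mount paths:
# extract_all('/dbfs/mnt/dl/raw/source/', '/dbfs/mnt/dl/raw/destination/')
```

Note that on Databricks the driver-local filesystem sees DBFS under the /dbfs prefix, so plain Python file APIs like this work against mounted storage.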

The command above extracts every file with the *.tar.gz extension in the source directory to the destination location. If the path is passed via dbutils.widgets, or defined statically in a %scala or %pyspark cell, it must be exported as an environment variable so the %sh cell can see it. This can be done in %pyspark:

import os

# Export the paths so the %sh cell can read them as $source and $destination
# (note: no spaces inside the variable name)
os.environ['source'] = '/dbfs/mnt/dl/raw/source/'
os.environ['destination'] = '/dbfs/mnt/dl/raw/destination/'

Assuming the extracted content is in *.csv files, use the following to load a file:

DF = (spark.read.format('csv')
      .options(header='true', inferSchema='true')
      .option('mode', 'DROPMALFORMED')
      .load('/mnt/dl/raw/source/sample.csv'))
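The title also mentions filtering by filename, which the snippet above does not show. One way, sketched below, is to select the matching paths with the standard fnmatch module and pass the resulting list to spark.read.load, which accepts a list of paths. The folder and the sales_*.csv pattern are hypothetical examples:

```python
import fnmatch
import os

def matching_csvs(folder, pattern):
    """Return full paths of files in folder whose names match the glob pattern."""
    return [os.path.join(folder, name)
            for name in sorted(os.listdir(folder))
            if fnmatch.fnmatch(name, pattern)]

# Hypothetical usage on the extracted files:
# paths = matching_csvs('/dbfs/mnt/dl/raw/destination/', 'sales_*.csv')
# DF = (spark.read.format('csv')
#       .options(header='true', inferSchema='true')
#       .load(paths))
```

Alternatively, after loading everything you can filter rows on pyspark.sql.functions.input_file_name(), which returns the source file of each row, but pre-filtering the path list avoids reading unwanted files at all.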