How to load multiple CSV files (with different schemas) into BigQuery

Updated: 2023-12-02 20:21:04

To automate this, I would read all the files from a given bucket (or one of its subfolders) and, as an assumption, use each file's name as the name of the target table to load into. Here is how:

gsutil ls gs://mybucket/subfolder/*.csv | awk '{n=split($1,A,"/"); q=split(A[n],B,"."); print "mydataset."B[1]" "$0}' | xargs -I{} sh -c 'bq --location=US load --replace=false --autodetect --source_format=CSV {}'

Make sure to replace the location (US in the command above) and mydataset with your own values. Also note the following assumptions:

  • The first row of each CSV is assumed to be the header and is therefore treated as the column names.
  • We are writing with the --replace=false flag, meaning data is appended every time you run the command. To overwrite instead, set it to true, and each table's data will be overwritten on every run.
  • The CSV file name (the part before .csv) is used as the table name. You can modify the awk script to use a different naming rule.
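
For readability, the same pipeline can also be written as a small script. This is only a sketch under the assumptions listed above: the function names table_for_uri and load_all_csvs are hypothetical helpers made up for illustration, and mydataset, gs://mybucket/subfolder and US are the placeholders from the original command. Like the awk script, it takes the part of the file name before the first dot as the table name.

```shell
#!/usr/bin/env bash

# Hypothetical helper: build "dataset.table" from a GCS URI, using the part
# of the file name before the first "." as the table name.
table_for_uri() {
  local dataset="$1" uri="$2"
  local file="${uri##*/}"                      # drop everything up to the last "/"
  printf '%s.%s\n' "$dataset" "${file%%.*}"    # drop everything from the first "."
}

# Hypothetical helper: run one "bq load" per CSV found under the given prefix.
load_all_csvs() {
  local dataset="$1" prefix="$2" location="${3:-US}" uri
  gsutil ls "${prefix}/*.csv" | while IFS= read -r uri; do
    # stdin is redirected so bq cannot consume the list of remaining URIs
    bq --location="$location" load --replace=false --autodetect \
      --source_format=CSV "$(table_for_uri "$dataset" "$uri")" "$uri" < /dev/null
  done
}

# Usage, with the placeholder values from the one-liner:
# load_all_csvs mydataset gs://mybucket/subfolder US
```

Splitting the name derivation into its own function makes it easy to swap in a different naming rule without touching the load loop.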