且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

BigQuery 从查询中创建重复记录字段

更新时间:2023-01-29 20:48:44

解决NEST()嵌套"只有一个字段的限制的一种方法是使用BigQuery 用户定义函数.它们非常强大,但仍有一些限制限制要注意.从我的角度来看,最重要的是要记住 - 他们非常适合被视为昂贵的高计算查询

One of the way to go around NEST() limitation of "nesting" just one field is to use BigQuery User-Defined Functions. They are extremely powerful yet still have some Limits and Limitations to be aware of. And most important from my prospective to have in mind - they are quite a candidates for being qualified as expensive High-Compute queries

复杂的查询会消耗非常大的计算资源相对于处理的字节数.通常,此类查询包含大量 JOIN 或 CROSS JOIN 子句或复杂的用户定义的函数.

Complex queries can consume extraordinarily large computing resources relative to the number of bytes processed. Typically, such queries contain a very large number of JOIN or CROSS JOIN clauses or complex User-defined Functions.

所以,下面是从 questino 的例子中模仿"NEST(b, c) 的例子:

So, below is example that "mimic" NEST(b, c) from example in questino:

SELECT a, d.b, d.c FROM 
JS((      // input table
  SELECT a, NEST(CONCAT(STRING(b), ',', STRING(c))) AS d
  FROM (
    SELECT * FROM 
    (SELECT 1 AS a, 5 AS b, 2 AS c),
    (SELECT 1 AS a, 3 AS b, 1 AS c),
    (SELECT 2 AS a, 2 AS b, 1 AS c)
  ) GROUP BY a),
  a, d,     // input columns
  "[{'name': 'a', 'type': 'INTEGER'},    // output schema
    {'name': 'd', 'type': 'RECORD',
     'mode': 'REPEATED',
     'fields': [
       {'name': 'b', 'type': 'STRING'},
       {'name': 'c', 'type': 'STRING'}
     ]    
    }
  ]",
  "function(row, emit){    // function 
    var c = [];
    for (var i = 0; i < row.d.length; i++) {
      x = row.d[i].toString().split(',');
      t = {b:x[0], c:x[1]}
      c.push(t);
    };
    emit({a: row.a, d: c});  
  }"
)

比较简单.我希望你能够通过它并得到一个想法

It is relatively straightforward. I hope you will be able to walk through it and get an idea

仍然 - 记住:

无论您如何使用嵌套/重复字段创建记录 - BigQuery自动展平查询结果,因此可见结果不会包含重复字段.所以你应该把它用作产生的子选择由同一查询立即使用的中间结果.

No matter how you create record with nested/repeated fields - BigQuery automatically flattens query results, so visible results won't contain repeated fields. So you should use it as a subselect that produces intermediate results for immediate use by the same query.

作为仅供参考,您可以通过运行下面的查询来证明上面只返回两条记录(而不是三条记录,因为它是扁平的)

As FYI, you can prove for yourself that above returns only two records (not three as it is looks like when it is flattened) by running below query

SELECT COUNT(1) AS rows FROM (
  <above query here>
) 

另一个重要注意事项:
众所周知,NEST()UnFlatten Results 输出不兼容,主要用于子查询中的中间结果.
相比之下,上面的解决方案可以很容易地直接保存到表格中(带有未选中的展平结果)

Another important NOTE:
This is a known that NEST() is not compatible with UnFlatten Results Output and mostly is used for intermediate result in subquery.
In contrast, above solution can be easily saved directly to table (with unchecked Flatten Results)