BigQuery - removing duplicate records sometimes takes a very long time

Updated: 2023-01-29 22:00:05

It could be that you have many duplicate values for a particular id, so computing row numbers takes a long time. To check whether this is the case, you can try:

#standardSQL
SELECT id, COUNT(*) AS id_count
FROM rawData.stock_movement
GROUP BY id
ORDER BY id_count DESC LIMIT 5;

With that said, it may be faster to remove duplicates with this query instead:

#standardSQL
-- For each id, keep only the row with the latest timestamp.
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM rawData.stock_movement AS t
  GROUP BY t.id
);

Here is an example:

#standardSQL
WITH T AS (
  SELECT 1 AS id, 'foo' AS x, TIMESTAMP '2017-04-01' AS timestamp UNION ALL
  SELECT 2, 'bar', TIMESTAMP '2017-04-02' UNION ALL
  SELECT 1, 'baz', TIMESTAMP '2017-04-03')
SELECT latest_row.*
FROM (
  SELECT ARRAY_AGG(t ORDER BY timestamp DESC LIMIT 1)[OFFSET(0)] AS latest_row
  FROM T AS t
  GROUP BY t.id
);
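
This returns one row per id: (1, 'baz', TIMESTAMP '2017-04-03') and (2, 'bar', TIMESTAMP '2017-04-02'); the earlier 'foo' row for id 1 is dropped.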

The reason this may be faster is that, for each id, BigQuery only needs to keep the row with the largest timestamp in memory at any given point, rather than materializing and numbering every duplicate row.
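
For comparison, here is a minimal sketch of the row-number approach mentioned above, assuming rawData.stock_movement has id and timestamp columns. It has to number every duplicate within each id before discarding all but the latest row, which is what gets expensive when some ids are heavily duplicated:

#standardSQL
-- Row-number approach for comparison: ranks every duplicate within
-- each id before keeping only the latest row per id.
SELECT * EXCEPT (rn)
FROM (
  SELECT
    *,
    ROW_NUMBER() OVER (PARTITION BY id ORDER BY timestamp DESC) AS rn
  FROM rawData.stock_movement
)
WHERE rn = 1;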
