
Periodically importing large data (JSON) into Firebase

Updated: 2022-06-19 21:16:26

I am finally posting the answer, as it aligns with the new Google Cloud Platform tooling of 2017.

The newly introduced Google Cloud Functions have a limited run time of approximately 9 minutes (540 seconds). However, Cloud Functions can create a node.js read stream from Cloud Storage like so (@google-cloud/storage on npm):

var gcs = require('@google-cloud/storage')({
  // You don't need extra authentication when the function runs
  // online in the same project; keyFilename is only needed locally.
  projectId: 'grape-spaceship-123',
  keyFilename: '/path/to/keyfile.json'
});

// Reference an existing bucket. 
var bucket = gcs.bucket('json-upload-bucket');

var remoteReadStream = bucket.file('superlarge.json').createReadStream();

Even though it is a remote stream, it is highly efficient. In tests I was able to parse JSON files larger than 3 GB in under 4 minutes while doing simple JSON transformations.

As we are now working with node.js streams, any JSONStream library can efficiently transform the data on the fly (JSONStream on npm), handling the data asynchronously just like a large array with event streams (event-stream on npm).

var JSONStream = require('JSONStream')
var es = require('event-stream')

remoteReadStream.pipe(JSONStream.parse('objects.*'))
  .pipe(es.map(function (data, callback) {
    console.error(data)
    // Insert data into Firebase here.
    callback(null, data) // Return data if you want to make further transformations.
  }))

In the callback of the last stage of the pipe, return only null (i.e. callback(null, null)), so that processed objects are released instead of accumulating; otherwise the retained memory can block the whole function.

If you do heavier transformations that require a longer run time, either use a "job db" in Firebase to track your progress, doing only, say, 100,000 transformations per invocation and then calling the function again, or set up an additional function that listens to inserts into a "forimport db" and asynchronously transforms the raw JSON object records into your target format and production system. This splits import from computation.
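A minimal sketch of the batching idea behind the "job db" approach. The processBatch function, BATCH_SIZE, and the numeric cursor are assumptions for illustration; reading and persisting the cursor in Firebase, and re-triggering the function, are left as comments:

```javascript
var BATCH_SIZE = 100000 // transformations per invocation (assumption)

// Process up to BATCH_SIZE records starting at `cursor` and return the
// next cursor, or null once every record has been transformed.
function processBatch(records, cursor, transform) {
  var end = Math.min(cursor + BATCH_SIZE, records.length)
  for (var i = cursor; i < end; i++) {
    transform(records[i]) // e.g. write the transformed record into Firebase
  }
  // Persist `end` into the "job db" here so the next invocation resumes.
  return end < records.length ? end : null
}
```

Each invocation would read the stored cursor, process one batch, store the new cursor, and trigger itself again (for example via a Pub/Sub message) until processBatch returns null.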

Additionally, you can run Cloud Functions code in a node.js App Engine app, but not necessarily the other way around.