Best way to quickly process large amounts of CSV data

Updated: 2023-02-26 13:18:03

I have large CSV datasets (10M+ lines) that need to be processed. Two other files have to be referenced for the output; they contain data that supplements what we know about the millions of lines in the main CSV file. The goal is to output a new CSV file in which each record is merged with the additional information from the other files.

Imagine that the large CSV file contains transactions, while the customer information and billing information are recorded in two other files, and we want to output a new CSV that links each transaction to the customer ID, account ID, etc.

A colleague has a working Java program that does this, but it is very slow, apparently because the CSV file with the millions of lines has to be walked through many, many times.

My question is (yes, I am getting to it): how should I approach this in Ruby? The goal is for it to be faster; right now it takes 18+ hours with very little CPU activity.

Can I load this many records into memory? If so, how should I do it?

I know this is a bit vague. I'm just looking for ideas, as this is new to me.
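
To make the merge concrete, here is a minimal sketch of the "load it into memory" idea: if the two reference files are small enough to fit in RAM, index them in hashes keyed by the join column and stream the large file through exactly once. All file and column names below are invented for illustration:

```ruby
require 'csv'

# Index the two small reference files in memory, keyed by customer ID.
# (File and column names are hypothetical.)
customers = {}
CSV.foreach('customers.csv', headers: true) do |row|
  customers[row['customer_id']] = row
end

accounts = {}
CSV.foreach('billing.csv', headers: true) do |row|
  accounts[row['customer_id']] = row
end

# Stream the 10M+ line file once, writing each enriched row as we go,
# so the transactions themselves never have to fit in memory.
CSV.open('merged.csv', 'w') do |out|
  out << %w[transaction_id customer_id account_id customer_name]
  CSV.foreach('transactions.csv', headers: true) do |tx|
    cust = customers[tx['customer_id']]
    acct = accounts[tx['customer_id']]
    out << [tx['transaction_id'],
            tx['customer_id'],
            acct && acct['account_id'],
            cust && cust['name']]
  end
end
```

Only the two small files live in memory; each transaction is read, enriched, and written in a single pass over the big file.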

How about using a database?

Jam the records into tables, and then query them out using joins.

The import might take a while, but the DB engine will be optimized for the join and retrieval part...
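
As a sketch of what that could look like in Ruby, assuming SQLite via the sqlite3 gem (the schema, file names, and column names are all invented for illustration):

```ruby
require 'csv'
require 'sqlite3'

db = SQLite3::Database.new('merge.db')
db.execute_batch(<<~SQL)
  CREATE TABLE IF NOT EXISTS transactions (transaction_id TEXT, customer_id TEXT);
  CREATE TABLE IF NOT EXISTS customers    (customer_id TEXT PRIMARY KEY, name TEXT);
  CREATE TABLE IF NOT EXISTS accounts     (customer_id TEXT, account_id TEXT);
SQL

# Bulk-load inside one transaction: committing per row is what
# usually makes naive imports take hours.
db.transaction do
  insert = db.prepare('INSERT INTO transactions VALUES (?, ?)')
  CSV.foreach('transactions.csv', headers: true) do |row|
    insert.execute(row['transaction_id'], row['customer_id'])
  end
  insert.close
  # (customers.csv and billing.csv would be loaded the same way.)
end

# Index the join key so each lookup is not a full table scan.
db.execute('CREATE INDEX IF NOT EXISTS idx_accounts ON accounts(customer_id)')

# Let the engine do the join, streaming rows straight into the output CSV.
CSV.open('merged.csv', 'w') do |out|
  out << %w[transaction_id customer_id account_id name]
  db.execute(<<~SQL) { |row| out << row }
    SELECT t.transaction_id, t.customer_id, a.account_id, c.name
    FROM transactions t
    LEFT JOIN customers c ON c.customer_id = t.customer_id
    LEFT JOIN accounts  a ON a.customer_id = t.customer_id
  SQL
end
```

The two details that matter most for speed are wrapping the bulk insert in a single transaction and indexing the join key; without them, both the import and the join degrade badly.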