
Need a time-efficient way to import a large CSV file into multiple MySQL tables with PHP

Updated: 2023-01-21 13:54:26

I have written PHP scripts to bulk-load the data published by the Stack Overflow data dump. I import millions of rows and it doesn't take that long.

Here are some tips:


  • Don't rely on autocommit. The overhead of starting and committing a transaction for every row is enormous. Use explicit transactions, and commit after every 1000 rows (or more).
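A minimal sketch of the batch-commit pattern, shown with Python's stdlib sqlite3 so it runs anywhere; with PHP you would use PDO's `beginTransaction()`/`commit()` the same way. The table name `posts`, the batch size, and the generated CSV data are all illustrative assumptions:

```python
import csv
import io
import sqlite3

# Illustrative in-memory database and table (assumptions, not the OP's schema).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE posts (id INTEGER, title TEXT)")

BATCH_SIZE = 1000
# Stand-in for a large CSV file on disk.
csv_data = io.StringIO("\n".join(f"{i},row {i}" for i in range(2500)))

pending = 0
for row in csv.reader(csv_data):
    conn.execute("INSERT INTO posts (id, title) VALUES (?, ?)", row)
    pending += 1
    if pending % BATCH_SIZE == 0:
        conn.commit()  # flush one batch instead of committing per row
conn.commit()  # commit the final partial batch

count = conn.execute("SELECT COUNT(*) FROM posts").fetchone()[0]
print(count)  # 2500
```

The speedup comes from paying the transaction-commit cost once per 1000 rows instead of once per row.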

  • Use prepared statements. Since you are basically doing the same inserts thousands of times, you can prepare each insert before you start looping, and then execute during the loop, passing values as parameters. I don't know how to do this with CodeIgniter's database library, you'll have to figure it out.
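The prepare-once/execute-many idea, again sketched with stdlib sqlite3 (in PHP the equivalent is `$pdo->prepare(...)` before the loop and `$stmt->execute(...)` inside it). Table and column names here are illustrative:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER, name TEXT)")

rows = [(i, f"user{i}") for i in range(1000)]

# executemany parses/compiles the INSERT statement once and then binds each
# tuple as parameters, avoiding a re-parse of the SQL text for every row.
conn.executemany("INSERT INTO users (id, name) VALUES (?, ?)", rows)
conn.commit()

total = conn.execute("SELECT COUNT(*) FROM users").fetchone()[0]
print(total)  # 1000
```

Parameter binding also sidesteps per-row string escaping, which both speeds things up and avoids injection bugs.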

  • Tune MySQL for import. Increase cache buffers and so on. See Speed of INSERT Statements for more information.
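As one hedged example of what "tune for import" can mean, these are session-level switches commonly flipped during a bulk load (re-enable them when the import finishes; whether each is safe depends on your data and replication setup):

```sql
-- Skip per-row constraint work during a trusted bulk load (assumption:
-- the CSV data is already consistent). Restore both to 1 afterwards.
SET unique_checks = 0;
SET foreign_key_checks = 0;
```

Server-side settings such as `innodb_buffer_pool_size` and `innodb_flush_log_at_trx_commit` are also worth reviewing, but sizing them is hardware-specific.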

  • Use LOAD DATA INFILE. If possible. It's literally 20x faster than using INSERT to load data row by row. I understand if you can't because you need to get the last insert id and so on. But in most cases, even if you read the CSV file, rearrange it and write it out to multiple temp CSV files, the data load is still faster than using INSERT.
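A hypothetical example of loading one of those pre-split temp CSV files straight into a table (the path, table, and column list are placeholders; `LOCAL` requires `local_infile` to be enabled on both client and server):

```sql
LOAD DATA LOCAL INFILE '/tmp/posts.csv'
INTO TABLE posts
FIELDS TERMINATED BY ',' OPTIONALLY ENCLOSED BY '"'
LINES TERMINATED BY '\n'
IGNORE 1 LINES            -- skip the CSV header row
(id, title, body);
```

The "split into multiple temp CSVs" step is how you target multiple tables: write one temp file per destination table, then run one LOAD DATA per file.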

  • Do it offline. Don't run long-running tasks during a web request. The time limit of a PHP request will terminate the job, if not today then next Tuesday when the job is 10% longer. Instead, make the web request queue the job, and then return control to the user. You should run the data import as a server process, and periodically allow the user to glimpse the rate of progress. For instance, a cheap way to do this is for your import script to output "." to a temp file, and then the user can request to view the temp file and keep reloading in their browser. If you want to get fancy, do something with Ajax.
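A minimal sketch of the "." progress-file idea, language-agnostic in spirit (the `import_rows` helper, batch size, and file name are all made up for illustration): the import loop appends one dot per completed batch, and a separate web request simply reads the file back.

```python
import os
import tempfile

progress_path = os.path.join(tempfile.gettempdir(), "import_progress.txt")

def import_rows(rows, batch_size=100):
    # Hypothetical import loop; the real work would be the INSERTs.
    with open(progress_path, "w") as progress:
        for i, _row in enumerate(rows, start=1):
            # ... insert the row here ...
            if i % batch_size == 0:
                progress.write(".")
                progress.flush()  # make the dot visible to readers right away

import_rows(range(1000))
with open(progress_path) as f:
    print(f.read())  # one dot per completed batch of 100
```

The page the user keeps reloading just serves the contents of `progress_path`; no shared state or polling framework is needed.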