且构网 - 分享程序员编程开发的那些事

Reading multiple files with threads/multiprocessing

Updated: 2023-11-10 16:08:16

So you mean there is no way to speed this up? My scenario is to read a bunch of files, then read each of their lines and store them in a database.

The first rule of optimization is to ask yourself if you should bother. If your program is run only once or a couple of times, optimizing it is a waste of time.

The second rule is that before you do anything else, you measure where the problem lies.

Write a simple program that sequentially reads files, splits them into lines, and stuffs those into a database. Run that program under a profiler to see where it spends most of its time.

Only then do you know which part of the program needs speeding up.

Here are some pointers nevertheless.

  • Speeding up the reading of files can be done using mmap.
  • You could use multiprocessing.Pool to spread the reading of multiple files over different cores. But the data from those files will then end up in different processes and must be sent back to the parent process using IPC. This has significant overhead for large amounts of data.
  • In the CPython implementation of Python, only one thread at a time can execute Python bytecode. While the actual reading from files isn't inhibited by that, processing the results is. So it is questionable whether threads would offer an improvement.
  • Stuffing the lines into a database will probably always be a major bottleneck, because that is where everything comes together. How much of a problem this is depends on the database: is it in-memory or on disk, does it allow multiple programs to update it simultaneously, et cetera?