How can I implement parallel gzip compression in Python?

Updated: 2023-11-10 08:31:34

I don't know of a pigz interface for Python offhand, but it might not be that hard to write if you really need it. Python's zlib module allows compressing arbitrary chunks of bytes, and the pigz man page already describes the scheme it uses to parallelize the compression and the output format it produces.
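
To illustrate the property that makes a chunked approach possible: each zlib.compress call is self-contained, so arbitrary chunks of bytes can be compressed independently of one another. The data here is made up:

```python
import zlib

# Compress an isolated chunk of bytes; no state is shared between
# calls, so different chunks can be handled by different workers.
chunk = b"some example bytes " * 1000
compressed = zlib.compress(chunk, 6)
assert zlib.decompress(compressed) == chunk
```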

If you really need parallel compression, it should be possible to implement a pigz equivalent using zlib to compress chunks, wrapped in multiprocessing.dummy.Pool.imap to parallelize the compression (multiprocessing.dummy is the thread-backed version of the multiprocessing API, so you wouldn't incur massive IPC costs sending chunks to and from the workers). Since zlib is one of the few built-in modules that releases the GIL during CPU-bound work, you might actually gain a benefit from thread-based parallelism.
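
Here's a minimal sketch of that idea, with one simplification: instead of reproducing pigz's single-member output format, each chunk is compressed as an independent gzip member and the members are concatenated. gunzip and zcat decode concatenated members as a single stream, so the result is valid gzip, but it is not byte-identical to pigz's output. The function name and file paths are illustrative:

```python
import gzip
from multiprocessing.dummy import Pool  # thread-backed multiprocessing API

CHUNK_SIZE = 128 * 1024  # 128 KiB, pigz's default block size

def read_chunks(fileobj, size=CHUNK_SIZE):
    """Yield successive fixed-size chunks from a binary file object."""
    while True:
        chunk = fileobj.read(size)
        if not chunk:
            return
        yield chunk

def parallel_gzip(in_path, out_path, level=6, workers=None):
    """Compress in_path to out_path as a multi-member gzip stream."""
    with open(in_path, "rb") as fin, \
         open(out_path, "wb") as fout, \
         Pool(workers) as pool:
        # imap preserves chunk order while the pool compresses several
        # chunks concurrently; zlib releases the GIL inside
        # gzip.compress, so the threads run in genuine parallel.
        for member in pool.imap(
                lambda chunk: gzip.compress(chunk, compresslevel=level),
                read_chunks(fin)):
            fout.write(member)

# Hypothetical usage:
# parallel_gzip("data.bin", "data.bin.gz", level=6)
```

Using imap rather than map keeps memory bounded: chunks are consumed and written out in order as they finish, instead of materializing the whole file's worth of compressed members at once.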

Note that in practice, when the compression level isn't turned up that high, I/O is often of similar cost (within an order of magnitude or so) to the actual zlib compression; if your data source can't actually feed the threads faster than they compress, you won't gain much from parallelizing.
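
A rough way to check whether your workload is I/O-bound before bothering with parallelism: compare read throughput against single-threaded compression throughput. "data.bin" is a hypothetical input file, and the read figure can be inflated by the OS page cache, so treat the numbers as a sanity check only:

```python
import time
import zlib

def mb_per_s(nbytes, seconds):
    return nbytes / seconds / 2**20

# Time a plain read of the source data.
with open("data.bin", "rb") as f:
    t0 = time.perf_counter()
    data = f.read()
    read_s = time.perf_counter() - t0

# Time single-threaded compression of the same bytes.
t0 = time.perf_counter()
zlib.compress(data, 6)  # moderate compression level
comp_s = time.perf_counter() - t0

print(f"read:     {mb_per_s(len(data), read_s):.1f} MB/s")
print(f"compress: {mb_per_s(len(data), comp_s):.1f} MB/s")
```

If the read rate isn't comfortably above the compression rate, extra compression workers will mostly sit idle waiting on input.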