且构网

Sharing the things programmers run into in development...

Default buffer size for a file on Linux

Updated: 2023-11-13 09:32:28


The documentation says of the default value for buffering: "If omitted, the system default is used." I am currently on Red Hat Linux 6, but I cannot figure out what default buffering is set for the system.

Can anyone please guide me on how to determine the buffering for a system?

Since you linked to the 2.7 docs, I'm assuming you're using 2.7. (In Python 3.x, this all gets a lot simpler, because a lot more of the buffering is exposed at the Python level.)
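
As a rough illustration of what Python 3 exposes, the sketch below reads io.DEFAULT_BUFFER_SIZE and the file's st_blksize, the two values the Python 3 open documentation names as inputs to its buffer-size heuristic. The helper name python3_buffer_hints is made up for this example, and the exact heuristic differs between Python versions, so treat the numbers as hints rather than as the actual buffer size.

```python
import io
import os
import tempfile


def python3_buffer_hints(path):
    """Return the values Python 3 consults when picking a default buffer size."""
    stat_result = os.stat(path)
    return {
        # Fallback size defined by the io module.
        "io_default": io.DEFAULT_BUFFER_SIZE,
        # Preferred I/O block size reported for the file; the attribute is
        # not available on every platform, hence the getattr fallback.
        "st_blksize": getattr(stat_result, "st_blksize", None),
    }


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print(python3_buffer_hints(tmp.name))
```

None of this applies to Python 2.7, where the buffer lives inside the C library and is not visible from Python.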

All open actually does (on POSIX systems) is call fopen, and then, if you've passed anything for buffering, setvbuf. Since you're not passing anything, you just end up with the default buffer from fopen, which is up to your C standard library. (See the source for details. With no buffering argument, it passes -1 to PyFile_SetBufSize, which does nothing unless bufsize >= 0.)
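
To see the effect of the buffering argument from Python, here is a small sketch. It is written for Python 3, where open is built on the io module rather than on fopen and setvbuf, so it only mirrors the idea: omit the argument to get the default, pass 0 for no buffer, or pass an explicit size. The helper name open_variants is hypothetical.

```python
import io
import tempfile


def open_variants(path, size=64 * 1024):
    """Open path three ways and report the type of object each call returns."""
    results = {}
    # No buffering argument: Python picks the default buffer.
    with open(path, "rb") as default_file:
        results["default"] = type(default_file).__name__
    # buffering=0: no buffer at all (only allowed in binary mode).
    with open(path, "rb", buffering=0) as raw_file:
        results["unbuffered"] = type(raw_file).__name__
    # buffering=N with N > 1: a buffer of roughly N bytes.
    with open(path, "rb", buffering=size) as sized_file:
        results["explicit"] = type(sized_file).__name__
    return results


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        print(open_variants(tmp.name))
```

The default and explicit cases both come back as buffered readers; only the requested size differs, which is why the default cannot be read off the object type.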

If you read the glibc setvbuf manpage, it explains that if you never call any of the buffering functions:

Normally all files are block buffered. When the first I/O operation occurs on a file, malloc(3) is called, and a buffer is obtained.

Note that it doesn't say what size buffer is obtained. This is intentional; it means the implementation can be smart and choose different buffer sizes for different cases. (There is a BUFSIZ constant, but that's only used when you call legacy functions like setbuf; it's not guaranteed to be used in any other case.)

So, what does happen? Well, if you look at the glibc source, it ultimately calls the macro _IO_DOALLOCATE, which can be hooked (or overridden, because glibc unifies C++ streambuf and C stdio buffering). In the end, though, it allocates a buffer of _IO_BUFSIZ, which is an alias for the platform-specific macro _G_BUFSIZ, which is 8192.

Of course you probably want to trace down the macros on your own system rather than trust the generic source.
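
One way to check your own system without reading headers is to ask the C library directly. The sketch below uses ctypes to fopen a file, perform one read so the buffer gets allocated, and then call __fbufsize, which glibc declares in <stdio_ext.h>. This is an assumption-heavy illustration: the function name stdio_default_bufsize is made up, __fbufsize is not part of standard C, and the helper returns None on systems whose libc cannot be located or does not export it. The number you get may or may not match the 8192 from the generic source.

```python
import ctypes
import ctypes.util
import os
import tempfile


def stdio_default_bufsize(path):
    """Return the buffer size libc chose for a default fopen() stream, or None.

    Assumes a libc that exports __fbufsize (declared in <stdio_ext.h> on glibc).
    """
    libc_name = ctypes.util.find_library("c")
    if libc_name is None:
        return None
    libc = ctypes.CDLL(libc_name)
    if not hasattr(libc, "__fbufsize"):
        return None

    # FILE* is an opaque pointer; declare it as void* so it is not truncated.
    libc.fopen.restype = ctypes.c_void_p
    libc.fopen.argtypes = [ctypes.c_char_p, ctypes.c_char_p]
    libc.fgetc.restype = ctypes.c_int
    libc.fgetc.argtypes = [ctypes.c_void_p]
    libc.fclose.restype = ctypes.c_int
    libc.fclose.argtypes = [ctypes.c_void_p]
    fbufsize = getattr(libc, "__fbufsize")
    fbufsize.restype = ctypes.c_size_t
    fbufsize.argtypes = [ctypes.c_void_p]

    stream = libc.fopen(os.fsencode(path), b"rb")
    if not stream:
        return None
    try:
        # The buffer is allocated lazily, on the first I/O operation.
        libc.fgetc(stream)
        return fbufsize(stream)
    finally:
        libc.fclose(stream)


if __name__ == "__main__":
    with tempfile.NamedTemporaryFile() as tmp:
        tmp.write(b"x" * 100)
        tmp.flush()
        print(stdio_default_bufsize(tmp.name))
```

Because this talks to the C library rather than to Python's io layer, it reports what a plain fopen stream gets, which is the case the Python 2.7 open relies on.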


You may wonder why there is no good documented way to get this information. Presumably it's because you're not supposed to care. If you need a specific buffer size, you set one manually; if you trust that the system knows best, just trust it. Unless you're actually working on the kernel or libc, who cares? In theory, this also leaves open the possibility that the system could do something smart here, like picking a bufsize based on the block size for the file's filesystem, or even based on running stats data, although it doesn't look like Linux/glibc, FreeBSD, or OS X do anything other than use a constant. And most likely that's because it really doesn't matter for most applications. (You might want to test that out yourself: use explicit buffer sizes ranging from 1KB to 2MB on some buffered-I/O-bound script and see what the performance differences are.)
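
The test suggested above could look something like the following sketch. It times reading one file in small chunks with explicit buffer sizes from 1KB to 2MB. The function name time_buffer_sizes, the 512-byte chunk size, and the 8MB scratch file are all arbitrary choices for illustration, and it is written for Python 3.

```python
import os
import tempfile
import time


def time_buffer_sizes(path, sizes, chunk=512):
    """Time reading path in small chunks with each explicit buffer size."""
    timings = {}
    for size in sizes:
        start = time.perf_counter()
        with open(path, "rb", buffering=size) as f:
            # Many small reads, so the buffer has a chance to matter.
            while f.read(chunk):
                pass
        timings[size] = time.perf_counter() - start
    return timings


if __name__ == "__main__":
    sizes = [1024 * 2 ** i for i in range(12)]  # 1 KB, 2 KB, ..., 2 MB
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        tmp.write(os.urandom(8 * 1024 * 1024))
    try:
        for size, seconds in time_buffer_sizes(tmp.name, sizes).items():
            print(f"{size:>8} bytes: {seconds:.4f} s")
    finally:
        os.remove(tmp.name)
```

A freshly written scratch file will usually sit in the page cache, so these timings mostly reflect per-call overhead rather than disk speed; run it against your real workload before drawing conclusions.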