且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

为什么矩阵乘法与numpy的速度比使用Python ctypes的?

更新时间:2022-01-15 15:15:28

我不是太熟悉numpy的,但来源Github上。的点积一部分https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src,这我假设被转化为具体的C实现每个数据类型。例如:

I'm not too familiar with Numpy, but the source is on Github. Part of dot products are implemented in https://github.com/numpy/numpy/blob/master/numpy/core/src/multiarray/arraytypes.c.src, which I'm assuming is translated into specific C implementations for each datatype. For example:

/**begin repeat
 *
 * #name = BYTE, UBYTE, SHORT, USHORT, INT, UINT,
 * LONG, ULONG, LONGLONG, ULONGLONG,
 * FLOAT, DOUBLE, LONGDOUBLE,
 * DATETIME, TIMEDELTA#
 * #type = npy_byte, npy_ubyte, npy_short, npy_ushort, npy_int, npy_uint,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 * #out = npy_long, npy_ulong, npy_long, npy_ulong, npy_long, npy_ulong,
 * npy_long, npy_ulong, npy_longlong, npy_ulonglong,
 * npy_float, npy_double, npy_longdouble,
 * npy_datetime, npy_timedelta#
 */
static void
@name@_dot(char *ip1, npy_intp is1, char *ip2, npy_intp is2, char *op, npy_intp n,
           void *NPY_UNUSED(ignore))
{
    @out@ tmp = (@out@)0;
    npy_intp i;

    for (i = 0; i < n; i++, ip1 += is1, ip2 += is2) {
        tmp += (@out@)(*((@type@ *)ip1)) *
               (@out@)(*((@type@ *)ip2));
    }
    *((@type@ *)op) = (@type@) tmp;
}
/**end repeat**/

此出现来计算一维点的产品,即在载体中。在我Github上浏览几分钟,我无法找到矩阵源,但它是可能的,它使用一个调用 FLOAT_dot 在结果矩阵中的每个元素。这意味着,在此功能的环对应于你内心最循环。

This appears to compute one-dimensional dot products, i.e. on vectors. In my few minutes of Github browsing I was unable to find the source for matrices, but it's possible that it uses one call to FLOAT_dot for each element in the result matrix. That means the loop in this function corresponds to your inner-most loop.

它们之间的一个区别是,步幅 - 中的输入连续元素之间的差 - 在调用之前明确计算一次。在万一没有步幅,并且每个输入的偏移被每次计算,例如 A [I * N + K] 。我本来期望编译好于客场类似numpy的步幅优化的东西,但也许不能证明步骤是一个常数(或它没有被优化)。

One difference between them is that the "stride" -- the difference between successive elements in the inputs -- is explicitly computed once before calling the function. In your case there is no stride, and the offset of each input is computed each time, e.g. a[i * n + k]. I would have expected a good compiler to optimise that away to something similar to the Numpy stride, but perhaps it can't prove that the step is a constant (or it's not being optimised).

numpy的也可能会做一些聪明与上级code调用此函数缓存的影响。一种常见的伎俩是考虑每一行是连续的,或每列 - 并尝试第一次遍历每个连续的一部分。似乎很难是完全最优的,对于每个点积,一个输入矩阵必须由行遍历且另一个通过列(除非它们碰巧被存储在不同的主要顺序)。但它至少可以做到这一点的结果元素。

Numpy may also be doing something smart with cache effects in the higher-level code that calls this function. A common trick is to think about whether each row is contiguous, or each column -- and try to iterate over each contiguous part first. It seems difficult to be perfectly optimal, for each dot product, one input matrix must be traversed by rows and the other by columns (unless they happened to be stored in different major order). But it can at least do that for the result elements.

numpy的还包含code键选择特定的操作,包括点,从不同的基本实现方式的实施。例如它可以使用一个 BLAS 的库。从上面的讨论听起来好像是用来CBLAS。这是自Fortran语言翻译成C.我想在你的测试中使用的将是一个在这里找到实现:http://www.netlib.org/clapack/cblas/sdot.c.

Numpy also contains code to choose the implementation of certain operations, including "dot", from different basic implementations. For instance it can use a BLAS library. From discussion above it sounds like CBLAS is used. This was translated from Fortran into C. I think the implementation used in your test would be the one found in here: http://www.netlib.org/clapack/cblas/sdot.c.

请注意,这个程序是为另一台机器读取写入由机器。但是你可以在底部看到,它的使用展开循环来一次处理5个元素:

Note that this program was written by a machine for another machine to read. But you can see at the bottom that it's using an unrolled loop to process 5 elements at a time:

for (i = mp1; i <= *n; i += 5) {
stemp = stemp + SX(i) * SY(i) + SX(i + 1) * SY(i + 1) + SX(i + 2) * 
    SY(i + 2) + SX(i + 3) * SY(i + 3) + SX(i + 4) * SY(i + 4);
}

此展开的因素可能是一些分析后,都被接走。但它的一个理论优势是,更多的算术运算的每个分支点之间完成,编译器和CPU对如何以***方式安排它们得到尽可能多的指令流水线尽可能更多的选择。

This unrolling factor is likely to have been picked after profiling several. But one theoretical advantage of it is that more arithmetical operations are done between each branch point, and the compiler and CPU have more choice about how to optimally schedule them to get as much instruction pipelining as possible.