且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

不同优化下的 Fortran 矩阵乘法性能

更新时间:2021-11-19 04:06:09

在遍历数组元素时需要注意以下几点:

There are several things one should be aware of when looping over array elements:

  • 确保内循环在内存中的连续元素上.在您当前的循环"算法中,内部循环超过索引 k.由于矩阵在内存中作为列布局(第一个索引在通过内存时变化最快),访问新的 k 值可能需要将新页面加载到缓存中.在这种情况下,您可以通过将循环重新排序来优化您的算法:

  • Make sure the inner loop is over consecutive elements in memory. In your current 'loop' algorithm, the inner loop is over index k. Since matrices are laid out in memory as columns (first index varying most rapidly when going through the memory), accessing a new value of k might need to load a new page into cache. In this case, you could optimize your algorithm by reordering the loops as:

do j=1,n
    do k=1,size(mtx1,dim=2)
        do i=1,m
            mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
        end do      
    end do
end do

现在,内部循环在内存中的连续元素上(mtx2(k,j) 值可能只会在编译器内部循环之前获取一次,如果没有,您可以存储它在循环之前的一个临时变量中)

now, the inner loop is over consecutive elements in memory (the mtx2(k,j) value will be probably be fetched only once before the inner loop by the compiler, if not you can store it in a temporary variable before the loop)

确保整个循环都可以放入缓存中,以避免过多的缓存未命中.这可以通过阻塞算法来完成.在这种情况下,解决方案可能是例如:

Make sure the entire loops can fit into the cache in order to avoid too much cache misses. This can be done by blocking the algorithm. In this case, a solution could be e.g.:

l=size(mtx1,dim=2)
ichunk=512 ! I have a 3MB cache size (real*4)
do jj=1,n,ichunk
    do kk=1,l,ichunk

do j=jj,min(jj+ichunk-1,n)
    do k=kk,min(kk+ichunk-1,l)
        do i=1,m
            mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
        end do      
    end do
end do

    end do
end do

在这种情况下,性能将取决于 ichunk 的大小,尤其是对于足够大的矩阵(您甚至可以阻塞内部循环,这只是一个示例).

in which case performance will depend in the size of ichunk, especially for large enough matrices (you could even block the inner loop, this is just an example).

确保执行循环所需的工作远小于循环内部的工作.这可以通过循环展开"来解决,即在循环的一次迭代中组合多个语句.通常编译器可以通过提供标志 -funroll-loops 来做到这一点.

Make sure the work needed to perform the loop is much smaller than the work inside the loop. This can be solved by 'loop unrolling', i.e. combining several statements in one iteration of the loop. Usually the compiler can do this by supplying the flag -funroll-loops.

如果我使用上面的代码并使用标志 -O3 -funroll-loops 进行编译,我得到的性能比使用 matmul 稍好.

If I use the above code and compile with the flags -O3 -funroll-loops, I get a slightly better performance than with matmul.

这三点中要记住的重要一点是关于循环排序的第一点,因为这会影响其他用例中的性能,而编译器通常无法修复它.循环展开,您可以留给编译器(但对其进行测试,因为这并不总能提高性能).至于第二点,由于这取决于硬件,因此您不应该(通常)尝试自己实现一个非常有效的矩阵乘法,而是考虑使用诸如 e.g. 之类的库.atlas,它可以优化缓存大小,或者 MKL 或 ACML 等供应商库.

The important thing to remember of those three is the first point about loop ordering, since this is something that will affect performance in other use cases, and the compiler cannot usually fix that. The loop unrolling, you can leave to the compiler (but test it, as this does not always increase performance). As for the second point, since this is dependent on the hardware, you shouldn't (generally) try to implement a very efficient matrix multiplication yourself and instead consider using a library such as e.g. atlas, which can optimize for cache size, or a vendor library such as MKL or ACML.