Fortran matrix multiplication performance under different optimizations



I'm reading the book "Scientific Software Development with Fortran", and there is an exercise in it that I find very interesting:

"Create a Fortran module called MatrixMultiplyModule. Add three subroutines to it called LoopMatrixMultiply, IntrinsicMatrixMultiply, and MixMatrixMultiply. Each routine should take two real matrices as argument, perform a matrix multiplication, and return the result via a third argument. LoopMatrixMultiply should be written entirely with do loops, and no array operations or intrinsic procedures; IntrinsicMatrixMultiply should be written utilizing the matmul intrinsic function; and MixMatrixMultiply should be written using some do loops and the intrinsic function dot_product. Write a small program to test the performance of these three different ways of performing the matrix multiplication for different sizes of matrices."

I ran some tests multiplying two rank-2 matrices, and here are the results under different optimization flags:

compiler: ifort version 13.0.0 on Mac

Here is my question:

Why do they all have about the same performance under -O0, but matmul gets a huge performance boost with -O3, while the explicit loops and dot_product get a smaller boost? Also, why does dot_product seem to have the same performance as the explicit do loops?

The code I use is the following:

module MatrixMultiplyModule

contains
    subroutine LoopMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j,k
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array sizes do not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=0.

        do i=1,m
            do j=1,n
                do k=1,size(mtx1,dim=2)
                    mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
                end do      
            end do
        end do
    end subroutine

    subroutine IntrinsicMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array sizes do not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))
        mtx3=matmul(mtx1,mtx2)

    end subroutine

    subroutine MixMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n
        integer                         ::  i,j
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array sizes do not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        allocate(mtx3(m,n))

        do i=1,m
            do j=1,n
                    mtx3(i,j)=dot_product(mtx1(i,:),mtx2(:,j))
            end do
        end do

    end subroutine

end module




program main
use MatrixMultiplyModule
implicit none

real,allocatable        ::  a(:,:),b(:,:)
real,allocatable    ::  c1(:,:),c2(:,:),c3(:,:)
integer ::  n
integer ::  count, rate
real    ::  timeAtStart, timeAtEnd
real    ::  time(3,10)
do n=100,1000,100
    allocate(a(n,n),b(n,n))

    call random_number(a)
    call random_number(b)

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call LoopMatrixMultiply(a,b,c1)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(1,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call IntrinsicMatrixMultiply(a,b,c2)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(2,n/100)=timeAtEnd-timeAtStart

    call system_clock(count = count, count_rate = rate)
    timeAtStart = count / real(rate)
    call MixMatrixMultiply(a,b,c3)
    call system_clock(count = count, count_rate = rate)
    timeAtEnd = count / real(rate)
    time(3,n/100)=timeAtEnd-timeAtStart


    deallocate(a,b)

end do

open(1,file="time.txt")
do n=1,10
    write(1,*) time(:,n)
end do
close(1)
deallocate(c1,c2,c3)
end program

There are several things one should be aware of when looping over array elements:

  • Make sure the inner loop runs over consecutive elements in memory. In your current 'loop' algorithm, the inner loop is over the index k. Since Fortran matrices are laid out in memory column by column (the first index varies most rapidly as you move through memory), each new value of k may require loading a new cache line (or even a new page) into the cache. In this case, you can optimize the algorithm by reordering the loops as:

    do j=1,n
        do k=1,size(mtx1,dim=2)
            do i=1,m
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
            end do      
        end do
    end do
    

    now the inner loop runs over consecutive elements in memory (the compiler will probably fetch the mtx2(k,j) value only once before the inner loop; if not, you can store it in a temporary variable before the loop, as in the sketch after this list)

  • Make sure the data the loops work on fits into the cache, to avoid too many cache misses. This can be done by blocking the algorithm. In this case, a solution could be, e.g.:

    l=size(mtx1,dim=2)
    ichunk=512 ! I have a 3 MB cache; one 512x512 block of real*4 is 1 MB
    do jj=1,n,ichunk
        do kk=1,l,ichunk
    
    do j=jj,min(jj+ichunk-1,n)
        do k=kk,min(kk+ichunk-1,l)
            do i=1,m
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*mtx2(k,j)
            end do      
        end do
    end do
    
        end do
    end do
    

    in which case performance will depend on the size of ichunk, especially for large enough matrices (you could even block the inner loop as well; this is just an example).

  • Make sure the work needed to manage the loop itself is much smaller than the work done inside the loop. This can be solved by 'loop unrolling', i.e. combining several statements into one iteration of the loop (see the hand-unrolled sketch after this list). Usually the compiler can do this for you when you supply the flag -funroll-loops.
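
As a minimal sketch of the first point, here is the reordered loop with the mtx2(k,j) hoisting done by hand; the temporary tmp is my addition and would be declared alongside the loop indices as real :: tmp:

    do j=1,n
        do k=1,size(mtx1,dim=2)
            tmp=mtx2(k,j)     ! constant over the inner loop, fetched once
            do i=1,m          ! unit stride down the columns of mtx1 and mtx3
                mtx3(i,j)=mtx3(i,j)+mtx1(i,k)*tmp
            end do
        end do
    end do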
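
And a sketch of what the manual unrolling from the third point could look like, with the inner loop unrolled by a factor of 4 (assuming, for brevity, that m is a multiple of 4; a general version needs a cleanup loop for the remainder, and tmp is the same hand-hoisted temporary as above):

    do j=1,n
        do k=1,size(mtx1,dim=2)
            tmp=mtx2(k,j)
            do i=1,m,4        ! four updates per iteration: less loop overhead
                mtx3(i  ,j)=mtx3(i  ,j)+mtx1(i  ,k)*tmp
                mtx3(i+1,j)=mtx3(i+1,j)+mtx1(i+1,k)*tmp
                mtx3(i+2,j)=mtx3(i+2,j)+mtx1(i+2,k)*tmp
                mtx3(i+3,j)=mtx3(i+3,j)+mtx1(i+3,k)*tmp
            end do
        end do
    end do

This is essentially the transformation -funroll-loops automates, which is why it is usually better left to the compiler.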

If I use the above code and compile with the flags -O3 -funroll-loops, I get slightly better performance than with matmul.

Of these three points, the most important one to remember is the first, about loop ordering, since it affects performance in other use cases as well and the compiler usually cannot fix it for you. Loop unrolling you can leave to the compiler (but test it, as it does not always improve performance). As for the second point, since it depends on the hardware, you shouldn't (generally) try to implement a very efficient matrix multiplication yourself; instead, consider using a library that can optimize for cache size, such as ATLAS, or a vendor library such as MKL or ACML.
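
For illustration, a minimal sketch of the library route using the standard BLAS routine sgemm, which any optimized BLAS (ATLAS, MKL, ACML, or a reference BLAS) provides; the exact link flag, e.g. -lblas, depends on your installation:

    ! computes mtx3 = mtx1*mtx2 via BLAS;
    ! sgemm does C := alpha*A*B + beta*C in single precision
    subroutine BlasMatrixMultiply(mtx1,mtx2,mtx3)
        real,intent(in)                 ::  mtx1(:,:),mtx2(:,:)
        real,intent(out),allocatable    ::  mtx3(:,:)
        integer                         ::  m,n,l
        if(size(mtx1,dim=2) /= size(mtx2,dim=1)) stop "input array sizes do not match"
        m=size(mtx1,dim=1)
        n=size(mtx2,dim=2)
        l=size(mtx1,dim=2)
        allocate(mtx3(m,n))
        call sgemm('N','N',m,n,l,1.0,mtx1,m,mtx2,l,0.0,mtx3,m)
    end subroutine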