OpenMP: parallelizing matrix multiplication with a triple for loop (performance issue)

Updated: 2022-06-08 21:35:47

Try writing to the result matrix less often. Updating C inside the innermost loop induces cache-line sharing between threads and prevents the operation from running in parallel. Accumulating into a local variable instead will allow most of the writes to take place in each core's registers or L1 cache.

Also, using restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B, so it has to assume the inputs may need to be reloaded after every store.
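As a rough illustration of the restrict advice, here is a minimal serial sketch. The function name matmul_restrict and the parameter n are made up for this example, and it assumes n x n row-major matrices that do not overlap in memory (passing overlapping buffers to restrict-qualified parameters is undefined behavior).

```c
/* Hypothetical helper, not from the original question:
   C = A * B for n x n row-major matrices.
   restrict promises the compiler that A, B and C never alias. */
void matmul_restrict(const double *restrict A,
                     const double *restrict B,
                     double *restrict C,
                     int n)
{
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            double sum = 0.0;                 /* local accumulator */
            for (int k = 0; k < n; k++) {
                sum += A[i * n + k] * B[k * n + j];
            }
            C[i * n + j] = sum;               /* single store per element */
        }
    }
}
```

Whether this actually changes the generated code depends on the compiler and optimization level, so it is worth measuring rather than assuming.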

Try:

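for (int i = 0; i < Nu; i++) {
  const double* const Arow = A + i*Nu;   // row i of A
  double* const Crow = C + i*Nu;         // row i of C
  // j is private automatically as the parallel loop index; k is declared
  // inside the loop so that each thread gets its own copy.
#pragma omp parallel for
  for (int j = 0; j < Nu; j++) {
    // Bcol[k] is B(k,j) only if column j of B is contiguous at B + j*Nu
    // (B stored transposed); for row-major B use B[k*Nu + j] instead.
    const double* const Bcol = B + j*Nu;
    double sum = 0.0;                    // thread-local accumulator
    for (int k = 0; k < Nu; k++) {
      sum += Arow[k] * Bcol[k];          // C(i,j) = sum over k of A(i,k)*B(k,j)
    }
    Crow[j] = sum;                       // one write to C per element
  }
}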
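One possible variation on the snippet above, offered as a sketch rather than a measured improvement: put the pragma on the outer loop so the thread team is started once instead of once per row. The function name matmul_bt and the parameter Bt are hypothetical; Bt is assumed to hold B transposed, so that column j of B is contiguous at Bt + j*n, matching the Bcol indexing above. Built without OpenMP support, the pragma is ignored and the code runs serially with the same result.

```c
#include <stddef.h>

/* Hypothetical helper: C = A * B, where Bt is B stored transposed.
   All matrices are n x n, row-major, and must not overlap. */
void matmul_bt(const double *restrict A,
               const double *restrict Bt,
               double *restrict C,
               int n)
{
    /* Each thread handles whole rows of C, so no two threads write
       the same element. */
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        const double *Arow = A + (size_t)i * n;
        double *Crow = C + (size_t)i * n;
        for (int j = 0; j < n; j++) {
            const double *Bcol = Bt + (size_t)j * n;  /* column j of B */
            double sum = 0.0;                         /* private to this iteration */
            for (int k = 0; k < n; k++) {
                sum += Arow[k] * Bcol[k];
            }
            Crow[j] = sum;
        }
    }
}
```

With GCC or Clang this would typically be compiled with -fopenmp. Threads working on adjacent rows can still share a cache line at row boundaries, but far less often than when every thread updates C inside the innermost loop.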

Also, I think Elalfer is right that you need a reduction if you parallelize the innermost loop, since every thread would otherwise be updating the same sum.
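To make the reduction point concrete, here is a small sketch of the innermost loop alone, parallelized with OpenMP's reduction clause. The function name dot_reduction is invented for this example. Each thread accumulates into its own private copy of sum, and the copies are added together when the loop finishes.

```c
/* Hypothetical helper: dot product of two length-n vectors.
   Without reduction(+:sum), all threads would race on the shared sum. */
double dot_reduction(const double *a, const double *b, int n)
{
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++) {
        sum += a[k] * b[k];
    }
    return sum;
}
```

For a matrix product this is usually the least attractive loop to parallelize, because the per-element work is small compared with the cost of starting the threads and combining the partial sums. The order of the floating-point additions can also differ from the serial loop, so results may differ in the last bits.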