Updated: 2022-06-08 21:35:47
Try writing to the result less often. Frequent writes to it induce cache line sharing between cores and prevent the operation from running in parallel. Accumulating into a local variable instead allows most of the writes to take place in each core's L1 cache.
Also, using restrict may help. Otherwise the compiler can't guarantee that writes to C aren't changing A and B.
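As a rough illustration of the restrict point, here is a minimal serial sketch. The function name matmul_restrict and the parameter Bt are made up for this example; it assumes square n x n matrices in row-major order, with Bt holding B transposed so that each column of B is contiguous.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: C = A * B for n x n matrices.
   A and C are row-major; Bt is assumed to hold B transposed,
   so Bt + j*n points at column j of B.
   restrict promises the compiler that the three arrays do not overlap,
   so a store to C cannot change anything read through A or Bt. */
void matmul_restrict(size_t n,
                     const double *restrict A,
                     const double *restrict Bt,
                     double *restrict C)
{
    /* Passing the same buffer as input and output would break the promise. */
    assert(A != C && Bt != C);

    for (size_t i = 0; i < n; i++) {
        for (size_t j = 0; j < n; j++) {
            double sum = 0.0;                 /* local accumulator */
            for (size_t k = 0; k < n; k++)
                sum += A[i*n + k] * Bt[j*n + k];
            C[i*n + j] = sum;                 /* one store per element of C */
        }
    }
}
```

Whether this actually speeds anything up depends on the compiler and flags; restrict only removes one obstacle to keeping values in registers and vectorizing.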
Try:
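for (int i = 0; i < Nu; i++) {
    const double* const Arow = A + i*Nu;   /* row i of A */
    double* const Crow = C + i*Nu;         /* row i of C */
    #pragma omp parallel for
    for (int j = 0; j < Nu; j++) {
        /* Assumes B is laid out so that column j is contiguous (i.e. stored transposed). */
        const double* const Bcol = B + j*Nu;
        double sum = 0.0;                  /* declared inside the loop, so private to each thread */
        /* k is declared here so each thread gets its own copy; a k declared
           outside the parallel loop would be shared and cause a data race. */
        for (int k = 0; k < Nu; k++) {
            sum += Arow[k] * Bcol[k];      /* C(i,j) = sum over k of A(i,k)*B(k,j) */
        }
        Crow[j] = sum;                     /* single write to C per element */
    }
}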
Also, I think Elalfer is right that you need a reduction if you parallelize the innermost loop.
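For reference, parallelizing the innermost loop with a reduction might look roughly like the sketch below. The helper name dot_reduction is made up for this example and only covers the inner dot product for a single (i,j) pair; it is not taken from the question.

```c
#include <assert.h>

/* Hypothetical helper: dot product of two length-n vectors,
   with the loop over k split across threads. */
double dot_reduction(int n, const double *a, const double *b)
{
    assert(n >= 0);
    double sum = 0.0;
    /* reduction(+:sum) gives each thread a private copy of sum and adds
       the partial sums together when the loop ends. Without the clause,
       all threads would update the same variable and race. */
    #pragma omp parallel for reduction(+:sum)
    for (int k = 0; k < n; k++)
        sum += a[k] * b[k];
    return sum;
}
```

Compile with OpenMP enabled (for example -fopenmp with GCC) for the pragma to take effect; without it the loop simply runs serially. For small Nu, the per-loop threading overhead of this variant will likely outweigh any gain, which is one reason to parallelize an outer loop instead.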