CUDA: Sum of data on a global memory variable

Updated: 2022-05-28 23:57:50

To compute a final sum out of partial results of your blocks, I would suggest doing it the following way:

  • Let each block write its partial result into a separate cell of an array of size gridDim.x.
  • Copy the array to the host.
  • Perform the final sum on the host.
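The three steps above can be sketched roughly as follows. This is only a minimal illustration, not the asker's kernel: the names blockPartialSums and sumOnDevice are made up for the example, the input is assumed to be a plain float array, the block size is assumed to be a power of two, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block sums its own share of `in` in shared memory and writes
// exactly one partial result to partial[blockIdx.x].
__global__ void blockPartialSums(const float* in, float* partial, int n)
{
    extern __shared__ float cache[];   // blockDim.x floats, sized at launch
    int tid = threadIdx.x;

    // Grid-stride loop: every element is read by exactly one thread.
    float local = 0.0f;
    for (int i = blockIdx.x * blockDim.x + tid; i < n; i += gridDim.x * blockDim.x)
        local += in[i];
    cache[tid] = local;
    __syncthreads();

    // Tree reduction within the block (assumes blockDim.x is a power of two).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (tid < stride)
            cache[tid] += cache[tid + stride];
        __syncthreads();
    }
    if (tid == 0)
        partial[blockIdx.x] = cache[0];
}

// Hypothetical host wrapper: launch, copy the gridDim.x partial sums
// back, and finish the sum on the CPU.
float sumOnDevice(const std::vector<float>& data, int blocks = 64, int threads = 256)
{
    int n = static_cast<int>(data.size());
    float *dIn = nullptr, *dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockPartialSums<<<blocks, threads, threads * sizeof(float)>>>(dIn, dPartial, n);

    // This copy waits for the kernel to finish on the default stream.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    float total = 0.0f;
    for (float p : partial)
        total += p;                    // final sum on the host
    return total;
}
```

Calling sumOnDevice(values) on a std::vector<float> returns the total. The host loop only touches gridDim.x values, so its cost is negligible next to the per-block work.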

I assume each block has a lot to compute on its own, which would warrant the usage of CUDA in the first place.

As it stands, I think there may be something wrong in your kernel. It seems to me that every block sums all of the data and returns a final result as if it were a partial result.

The loop you presented does not really make sense. For each block there is only one i which will do something. The code will be equivalent to simply writing:

currentErrors[threadIdx.x]=0;
currentErrors[threadIdx.x]+=globalError(mynet,myoutput);

save for some unpredictable scheduling differences.

Remember that blocks are not executed in sync. Each block can run before, during or after any other block.
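One consequence is that any update to a single shared global value has to be correct for every possible block ordering. As an alternative that is not part of the suggestion above, a sketch using atomicAdd on one global accumulator might look like this. The names accumulateAtomic and sumWithAtomics are hypothetical, the per-thread value is simply in[i] rather than the asker's globalError, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread adds its value to one global accumulator. atomicAdd keeps
// the update correct whatever order the blocks happen to run in.
__global__ void accumulateAtomic(const float* in, float* total, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        atomicAdd(total, in[i]);
}

// Hypothetical host wrapper around the kernel above.
float sumWithAtomics(const std::vector<float>& data, int threads = 256)
{
    int n = static_cast<int>(data.size());
    int blocks = (n + threads - 1) / threads;

    float *dIn = nullptr, *dTotal = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dTotal, sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(float));   // accumulator starts at 0.0f

    if (blocks > 0)
        accumulateAtomic<<<blocks, threads>>>(dIn, dTotal, n);

    float total = 0.0f;
    cudaMemcpy(&total, dTotal, sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTotal);
    return total;
}
```

All threads contend for the same address here, so this is usually slower than per-block partial sums. With floating-point data the rounding of the result can also vary from run to run, because the order of the additions is not fixed.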

Also:

  • You may be interested in parallel prefix sum algorithm.
  • You may want to check an efficient CUDA implementation of the prefix sum.
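To give an idea of what a prefix sum looks like in CUDA, here is a small single-block sketch of an inclusive scan (the simple Hillis-Steele scheme). It is not the efficient implementation referred to above: it does more additions than a work-efficient scan and only handles inputs that fit in one block. The names inclusiveScanBlock and scanOnDevice are made up for the example, and error checking is omitted.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Inclusive prefix sum of up to blockDim.x ints inside a single block.
// Two shared-memory buffers are used alternately so that reads and
// writes within one step never touch the same buffer.
__global__ void inclusiveScanBlock(const int* in, int* out, int n)
{
    extern __shared__ int buf[];       // 2 * blockDim.x ints, sized at launch
    int tid = threadIdx.x;
    int width = blockDim.x;
    int src = 0, dst = 1;

    buf[tid] = (tid < n) ? in[tid] : 0;
    __syncthreads();

    for (int offset = 1; offset < width; offset *= 2) {
        int v = buf[src * width + tid];
        if (tid >= offset)
            v += buf[src * width + tid - offset];
        buf[dst * width + tid] = v;
        __syncthreads();
        src ^= 1;                      // swap the roles of the two buffers
        dst ^= 1;
    }
    if (tid < n)
        out[tid] = buf[src * width + tid];
}

// Hypothetical host wrapper; assumes data.size() <= threads.
std::vector<int> scanOnDevice(const std::vector<int>& data, int threads = 256)
{
    int n = static_cast<int>(data.size());
    int *dIn = nullptr, *dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    inclusiveScanBlock<<<1, threads, 2 * threads * sizeof(int)>>>(dIn, dOut, n);

    std::vector<int> out(n);
    cudaMemcpy(out.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return out;
}
```

The last element of the scan output is the total sum, which is how a prefix sum relates to the reduction discussed here.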