CUDA: summing data into a global memory variable

Updated: 2022-05-28 23:57:50

To compute a final sum out of partial results of your blocks, I would suggest doing it the following way:

Let each block write its partial result into a separate cell of an array of size gridDim.x.
Copy the array to the host.
Perform the final sum on the host.
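
The three steps above could look roughly like the sketch below. It is only an illustration under assumptions of mine: the kernel name `blockPartialSums`, the host helper `sumWithBlockPartials`, the block size of 256 and the `float` element type are all hypothetical, and error checking of the CUDA calls is left out for brevity.

```cuda
#include <cuda_runtime.h>
#include <numeric>
#include <vector>

constexpr int kThreads = 256;  // assumed block size; must be a power of two

// Each block reduces its own slice of `in` and writes ONE partial sum
// into partial[blockIdx.x]. No block touches another block's cell.
__global__ void blockPartialSums(const float* in, float* partial, int n) {
    __shared__ float cache[kThreads];
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    cache[threadIdx.x] = (i < n) ? in[i] : 0.0f;
    __syncthreads();

    // Tree reduction in shared memory (threads of one block only).
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (threadIdx.x < stride)
            cache[threadIdx.x] += cache[threadIdx.x + stride];
        __syncthreads();
    }
    if (threadIdx.x == 0) partial[blockIdx.x] = cache[0];
}

// Hypothetical host helper: launches the kernel, copies the gridDim.x
// partial results back, and finishes the sum on the host.
float sumWithBlockPartials(const std::vector<float>& data) {
    int n = static_cast<int>(data.size());
    if (n == 0) return 0.0f;
    int blocks = (n + kThreads - 1) / kThreads;

    float* dIn = nullptr;
    float* dPartial = nullptr;
    cudaMalloc(&dIn, n * sizeof(float));
    cudaMalloc(&dPartial, blocks * sizeof(float));
    cudaMemcpy(dIn, data.data(), n * sizeof(float), cudaMemcpyHostToDevice);

    blockPartialSums<<<blocks, kThreads>>>(dIn, dPartial, n);

    // Step 2: copy the array of partial sums to the host.
    // This blocking copy also waits for the kernel to finish.
    std::vector<float> partial(blocks);
    cudaMemcpy(partial.data(), dPartial, blocks * sizeof(float),
               cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dPartial);

    // Step 3: final sum on the host.
    return std::accumulate(partial.begin(), partial.end(), 0.0f);
}
```

Calling `sumWithBlockPartials` on a `std::vector<float>` returns the total. The host-side loop only runs over gridDim.x values, so it stays cheap as long as each block has done a substantial amount of work.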

I assume each block has a lot to compute on its own, which would warrant the usage of CUDA in the first place.

As your code stands, I think there may be something wrong in your kernel. It seems to me that every block is summing all the data and returning a final result as if it were a partial result.

The loop you presented does not really make sense. For each block there is only one i which will do something. The code will be equivalent to simply writing:

currentErrors[threadIdx.x]=0;
currentErrors[threadIdx.x]+=globalError(mynet,myoutput);

save for some unpredictable scheduling differences.

Remember that blocks are not executed in sync. Each block can run before, during or after any other block.
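
Because block order is not defined, any value that several blocks update in global memory has to be updated in an order-independent way. One option, which is my own addition and not something your code does, is `atomicAdd` on a single global accumulator. A minimal sketch, assuming integer data and the hypothetical names `atomicSum` and `sumWithAtomics`, with error checking omitted:

```cuda
#include <cuda_runtime.h>
#include <vector>

// Every thread adds its element into one global accumulator.
// atomicAdd makes each read-modify-write indivisible, so the result
// does not depend on which block or thread runs first.
__global__ void atomicSum(const int* in, int* total, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) atomicAdd(total, in[i]);
}

// Hypothetical host helper wrapping allocation, launch and copy-back.
int sumWithAtomics(const std::vector<int>& data) {
    int n = static_cast<int>(data.size());
    if (n == 0) return 0;
    const int threads = 256;  // assumed block size
    int blocks = (n + threads - 1) / threads;

    int* dIn = nullptr;
    int* dTotal = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dTotal, sizeof(int));
    cudaMemcpy(dIn, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);
    cudaMemset(dTotal, 0, sizeof(int));  // accumulator must start at zero

    atomicSum<<<blocks, threads>>>(dIn, dTotal, n);

    int total = 0;
    cudaMemcpy(&total, dTotal, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dTotal);
    return total;
}
```

This is simple but serializes the additions on one address, so for large inputs the per-block partial results approach is usually the better starting point.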

Also:

You may be interested in parallel prefix sum algorithm.
You may want to check an efficient CUDA implementation of the prefix sum.
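
To give an idea of what a parallel prefix sum looks like, here is a minimal sketch of an inclusive scan inside a single block, using the simple Hillis-Steele scheme. It is not the efficient implementation referred to above: it handles at most one block's worth of elements, and the names `inclusiveScanBlock` and `inclusiveScan`, the block size of 256 and the `int` element type are my assumptions.

```cuda
#include <cuda_runtime.h>
#include <vector>

constexpr int kScanThreads = 256;  // assumed block size and max input length

// Inclusive prefix sum of up to kScanThreads elements in ONE block.
// Two shared buffers are alternated so that a step never overwrites
// values that other threads are still reading.
__global__ void inclusiveScanBlock(const int* in, int* out, int n) {
    __shared__ int buf[2][kScanThreads];
    int t = threadIdx.x;
    int cur = 0;
    buf[cur][t] = (t < n) ? in[t] : 0;
    __syncthreads();

    for (int offset = 1; offset < blockDim.x; offset *= 2) {
        int next = 1 - cur;
        // Each element adds the value `offset` positions to its left.
        buf[next][t] = (t >= offset) ? buf[cur][t] + buf[cur][t - offset]
                                     : buf[cur][t];
        cur = next;
        __syncthreads();
    }
    if (t < n) out[t] = buf[cur][t];
}

// Hypothetical host helper. Returns an empty vector if the input is
// empty or longer than kScanThreads, which this sketch does not handle.
std::vector<int> inclusiveScan(const std::vector<int>& data) {
    int n = static_cast<int>(data.size());
    if (n == 0 || n > kScanThreads) return {};

    int* dIn = nullptr;
    int* dOut = nullptr;
    cudaMalloc(&dIn, n * sizeof(int));
    cudaMalloc(&dOut, n * sizeof(int));
    cudaMemcpy(dIn, data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    inclusiveScanBlock<<<1, kScanThreads>>>(dIn, dOut, n);

    std::vector<int> result(n);
    cudaMemcpy(result.data(), dOut, n * sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(dIn);
    cudaFree(dOut);
    return result;
}
```

The last element of the scan is the total sum, which is why the prefix sum is related to your problem. Scanning more than one block's worth of data needs an extra pass over per-block totals, which this sketch leaves out.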
