
In OpenCL, what does mem_fence() do, as opposed to barrier()?

Updated: 2022-06-27 02:41:20

To try to put it more clearly (hopefully):

mem_fence() waits until all reads/writes to local and/or global memory made by the calling work-item prior to mem_fence() are visible to all threads in the work-group.

From: http://developer.download.nvidia.com/presentations/2009/SIGGRAPH/asia/3_OpenCL_Programming.pdf

Memory operations can be reordered to suit the device they are running on. The spec states (basically) that any reordering of memory operations must ensure that memory is in a consistent state within a single work-item. However, what if you (for example) perform a store operation and the value ends up living in a work-item-specific cache for now, until a better time presents itself to write through to local/global memory? If you try to load from that memory, the work-item that wrote the value has it in its cache, so no problem. But other work-items within the work-group don't, so they may read the wrong value. Placing a memory fence ensures that, at the time of the memory fence call, local/global memory (as per the parameters) will be made consistent (any caches will be flushed, and any reordering will take into account that you expect other threads may need to access this data after this point).
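A minimal sketch of the distinction in OpenCL C (a hypothetical kernel; the names `scratch` and `out` are illustrative, not from the original answer):

```c
// mem_fence() only orders the calling work-item's memory operations;
// barrier() additionally makes every work-item in the group wait.
__kernel void fence_example(__global int *out, __local int *scratch)
{
    int lid = get_local_id(0);

    scratch[lid] = lid * 2;  // store to local memory

    // Ensure the store above is committed to local memory (not just
    // sitting in some per-work-item cache) before any of THIS
    // work-item's later loads/stores. It does NOT make other
    // work-items wait, so it is not enough on its own here.
    mem_fence(CLK_LOCAL_MEM_FENCE);

    // barrier() both fences and blocks until the whole work-group
    // arrives, so after it every work-item can safely read scratch[].
    barrier(CLK_LOCAL_MEM_FENCE);

    out[get_global_id(0)] = scratch[(lid + 1) % get_local_size(0)];
}
```

The rule of thumb this illustrates: use barrier() when work-items must wait for each other; use mem_fence() when a single work-item just needs its own writes to become visible in a defined order.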

I admit it is still confusing, and I won't swear that my understanding is 100% correct, but I think it is at least the general idea.

Follow-up:

I found this link which talks about CUDA memory fences, but the same general idea applies to OpenCL:

http://developer.download.nvidia.com/compute/cuda/2_3/toolkit/docs/NVIDIA_CUDA_Programming_Guide_2.3.pdf

Look at the section B.5 Memory Fence Functions.

They have a code example that computes the sum of an array of numbers in one call. The code is set up to compute a partial sum in each work-group. Then, if there is more summing to do, the code has the last work-group do the work.

So, basically, two things are done in each work-group: a partial sum, which updates a global variable, then an atomic increment of a counter global variable.

After that, if there is any more work left to do, the work-group that incremented the counter to the value of ("number of work-groups" - 1) is taken to be the last work-group. That work-group goes on to finish up.

Now, the problem (as they explain it) is that, because of memory reordering and/or caching, the counter may get incremented and the last work-group may begin to do its work before that partial-sum global variable has had its most recent value written to global memory.

A memory fence will ensure that the value of that partial sum variable is consistent for all threads before moving past the fence.
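The pattern from the CUDA guide can be sketched in OpenCL terms roughly as follows (this is an illustrative adaptation, not the guide's code; kernel and buffer names like `partials` and `counter` are assumptions, and `counter` must be zeroed by the host before launch):

```c
// Single-pass sum: each work-group writes its partial sum, fences,
// then atomically bumps a counter. The group that sees the counter
// reach (num_groups - 1) knows all other partials are visible, and
// finishes the reduction.
__kernel void sum_all(__global const float *in,
                      __global float *partials,
                      __global float *result,
                      volatile __global uint *counter,
                      __local float *scratch,
                      uint n)
{
    uint lid = get_local_id(0);
    uint gid = get_global_id(0);

    // 1) Ordinary work-group tree reduction into scratch[0].
    scratch[lid] = (gid < n) ? in[gid] : 0.0f;
    for (uint s = get_local_size(0) / 2; s > 0; s >>= 1) {
        barrier(CLK_LOCAL_MEM_FENCE);
        if (lid < s)
            scratch[lid] += scratch[lid + s];
    }

    __local bool is_last;
    if (lid == 0) {
        partials[get_group_id(0)] = scratch[0];

        // 2) The crucial fence: make the partial sum visible in
        //    global memory BEFORE the counter increment below can
        //    be observed by other work-groups.
        mem_fence(CLK_GLOBAL_MEM_FENCE);

        uint ticket = atomic_inc(counter);
        is_last = (ticket == get_num_groups(0) - 1);
    }
    barrier(CLK_LOCAL_MEM_FENCE);

    // 3) Only the last work-group to finish sums the partials.
    if (is_last && lid == 0) {
        float total = 0.0f;
        for (uint g = 0; g < get_num_groups(0); ++g)
            total += partials[g];
        *result = total;
    }
}
```

Without the mem_fence(), the atomic increment could become visible before the partials[] store, and the "last" work-group could read a stale partial sum, which is exactly the failure mode the guide describes. (Strictly speaking, OpenCL 1.x only guarantees cross-work-group visibility at kernel boundaries, so treat this as a sketch of the idea rather than portable code.)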

I hope this makes some sense. It is confusing.