
CUDA, mutex locks, and atomicCAS()

Updated: 2023-11-13 22:50:10

The loop in question

do
{
    // each thread tries to swap *mutex from 0 (unlocked) to i + 1, where i is its thread index
    atomicCAS(mutex, 0, 1 + i);
}
while (*mutex != i + 1);    // spin until this thread owns the lock

This loop would work fine if it were running on the host (CPU) side; once thread 0 sets *mutex to 1, the other threads would wait exactly until thread 0 sets *mutex back to 0.

However, GPU threads are not as independent as their CPU counterparts. GPU threads are grouped into sets of 32, commonly referred to as warps. Threads in the same warp execute instructions in complete lock-step. If a control statement such as if or while causes some of the 32 threads to diverge from the rest, the remaining threads will wait (i.e. sleep) until the divergent threads finish. [1]

Going back to the loop in question, thread 0 acquires the lock and exits the loop, but it then becomes inactive because threads 1, 2, ..., 31 are still stuck in the while loop. So thread 0 never reaches the line *mutex = 0 that releases the lock, and the other 31 threads loop forever.
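
For context, the deadlock arises in a surrounding kernel shaped roughly like the sketch below; the kernel name, the counter argument, and the critical-section body are illustrative assumptions, while the lock and unlock lines are the ones discussed above.

__global__ void broken_lock_kernel(int *mutex, int *counter)
{
    int i = threadIdx.x;            // assumed: i is the thread's index

    // every thread of the warp tries to take the lock at the same time
    do
    {
        atomicCAS(mutex, 0, 1 + i);
    }
    while (*mutex != i + 1);

    // critical section (illustrative)
    *counter += 1;

    // the release that thread 0 never reaches once the warp diverges
    *mutex = 0;
}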

A potential solution is to make a local copy of the shared resource in question, let the 32 threads modify the copy, and then pick one thread to 'push' the change back to the shared resource. A __shared__ variable is ideal in this situation: it is shared by the threads belonging to the same block but not by threads of other blocks, and we can use __syncthreads() to control the member threads' access to it.
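
A minimal sketch of that idea follows; the kernel name, the global counter, and the use of atomicAdd on the block-local copy are assumptions for illustration, and only the __shared__/__syncthreads() pattern comes from the text above.

__global__ void shared_copy_kernel(int *global_counter)
{
    // block-local copy of the contended resource
    __shared__ int local_counter;

    if (threadIdx.x == 0)
        local_counter = 0;          // one thread initializes the copy
    __syncthreads();                // make the initialization visible to the whole block

    // all threads of the block update the copy; shared-memory atomics
    // keep the block's own threads from racing with each other
    atomicAdd(&local_counter, 1);
    __syncthreads();                // wait until every thread has contributed

    if (threadIdx.x == 0)
        atomicAdd(global_counter, local_counter);   // one thread pushes the result back
}

Because no thread ever spins on a per-thread lock value, the warp never diverges inside a loop, and the deadlock described above cannot occur.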

[1] CUDA C++ Best Practices Guide - Branching and Divergence

Avoid different execution paths within the same warp.

Any flow control instruction (if, switch, do, for, while) can significantly affect the instruction throughput by causing threads of the same warp to diverge; that is, to follow different execution paths. If this happens, the different execution paths must be serialized, since all of the threads of a warp share a program counter; this increases the total number of instructions executed for this warp. When all the different execution paths have completed, the threads converge back to the same execution path.