CUDA shared memory: inconsistent results

Updated: 2023-02-01 21:53:04

When I compile the code you have now on linux, I get the following warning:

t614.cu(55): warning: __shared__ memory variable with non-empty constructor or destructor (potential race between threads)

This type of warning should not be ignored. It is associated with this line of code:

__shared__ double3c blockmean[THREADS_PER_BLOCK]; 

Since the initialization of these objects stored in shared memory (by the constructor) will happen in some arbitrary order, and you have no barrier between that and the subsequent code that will also set these values, unpredictable things (*) can happen.

If I insert a __syncthreads() in the code to isolate the constructor activity from the subsequent code, I get the expected results:

__shared__ double3c blockmean[THREADS_PER_BLOCK];    
int index = threadIdx.x + blockIdx.x * blockDim.x;
__syncthreads();  // add this line
if (index < m.size_x * m.size_y * m.size_z) blockmean[threadIdx.x] = m.data[index] = double3c(0, 1, 0);
else blockmean[threadIdx.x] = double3c(0,0,0);
__syncthreads();

This still leaves us with the warning, however. A modification to fix this and make the warning go away would be to allocate the necessary __shared__ size dynamically. Change your shared memory declaration to this:

extern __shared__ double3c blockmean[];

And modify your kernel invocation:

KernelCalculateMeanFieldBlock <<<blocknum, THREADS_PER_BLOCK, THREADS_PER_BLOCK*sizeof(double3c)>>> (m, cu_mean);

This will eliminate the warning, produce the correct result, and avoid the unnecessary constructor traffic on the shared memory variable. (And the additional __syncthreads() described above is no longer necessary.)
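To make the dynamic allocation concrete, here is a minimal sketch of the pattern. It is not the code from the question: the `double3c` definition, the value of `THREADS_PER_BLOCK`, the kernel `blockSumSketch`, and the host helper `runBlockSumSketch` are all assumptions made for illustration, and the kernel computes a simple per-block sum rather than the original mean-field calculation.

```cuda
#include <cassert>
#include <vector>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 256  // assumed value; must be a power of two for the reduction below

// Hypothetical stand-in for the question's double3c type. The user-provided
// default constructor is the kind that triggers the warning when the type is
// used in a statically sized __shared__ array.
struct double3c {
    double x, y, z;
    __host__ __device__ double3c() : x(0), y(0), z(0) {}
    __host__ __device__ double3c(double x_, double y_, double z_) : x(x_), y(y_), z(z_) {}
};

__global__ void blockSumSketch(const double3c *in, double3c *out, int n)
{
    // Dynamically sized shared array: its size comes from the third launch
    // parameter, and every element is written explicitly below before use.
    extern __shared__ double3c blockmean[];

    int tid = threadIdx.x;
    int index = tid + blockIdx.x * blockDim.x;

    if (index < n) blockmean[tid] = in[index];
    else blockmean[tid] = double3c(0, 0, 0);
    __syncthreads();  // all elements are written before any thread reads them

    // Tree reduction within the block (assumes blockDim.x is a power of two).
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) {
            blockmean[tid].x += blockmean[tid + s].x;
            blockmean[tid].y += blockmean[tid + s].y;
            blockmean[tid].z += blockmean[tid + s].z;
        }
        __syncthreads();
    }
    if (tid == 0) out[blockIdx.x] = blockmean[0];
}

// Hypothetical host helper: fills n elements with (0, 1, 0), runs the kernel,
// and returns the sum of the y components over all blocks.
double runBlockSumSketch(int n)
{
    assert(n > 0);
    int blocks = (n + THREADS_PER_BLOCK - 1) / THREADS_PER_BLOCK;
    std::vector<double3c> h_in(n, double3c(0, 1, 0));
    std::vector<double3c> h_out(blocks);

    double3c *d_in = nullptr, *d_out = nullptr;
    cudaMalloc(&d_in, n * sizeof(double3c));
    cudaMalloc(&d_out, blocks * sizeof(double3c));
    cudaMemcpy(d_in, h_in.data(), n * sizeof(double3c), cudaMemcpyHostToDevice);

    // Third launch parameter: bytes of dynamic shared memory per block.
    size_t smemBytes = THREADS_PER_BLOCK * sizeof(double3c);
    blockSumSketch<<<blocks, THREADS_PER_BLOCK, smemBytes>>>(d_in, d_out, n);

    // This blocking copy also waits for the kernel to finish.
    cudaMemcpy(h_out.data(), d_out, blocks * sizeof(double3c), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);

    double sum = 0.0;
    for (int b = 0; b < blocks; ++b) sum += h_out[b].y;
    return sum;
}
```

Calling `runBlockSumSketch(1000)` from a `main` function should return a sum equal to the number of input elements, since every input has y = 1 and the out-of-range threads in the last block contribute zeros. The one barrier that remains is the one separating the per-thread writes from the reduction reads.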

* Regarding "unpredictable things": if you look under the hood by inspecting either the generated SASS (cuobjdump -sass ...) or the PTX (**) (nvcc -ptx ...), you will see that each thread initializes the entire __shared__ array of objects to zero (the behavior of the default constructor). As a result, some of the threads (that is, warps) can race ahead and begin populating the shared memory area according to this line:

if (index < m.size_x * m.size_y * m.size_z) blockmean[threadIdx.x] = m.data[index] = double3c(0, 1, 0);

Then, when other warps begin executing, those threads will clear out the entire shared memory array again. This racing behavior leads to unpredictable results.

** I don't normally suggest judging code behavior by inspecting the PTX, but in this case it is equally instructive. The final compile stages will not optimize away the constructor behavior.