CUDA constant memory error



There are several problems here. It is probably easier to start by showing the "correct" way to use those two constant arrays, then explain why what you did doesn't work. So the kernel should look like this:

__global__ void kernel(int *X, int *out, int N)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<N )
    {
        // A and B are referenced directly through their file-scope
        // __constant__ symbols, not passed in as kernel arguments
        out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
    }
}
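
For completeness, here is roughly what the rest would look like: A and B declared at file scope with __constant__, filled from the host with cudaMemcpyToSymbol, and the kernel launched as usual. This is a sketch only; the int element type, the hA/hB host buffers, and the launch configuration are assumptions, since the original declarations aren't shown:

__constant__ int A[3000];
__constant__ int B[3000];

// Host code: fill the constant arrays by symbol, not through a pointer.
int hA[3000], hB[3000];                // host staging buffers (assumed)
/* ... fill hA and hB ... */
cudaMemcpyToSymbol(A, hA, sizeof(hA));
cudaMemcpyToSymbol(B, hB, sizeof(hB));

// X and out are device pointers allocated elsewhere (assumed)
kernel<<<(N + 255)/256, 256>>>(X, out, N);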


i.e., don't try passing A and B to the kernel. The reasons are as follows:


  1. Somewhat confusingly, A and B in host code are not valid device memory addresses. They are host symbols which provide hooks into a runtime device symbol lookup. It is illegal to pass them to a kernel; if you want their device memory address, you must use cudaGetSymbolAddress to retrieve it at runtime (see the sketch after this list).
  2. Even if you did call cudaGetSymbolAddress and retrieve the symbols' device addresses in constant memory, you shouldn't pass them to a kernel as an argument, because doing so would not yield uniform memory access in the running kernel. Correct use of constant memory requires the compiler to emit special PTX instructions, and the compiler will only do that when it knows that a particular global memory location is in constant memory. If you pass a constant memory address by value as an argument, the __constant__ property is lost and the compiler can't know to produce the correct load instructions.
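
For point 1, this is roughly what retrieving the device address behind a __constant__ symbol looks like (a minimal sketch, error checking omitted). As point 2 explains, loads through such a pointer would be ordinary global loads, not constant-cache loads:

int *dA = NULL;
cudaGetSymbolAddress((void **)&dA, A);  // dA now holds the device address of A
// dA could be passed to a kernel, but dereferencing it there produces plain
// global memory loads; the __constant__ property does not travel with it.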


Once you get this working, you will find it is terribly slow, and if you profile it you will find a very high degree of instruction replay and serialization. The whole idea of using constant memory is that you can exploit a constant cache broadcast mechanism in cases when every thread in a warp accesses the same value in constant memory. Your example is the complete opposite of that: every thread is accessing a different value. Regular global memory will be faster in such a use case, as in the sketch below. Also be aware that the performance of the modulo operator on current GPUs is poor, and you should avoid it wherever possible.
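
A minimal sketch of that global-memory variant (the kernel name and int element type are assumptions; note the modulo operation remains and is itself a hotspot):

__global__ void kernel_global(const int *A, const int *B, const int *X, int *out, int N)
{
    int tid = threadIdx.x + blockIdx.x*blockDim.x;
    if( tid<N )
    {
        // Each thread reads a different element, so the constant-cache
        // broadcast cannot help; plain global loads are the better fit here.
        out[tid] = A[tid%3000]*X[tid] + B[tid%3000];
    }
}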