CUDA global memory access speed

Updated: 2022-10-26 10:42:36

Here is some simple CUDA code.
I am measuring how long it takes to access global memory, both reading and writing.

Below is the kernel function, test1().

__global__ void test1(int *direct_map)   
{  
    int index = 10;  
    int index2;  

    for(int j=0; j<1024; j++)  
    {  
        index2 = direct_map[index];  
        direct_map[index] = -1;  
        index = index2;  
    }  
}  

direct_map is a 683*1024 linear matrix, and each pixel holds an offset value used to access another pixel.
index and index2 are not contiguous addresses.

This kernel function takes about 600 microseconds.

But if I delete the line
direct_map[index] = -1;

it takes just 27 microseconds.

I think the code has already read the value of direct_map[index] from global memory in

index2 = direct_map[index];

so the value should already be in the L2 cache.
So when executing "direct_map[index] = -1;", the write should be fast.

I also tested random writes to global memory, test2().

It takes about 120 microseconds.

__global__ void test2(int *direct_map)   
{  
    int index = 10;  

    for(int j=0; j<1024; j++)  
    {  
        direct_map[index] = -1;  
        index = j*683 + j/3 - 1;  
    }  
}

So I don't understand why test1() takes more than 600 microseconds. Thank you.

When you delete the code line:

direct_map[index] = -1; 

your kernel isn't doing anything useful. The compiler can recognize this and eliminate most of the code associated with the kernel launch. That modification to the kernel code means that the kernel no longer affects any global state and the code is effectively useless, from the compiler's perspective.
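If the goal is to time only the reads, one option is to give the loads an observable effect so the compiler has to keep them. The following is a minimal sketch, not the asker's code: the `out` parameter, the helper names, the placeholder data, and the `<<<1, 1>>>` launch configuration are all assumptions (the question does not show the host code), and error checking is left out for brevity.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Read-only variant of test1(): follows the offset chain without writing to
// direct_map. The final index is stored to `out` (a hypothetical extra
// parameter) so the loads affect global state and cannot be discarded.
__global__ void test1_read_only(const int *direct_map, int *out)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
    {
        index = direct_map[index];
    }
    *out = index;
}

// Placeholder data (not the asker's): every element holds a valid,
// scattered index into the same array.
std::vector<int> make_placeholder_map(int n)
{
    std::vector<int> host_map(n);
    for (int i = 0; i < n; i++)
    {
        host_map[i] = (int)(((long long)i * 7919 + 12345) % n);
    }
    return host_map;
}

// Runs the kernel once with a single thread (assumed launch configuration),
// returns the final index, and reports the elapsed time in microseconds.
int run_read_only(const std::vector<int> &host_map, float *elapsed_us)
{
    const size_t bytes = host_map.size() * sizeof(int);
    int *direct_map = nullptr;
    int *out = nullptr;
    cudaMalloc(&direct_map, bytes);
    cudaMalloc(&out, sizeof(int));
    cudaMemcpy(direct_map, host_map.data(), bytes, cudaMemcpyHostToDevice);

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    test1_read_only<<<1, 1>>>(direct_map, out);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);  // reported in milliseconds
    *elapsed_us = ms * 1000.0f;

    int last = -1;
    cudaMemcpy(&last, out, sizeof(int), cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(direct_map);
    cudaFree(out);
    return last;
}

int main()
{
    const int n = 683 * 1024;
    std::vector<int> host_map = make_placeholder_map(n);
    float elapsed_us = 0.0f;
    int last = run_read_only(host_map, &elapsed_us);
    assert(last >= 0 && last < n);
    printf("read-only chain: %.1f us, final index %d\n", elapsed_us, last);
    return 0;
}
```

Because the final index is written to global memory, the whole chain of dependent loads has to execute, so the measured time should reflect the reads rather than an emptied-out kernel.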

You can verify this by dumping the assembly code that the compiler generates in each case, for example with cuobjdump -sass myexecutable
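To make that comparison concrete, both variants can be placed in one file and inspected side by side. This is a sketch under assumptions: the name `test1_no_store`, the helper `launch_both_variants`, the placeholder data, and the `<<<1, 1>>>` launch are illustrative and not from the question. The file could be built with something like `nvcc -o myexecutable compare.cu` (the file name is hypothetical) and then dumped with `cuobjdump -sass myexecutable`.

```cuda
#include <cassert>
#include <cstdio>
#include <vector>
#include <cuda_runtime.h>

// Original kernel from the question: each iteration loads, then stores.
__global__ void test1(int *direct_map)
{
    int index = 10;
    int index2;
    for (int j = 0; j < 1024; j++)
    {
        index2 = direct_map[index];
        direct_map[index] = -1;
        index = index2;
    }
}

// Same kernel with the store deleted. Nothing it computes is written back
// to global memory, so the compiler is free to discard the loop.
__global__ void test1_no_store(int *direct_map)
{
    int index = 10;
    for (int j = 0; j < 1024; j++)
    {
        index = direct_map[index];
    }
}

// Launches both variants on placeholder data (not the asker's) and returns
// true if they complete without a CUDA error.
bool launch_both_variants()
{
    const int n = 683 * 1024;
    // A stride of 683 means the 1024 steps starting at index 10 never visit
    // the same element twice, so test1() never reads a -1 it wrote earlier.
    std::vector<int> host_map(n);
    for (int i = 0; i < n; i++)
    {
        host_map[i] = (i + 683) % n;
    }

    int *direct_map = nullptr;
    if (cudaMalloc(&direct_map, n * sizeof(int)) != cudaSuccess)
    {
        return false;
    }
    cudaMemcpy(direct_map, host_map.data(), n * sizeof(int),
               cudaMemcpyHostToDevice);

    test1_no_store<<<1, 1>>>(direct_map);  // run first: leaves the data intact
    test1<<<1, 1>>>(direct_map);
    cudaError_t status = cudaDeviceSynchronize();

    cudaFree(direct_map);
    return status == cudaSuccess;
}

int main()
{
    bool ok = launch_both_variants();
    printf("both kernels %s\n", ok ? "completed" : "failed");
    return ok ? 0 : 1;
}
```

In the dump, the section for test1 should contain a loop with load and store instructions, while the section for test1_no_store would be expected to be much shorter. The exact instructions depend on the compiler version and target architecture.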

Anytime you make a small change to the code and see a large change in timing, you should suspect that the change you made has allowed the compiler to make different optimization decisions.