
Dynamic parallelism in CUDA: global memory allocation to exchange data with child kernels



I have the following code:

__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    [...]

    double phi_cap1, phi_cap2;

    if(i<M) {

        for(int m=0; m<(2*K+1); m++) {

            [calculate phi_cap1];

            for(int n=0; n<(2*K+1); n++) {

                [calculate phi_cap2];

                [calculate phi_cap=phi_cap1*phi_cap2];

                [use phi_cap];

            }
        }
    }
}

I would like to use dynamic parallelism on a Kepler K20 card to dispatch the computation of phi_cap1 and phi_cap2 in parallel to a bunch of threads, to reduce the computation time. K=6 in my code, so I'm launching a single block of 13x13 threads.

Following the CUDA Dynamic Parallelism Programming Guide, I'm allocating a matrix phi_cap of 169 elements (formed by the products of phi_cap1 and phi_cap2), needed to exchange the data with the child kernel, in global memory. Indeed, quoting the guide,

As a general rule, all storage passed to a child kernel should be allocated explicitly from the global-memory heap.

I then ended up with the following code:

__global__ void interpolation(const double2* __restrict__ data, double2* __restrict__ result, const double* __restrict__ x, const double* __restrict__ y, const int N1, const int N2, int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    [...]

    dim3 dimBlock(2*K+1,2*K+1); dim3 dimGrid(1,1);

    if(i<M) {

        // one device-side allocation per parent thread -- this is the expensive part
        double* phi_cap; cudaMalloc((void**)&phi_cap,sizeof(double)*(2*K+1)*(2*K+1));

        child_kernel<<<dimGrid,dimBlock>>>(cc_diff1,cc_diff2,phi_cap);
        cudaDeviceSynchronize();    // wait for the child kernel before reading phi_cap

        for(int m=0; m<(2*K+1); m++) {

            for(int n=0; n<(2*K+1); n++) {

                [use phi_cap];

            }
        }

        cudaFree(phi_cap);          // memory allocated on the device must be freed on the device
    }
}
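For reference, a hypothetical child_kernel consistent with the launch above might look as follows. The phi_cap1 and phi_cap2 formulas are elided in the original, so the bodies below are placeholders, and the types of cc_diff1 and cc_diff2 are assumptions:

__global__ void child_kernel(const double* cc_diff1, const double* cc_diff2, double* phi_cap)
{
    // one child thread per (m, n) pair of the 13x13 tile
    int m = threadIdx.y;
    int n = threadIdx.x;

    double phi_cap1 = cc_diff1[m];    // placeholder for [calculate phi_cap1]
    double phi_cap2 = cc_diff2[n];    // placeholder for [calculate phi_cap2]

    // the product is handed back to the parent through global memory
    phi_cap[m*blockDim.x + n] = phi_cap1*phi_cap2;
}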

The problem is that the first routine takes 5 ms to run, while the second routine, even with the child_kernel launch commented out, takes 23 ms, with practically all of the time spent in the cudaMalloc call.

Since dynamic parallelism often requires allocating memory to exchange data with child kernels, and the only option seems to be global memory, which takes this long to allocate, it seems to me that data exchange is a serious bottleneck for the usefulness of dynamic parallelism, unless there is a way to work around the global memory allocation issue.

The question, then, is: is there any workaround for this issue, namely the high cost of allocating global memory from within a kernel? Thanks

SOLUTION PROPOSED IN THE COMMENTS

Allocate the required global memory from outside the parent kernel. I have verified that this is much faster.

You are calling cudaMalloc from each thread where i < M, which means that you are making M cudaMalloc calls.

The bigger M is, the worse this is going to get.

Instead, you could make a single cudaMalloc call from the first thread of each block, allocating M times the size you used before (actually, in your case you should allocate more, so that each thread's slice is properly aligned). After that, sync the threads, and you can launch each child kernel with a correctly computed phi_cap address.
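A minimal sketch of that pattern, assuming each block allocates just enough for its own blockDim.x threads and that the child kernel body is a trivial placeholder; the names parent_kernel, child_kernel, and SLICE_BYTES are illustrative, not taken from the original code. Note that device-side cudaDeviceSynchronize() was valid on Kepler-era CUDA, though it has since been removed from the device runtime:

// compile with relocatable device code, e.g.: nvcc -arch=sm_35 -rdc=true demo.cu -lcudadevrt
#define K 6
#define TILE (2*K+1)                               // 13x13 products per parent thread
// round each thread's slice up to a 128-byte multiple so every slice stays aligned
#define SLICE_BYTES (((TILE*TILE*sizeof(double) + 127) / 128) * 128)

__global__ void child_kernel(double* phi_cap)
{
    // placeholder body: each of the 13x13 child threads writes one product
    int t = threadIdx.y * blockDim.x + threadIdx.x;
    phi_cap[t] = (double)t;
}

__global__ void parent_kernel(int M)
{
    int i = threadIdx.x + blockDim.x * blockIdx.x;

    __shared__ char* block_buf;                    // one buffer shared by the whole block

    if (threadIdx.x == 0) {
        // one device-side allocation per block instead of one per thread
        cudaMalloc((void**)&block_buf, SLICE_BYTES * blockDim.x);
    }
    __syncthreads();                               // every thread waits for the pointer

    if (i < M) {
        double* phi_cap = (double*)(block_buf + threadIdx.x * SLICE_BYTES);
        child_kernel<<<1, dim3(TILE, TILE)>>>(phi_cap);
        cudaDeviceSynchronize();                   // wait for the child before reading phi_cap
        // ... [use phi_cap] ...
    }

    __syncthreads();                               // all threads are done with the buffer
    if (threadIdx.x == 0) cudaFree(block_buf);     // device-allocated memory is freed on the device
}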

Alternatively (if your specific situation allows you to hold on to the allocated memory between kernel calls), you could allocate the memory once, outside of the kernel, and reuse it. That would be a lot quicker. If M varies between kernel calls, you could allocate as much as you would need for the biggest M.
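A sketch of this host-side variant; it assumes the interpolation kernel gains a phi_cap_all parameter in place of its internal cudaMalloc, and that M_max, num_calls, grid, and block are defined by the surrounding host code:

// illustrative host-side setup: allocate once, sized for the largest M,
// then reuse the buffer across kernel calls
double* phi_cap_all = NULL;
size_t bytes_per_thread = sizeof(double)*(2*K+1)*(2*K+1);
cudaMalloc((void**)&phi_cap_all, bytes_per_thread*M_max);     // once, up front

for(int call=0; call<num_calls; call++) {
    // interpolation is assumed to take phi_cap_all as an extra argument and to
    // use phi_cap_all + i*(2*K+1)*(2*K+1) in place of its internal cudaMalloc
    interpolation<<<grid,block>>>(data, result, x, y, N1, N2, M, phi_cap_all);
    cudaDeviceSynchronize();
}

cudaFree(phi_cap_all);                                        // once, at the end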