Parallelizing a loop in CUDA (1D naive convolution)



Can someone please help me convert a nested for loop into a CUDA kernel? Here is the function I am trying to convert into a CUDA kernel:

// Convolution on Host
void conv(int* A, int* B, int* out) {

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}

I have tried very hard to parallelize this code.
Here is my attempt:

__global__ void conv_Kernel(int* A, int* B, int* out) {

    int i = blockIdx.x;
    int j = threadIdx.x;

    __shared__ int temp[N];

    __syncthreads();
    temp[i + j] = A[i] * B[j];
    __syncthreads();

    int sum = 0;
    for (int k = 0; k < N; k++)
        sum += temp[k];
    out[i + j] = sum;
}

Your cpu conv function appears to be doing this (for N = 4, as an example):

A0B0  A0B1  A0B2  A0B3                   +     ^
      A1B0  A1B1  A1B2  A1B3             +     N
            A2B0  A2B1  A2B2  A2B3       +    rows
                  A3B0  A3B1  A3B2  A3B3 =     v
------------------------------------------
out0  out1  out2  out3  out4  out5  out6
    <-  (2*N)-1 columns ->

Your convolution is (to me) distinguished by the fact that it is convolving 2 signals of equal length. Since the GPU likes to work on "large" problems, this implies N should be large. However one immediate problem with your conv_Kernel realization is that it implies that the block dimension will be used to index into A, and the thread dimension will be used to index into B. But the thread dimension (threadIdx.x) is limited to 512 or 1024 for current CUDA GPUs. This will relegate us to only solving pretty small problems.
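If you want to confirm that limit on whatever device you happen to be using, it can be queried at runtime. A minimal sketch (my illustration, compiled with nvcc; it is not part of the code below):

#include <stdio.h>

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);                            // properties of device 0
  printf("maxThreadsPerBlock: %d\n", prop.maxThreadsPerBlock);  // 1024 on current GPUs
  printf("maxGridSize[0]:     %d\n", prop.maxGridSize[0]);      // limit on the grid x-dimension (number of blocks)
  return 0;
}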

There are various other problems with your realization. One problem is that the shared memory size allocated is not enough to fit the i+j range (which can go from 0->2*(N-1)). This is trivial to fix of course, but the more serious issue is that I don't see a way to map your arithmetic onto anything resembling the desired pattern above. After spending a little while thinking about your kernel, I discarded it.

The convolution problem has a great deal of research associated with it, and can be optimized in various ways for massively parallel architectures like the GPU. Therefore I will focus on two very simple realizations which immediately suggest themselves based on the diagram above.

The first realization is simply to re-create the diagram above. We will create an intermediate temp array to store all the individual AxBy products, calculating and storing these products in the conv_Kernel. We will then launch a second kernel (sum_Kernel) which simply sums columns of the temp array, to produce the various out values. The first kernel requires N threads, which will successively calculate each row of the above diagram, in a slanting fashion as we iterate through N for-loop iterations, one per row. The second kernel requires (2*N)-1 threads, one for each column/out value.
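To make the indexing in that first kernel concrete: temp is treated as N rows of width 2*N, and the product for diagram row i computed by thread idx lands at flat offset i*2*N + i + idx, i.e. row i is shifted right by i columns, exactly as in the diagram. (At N=4096 with int data that temp array is 2*4096*4096*4 bytes, roughly 128MB.) A host-side restatement of the same two-step scheme, assuming temp starts out zeroed like d_temp does in main() (my illustration only, not the code in the listing below):

// host-side restatement of the two-kernel scheme; temp has 2*N*N elements, initialized to zero
void conv_two_step(const int *A, const int *B, int *temp, int *out){
    for (int i = 0; i < N; i++)            // what conv_Kernel does: row i, shifted right by i columns
        for (int idx = 0; idx < N; idx++)
            temp[i*2*N + i + idx] = A[i] * B[idx];
    for (int c = 0; c < (2*N)-1; c++){     // what sum_Kernel does: out[c] is the column sum over N rows
        int s = 0;
        for (int i = 0; i < N; i++) s += temp[i*2*N + c];
        out[c] = s;
    }
}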

My second realization (conv_Kernel2) dispenses with the need for a temp array, and just assigns one thread to each column/out value, and iterates through the N rows, computing the necessary products row-by-row, and summing those products "on-the-fly". The sum result is then directly stored in the out array.
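The test conv_Kernel2 uses to decide whether a given row contributes is just a bounds check on B's index: it is equivalent to requiring 0 <= idx-i < N. So the same per-thread work could be written with clamped loop bounds instead of a per-iteration guard (my restatement for clarity, not the kernel actually used or timed below):

__global__ void conv_Kernel2_clamped(const int *A, const int *B, int *out){
    int idx = threadIdx.x + blockDim.x*blockIdx.x;
    if (idx < (2*N)-1){
      int lo = (idx < N) ? 0 : idx - N + 1;   // max(0, idx-N+1)
      int hi = (idx < N) ? idx : N - 1;       // min(idx, N-1)
      int my_sum = 0;
      for (int i = lo; i <= hi; i++)
        my_sum += A[i] * B[idx - i];          // B's index stays in [0, N)
      out[idx] = my_sum;
    }
}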

Considering only the calculations, and not the time required for data movement/initialization, the GPU realizations begin to be faster than the naive single-threaded CPU implementation at around N=512 on a K20x GPU, which is what I happened to be using. The second realization is also commended by the fact that the only data movement required is for A, B, and the result. The first realization requires in addition the temp array to be allocated and initialized to all zeroes. The size of the temp array is proportional to N*N, so the second realization also has the benefit that it does not require this temporary storage.

Here's a fully worked test case, running and timing the CPU realization you provided plus the two slightly different GPU realizations that I created:

$ cat t617.cu
#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

#define N 4096
#define RG 10
#define USECPSEC 1000000ULL
#define nTPB 256


void conv(int* A, int* B, int* out) {

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}

unsigned long long dtime_usec(unsigned long long prev){
  timeval tv1;
  gettimeofday(&tv1,0);
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}


__global__ void conv_Kernel(int* A, int *B, int* temp) {

    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < N){
      int my_B = B[idx];
      for (int i = 0; i < N; i++)
        temp[idx + (i*2*N) + i] = my_B * A[i];
      }
}

__global__ void sum_Kernel(int *temp, int *out){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < (2*N)-1){
      int my_sum = 0;
      for (int i = 0; i < N; i++) my_sum += temp[idx + (i*2*N)];
      out[idx] = my_sum;}
}

__global__ void conv_Kernel2(int *A, int *B, int *out){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < (2*N)-1){
      int my_sum = 0;
      for (int i = 0; i < N; i++)
        if (((idx < N) && (i <= idx)) || ((idx >= N) && (i > (idx-N)))) my_sum += A[i]*B[idx-i];
      out[idx] = my_sum;
    }
}

int main(){

  int *h_A, *d_A, *h_result, *d_result, *result, *h_B, *d_B, *A, *B, *d_temp;

  B   = (int *)malloc(N*sizeof(int));
  A   = (int *)malloc(N*sizeof(int));
  h_A = (int *)malloc(N*sizeof(int));
  h_B = (int *)malloc(N*sizeof(int));
  h_result = (int *)malloc(2*N*sizeof(int));
  result   = (int *)malloc(2*N*sizeof(int));

  cudaMalloc(&d_B, N*sizeof(int));
  cudaMalloc(&d_A, N*sizeof(int));
  cudaMalloc(&d_result, 2*N*sizeof(int));
  cudaMalloc(&d_temp, 2*N*N*sizeof(int));

  for (int i=0; i < N; i++){
    A[i] = rand()%RG;
    B[i] = rand()%RG;
    h_A[i] = A[i];
    h_B[i] = B[i];}

  for (int i=0; i < 2*N; i++){
    result[i]   = 0;
    h_result[i] = 0;}

  unsigned long long cpu_time = dtime_usec(0);
  conv(A, B, result);
  cpu_time = dtime_usec(cpu_time);

  cudaMemcpy(d_A, h_A, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, N*sizeof(int), cudaMemcpyHostToDevice);
  cudaMemset(d_result, 0, 2*N*sizeof(int));
  cudaMemset(d_temp, 0, 2*N*N*sizeof(int));

  unsigned long long gpu_time = dtime_usec(0);
  conv_Kernel<<<(N+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_temp);
  sum_Kernel<<<((2*(N-1))+nTPB-1)/nTPB, nTPB>>>(d_temp, d_result);
  cudaDeviceSynchronize();
  gpu_time = dtime_usec(gpu_time);

  cudaMemcpy(h_result, d_result, 2*N*sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 2*N; i++) if (result[i] != h_result[i]) {printf("mismatch at %d, cpu: %d, gpu %d\n", i, result[i], h_result[i]); return 1;}
  printf("Finished.  Results match.  cpu time: %ldus, gpu  time: %ldus\n", cpu_time, gpu_time);


  cudaMemset(d_result, 0, 2*N*sizeof(int)); // just for error checking; kernel2 requires no initialization of the result

  gpu_time = dtime_usec(0);
  conv_Kernel2<<<((2*(N-1))+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_result);
  cudaDeviceSynchronize();
  gpu_time = dtime_usec(gpu_time);

  cudaMemcpy(h_result, d_result, 2*N*sizeof(int), cudaMemcpyDeviceToHost);
  for (int i = 0; i < 2*N; i++) if (result[i] != h_result[i]) {printf("mismatch2 at %d, cpu: %d, gpu %d\n", i, result[i], h_result[i]); return 1;}
  printf("Finished.  Results match.  cpu time: %ldus, gpu2 time: %ldus\n", cpu_time, gpu_time);
  return 0;
}
$ nvcc -arch=sm_35 -o t617 t617.cu
$ ./t617
Finished.  Results match.  cpu time: 69059us, gpu  time: 3204us
Finished.  Results match.  cpu time: 69059us, gpu2 time: 1883us
$ nvcc -arch=sm_35 -O3 -o t617 t617.cu
$ ./t617
Finished.  Results match.  cpu time: 13750us, gpu  time: 3214us
Finished.  Results match.  cpu time: 13750us, gpu2 time: 1886us
$

(note that even just using the -O3 parameter makes a significant difference in the CPU code execution)

As I mentioned, I would consider both of my examples to also be quite "naive" GPU code (neither uses shared memory, for example), but they may give you some ideas for how to get started.

For brevity of presentation, I have dispensed with CUDA error checking. However, I would suggest that any time you are having trouble with a CUDA code, you perform proper CUDA error checking. In the case of your conv_Kernel, I believe it would have indicated some errors (if you tried to run it). As a quick test, you can always run any CUDA code with cuda-memcheck to see if any API errors are occurring.
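If it helps, a minimal sketch of such error checking (the cudaCheck macro name and the gridSize placeholder are mine, not something defined in the code above):

#define cudaCheck(call)                                               \
  do {                                                                \
    cudaError_t err__ = (call);                                       \
    if (err__ != cudaSuccess) {                                       \
      printf("CUDA error: %s at %s:%d\n",                             \
             cudaGetErrorString(err__), __FILE__, __LINE__);          \
      exit(1);                                                        \
    }                                                                 \
  } while (0)

// wrap runtime API calls:
cudaCheck(cudaMalloc(&d_A, N*sizeof(int)));
// a kernel launch itself returns no status, so check it like this:
conv_Kernel2<<<gridSize, nTPB>>>(d_A, d_B, d_result);
cudaCheck(cudaGetLastError());        // catches launch/configuration errors
cudaCheck(cudaDeviceSynchronize());   // catches errors raised during kernel execution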

EDIT: I tried a simple shared memory version of my conv_Kernel2 but it wasn't any faster. I believe the reason for this is that these data sets (at N=4096, A and B are 16Kbytes each, out is approximately 32Kbytes) are small enough to easily fit in the GPU L2 cache, with no thrashing.
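For reference, one way such a shared-memory variant might be structured is sketched below: each block stages a tile of A and the window of B it can touch for that tile, then accumulates tile by tile. This is a sketch only, written against the fixed-N listing above; it is not the exact code that was timed:

__global__ void conv_Kernel2_smem(const int * __restrict__ A, const int * __restrict__ B, int *out){
    __shared__ int sA[nTPB];       // tile of A
    __shared__ int sB[2*nTPB];     // window of B reachable by this block for the current tile
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    int my_sum = 0;
    for (int t = 0; t < N; t += nTPB){                        // tile over the i (row) dimension
      int ai = t + threadIdx.x;
      sA[threadIdx.x] = (ai < N) ? A[ai] : 0;                 // pad out-of-range with 0
      int bBase = blockIdx.x*blockDim.x - (t + nTPB - 1);     // smallest B index this tile can need
      for (int k = threadIdx.x; k < 2*nTPB; k += blockDim.x){
        int bi = bBase + k;
        sB[k] = ((bi >= 0) && (bi < N)) ? B[bi] : 0;
      }
      __syncthreads();
      if (idx < (2*N)-1)
        for (int i = t; i < t + nTPB; i++){
          int bi = idx - i;                                   // B index for this row
          if ((bi >= 0) && (bi < N)) my_sum += sA[i-t] * sB[bi - bBase];
        }
      __syncthreads();
    }
    if (idx < (2*N)-1) out[idx] = my_sum;
}

It can be launched with the same grid as conv_Kernel2. As noted above, at these problem sizes the data already fits comfortably in the L2 cache, so a variant like this would not necessarily be any faster.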

However, for newer architectures (cc 3.5 and newer) the CUDA compiler can sometimes make additional optimizations if the read-only input data is properly identified as such to the kernel. Therefore if we change my conv_Kernel2 definition to:

__global__ void conv_Kernel2(const int * __restrict__ A, const int * __restrict__ B, int *out){

then I witness slightly improved execution times, in my case:

$ ./t617
Finished.  Results match.  cpu time: 13792us, gpu  time: 3209us
Finished.  Results match.  cpu time: 13792us, gpu2 time: 1626us
$

I created a modified version of the code which does the following:

  1. N is specified on the command line
  2. only the cpu conv and gpu conv_Kernel2 are included.
  3. time cost to move the data to/from the GPU is included in the GPU timing measurement
  4. a typedef ... mytype; is provided so that the code can be re-compiled easily to test behavior with various datatypes.
  5. a "speedup factor" is printed out, which is the cpu time divided by the gpu time.

modified code:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <sys/time.h>

// RG*RG*MAXN must fit within mytype

#define MAXN 100000
#define RG 10
#define USECPSEC 1000000ULL
#define nTPB 256

typedef double mytype;

void conv(const mytype *A, const mytype *B, mytype* out, int N) {

    for (int i = 0; i < N; ++i)
        for (int j = 0; j < N; ++j)
            out[i + j] += A[i] * B[j];
}

unsigned long long dtime_usec(unsigned long long prev){
  timeval tv1;
  gettimeofday(&tv1,0);
  return ((tv1.tv_sec * USECPSEC)+tv1.tv_usec) - prev;
}



__global__ void conv_Kernel2(const mytype * __restrict__ A, const mytype * __restrict__ B, mytype *out, const int N){
    int idx = threadIdx.x+blockDim.x*blockIdx.x;
    if (idx < (2*N)-1){
      mytype my_sum = 0;
      for (int i = 0; i < N; i++)
        if (((idx < N) && (i <= idx)) || ((idx >= N) && (i > (idx-N)))) my_sum += A[i]*B[idx-i];
      out[idx] = my_sum;
    }
}

int main(int argc, char *argv[]){


  mytype *h_A, *d_A, *h_result, *d_result, *result, *h_B, *d_B, *A, *B;
  if (argc != 2) {printf("must specify N on the command line\n"); return 1;}
  int my_N = atoi(argv[1]);
  if ((my_N < 1) || (my_N > MAXN)) {printf("N out of range\n"); return 1;}
  B   = (mytype *)malloc(my_N*sizeof(mytype));
  A   = (mytype *)malloc(my_N*sizeof(mytype));
  h_A = (mytype *)malloc(my_N*sizeof(mytype));
  h_B = (mytype *)malloc(my_N*sizeof(mytype));
  h_result = (mytype *)malloc(2*my_N*sizeof(mytype));
  result   = (mytype *)malloc(2*my_N*sizeof(mytype));

  cudaMalloc(&d_B, my_N*sizeof(mytype));
  cudaMalloc(&d_A, my_N*sizeof(mytype));
  cudaMalloc(&d_result, 2*my_N*sizeof(mytype));

  for (int i=0; i < my_N; i++){
    A[i] = rand()%RG;
    B[i] = rand()%RG;
    h_A[i] = A[i];
    h_B[i] = B[i];}

  for (int i=0; i < 2*my_N; i++){
    result[i]   = 0;
    h_result[i] = 0;}

  unsigned long long cpu_time = dtime_usec(0);
  conv(A, B, result, my_N);
  cpu_time = dtime_usec(cpu_time);

  cudaMemset(d_result, 0, 2*my_N*sizeof(mytype));

  unsigned long long gpu_time = dtime_usec(0);
  cudaMemcpy(d_A, h_A, my_N*sizeof(mytype), cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, h_B, my_N*sizeof(mytype), cudaMemcpyHostToDevice);
  conv_Kernel2<<<(((2*my_N)-1)+nTPB-1)/nTPB,nTPB>>>(d_A, d_B, d_result, my_N);
  cudaDeviceSynchronize();
  cudaMemcpy(h_result, d_result, 2*my_N*sizeof(mytype), cudaMemcpyDeviceToHost);
  gpu_time = dtime_usec(gpu_time);

  for (int i = 0; i < 2*my_N; i++) if (result[i] != h_result[i]) {printf("mismatch2 at %d, cpu: %f, gpu %f\n", i, result[i], h_result[i]); return 1;}
  printf("Finished.  Results match.  cpu time: %lluus, gpu time: %lluus\n", cpu_time, gpu_time);
  printf("cpu/gpu = %f\n", cpu_time/(float)gpu_time);
  return 0;
}
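
The modified code can be compiled the same way as before, now passing N on the command line at run time, e.g. (the file name here is just illustrative):

$ nvcc -O3 -arch=sm_35 -o t617_mod t617_mod.cu
$ ./t617_mod 4096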