且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

使用CUDA显示GPU优于CPU的最简单可能示例

更新时间:2022-10-27 18:28:07

First off, I'll reiterate my comment: GPUs are high bandwidth, high latency. Trying to get the GPU to beat a CPU for a nanosecond job (or even a millisecond or second job) is completely missing the point of doing GPU stuff. Below is some simple code, but to really appreciate the performance benefits of GPU, you'll need a big problem size to amortize the startup costs over... otherwise, it's meaningless. I can beat a Ferrari in a two foot race, simply because it take some time to turn the key, start the engine and push the pedal. That doesn't mean I'm faster than the Ferrari in any meaningful way.

Use something like this in C++:

  #define N (1024*1024)
  #define M (1000000)
  int main()
  {
     float data[N]; int count = 0;
     for(int i = 0; i < N; i++)
     {
        data[i] = 1.0f * i / N;
        for(int j = 0; j < M; j++)
        {
           data[i] = data[i] * data[i] - 0.25f;
        }
     }
     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
  }

Use something like this in CUDA/C:

  #define N (1024*1024)
  #define M (1000000)

  __global__ void cudakernel(float *buf)
  {
     int i = threadIdx.x + blockIdx.x * blockDim.x;
     buf[i] = 1.0f * i / N;
     for(int j = 0; j < M; j++)
        buf[i] = buf[i] * buf[i] - 0.25f;
  }

  int main()
  {
     float data[N]; int count = 0;
     float *d_data;
     cudaMalloc(&d_data, N * sizeof(float));
     cudakernel<<<N/256, 256>>>(d_data);
     cudaMemcpy(data, d_data, N * sizeof(float), cudaMemcpyDeviceToHost);
     cudaFree(d_data); 

     int sel;
     printf("Enter an index: ");
     scanf("%d", &sel);
     printf("data[%d] = %f\n", sel, data[sel]);
  }

If that doesn't work, try making N and M bigger, or changing 256 to 128 or 512.