
Finding the maximum value of a matrix in CUDA

Updated: 2022-11-14 11:56:22

With CUDA, parallel reduction is tricky; segmented parallel reduction is trickier. Now you are doing it in 2-D, and your segment/window is smaller than the thread block.

For large window sizes, I don't think it is a problem. You could use one thread block to reduce one window. For example, if you have a 16x16 window, you could simply use a 16x16 thread block. If you have an even larger window size, for example 64x64, you could still use a 16x16 thread block: first reduce the 64x64 window to 16x16 elements during data loading, then reduce those 16x16 elements to 1 scalar within the thread block.
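A minimal sketch of this two-stage idea, assuming 64x64 windows, 16x16 thread blocks, and a row-major float image (the kernel name and indexing are illustrative, not part of the original answer):

__global__ void window_max_64x64(const float* input, float* output,
                                 int height, int width, int out_width) {
  const int tx = threadIdx.x, ty = threadIdx.y;
  const int tid = ty * 16 + tx;
  const int win_x = blockIdx.x * 64;  // top-left corner of this block's window
  const int win_y = blockIdx.y * 64;

  // Step 1: during loading, each thread strides over the 64x64 window with a
  // stride of 16 in both directions, so neighbouring threads read neighbouring
  // columns (coalesced) and only 256 partial maxima are left.
  float tmax = -1e20f;
  for (int dy = ty; dy < 64; dy += 16)
    for (int dx = tx; dx < 64; dx += 16) {
      int y = win_y + dy, x = win_x + dx;
      if (y < height && x < width)
        tmax = max(tmax, input[y * width + x]);
    }

  // Step 2: ordinary shared-memory tree reduction of the 256 partial maxima.
  __shared__ float smax[256];
  smax[tid] = tmax;
  __syncthreads();
  for (int shift = 128; shift > 0; shift /= 2) {
    if (tid < shift) smax[tid] = max(smax[tid], smax[tid + shift]);
    __syncthreads();
  }
  if (tid == 0) output[blockIdx.y * out_width + blockIdx.x] = smax[0];
}

Launched with dim3 blocksize(16, 16) and dim3 gridsize((width + 63) / 64, (height + 63) / 64), each block writes the maximum of one 64x64 window.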

For window sizes smaller than the block size, you will have to reduce multiple windows per thread block for higher performance. You could use your current block/grid configuration, where each 256-thread block (16x16) is responsible for 16 4x4 windows. But this will not be optimal, because each 32-thread warp is laid out as a 2x16 strip of the block (warp 0 covers rows 0-1, warp 1 rows 2-3, and so on), so a 4x4 window straddles two warps and touches only eight lanes of each. This is not good for coalesced global memory access, and it is hard to map a 2x16 warp to one or more 4x4 windows for efficient parallel reduction.

Alternatively, I would suggest you use a 1-D thread block with 256 threads. Every m threads reduce one m x m window. Then you could use a 2-D grid to cover the whole image.

const int m = window_size;
dim3 blocksize(256);
dim3 gridsize((img_width+255)/256, (img_height+m-1)/m);

In the kernel function, you can:

  1. reduce each m x m window to a 1 x m vector during global data loading;
  2. use the tree reduction method to reduce the 1 x m vector to a scalar (a warp-shuffle variant of this step is sketched after the list).
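
On GPUs with warp shuffle instructions, step 2 can also be done without shared memory, because the m threads of one window sit in the same warp when m is a power of 2 and m <= 32. A minimal sketch under those assumptions (window_max_shfl is an illustrative name; all 32 lanes of the warp must reach the call, as they do in the kernel below):

__device__ float window_max_shfl(float v, int win_size) {
  // Butterfly reduction: each xor-shuffle halves the number of distinct values
  // within an aligned group of win_size lanes; afterwards every lane of the
  // window holds the window maximum.
  for (int shift = win_size / 2; shift > 0; shift /= 2)
    v = max(v, __shfl_xor_sync(0xffffffff, v, shift));
  return v;
}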

The following code is a conceptual demo which works when m is a power of 2 and m <= 32. You could further modify it for arbitrary m and better boundary checking.

#include <assert.h>
#include <cuda.h>
#include <thrust/device_vector.h>

__global__ void calculate_emax_kernel(const float* input, float* output,
                                      int height, int width, int win_size,
                                      int out_width) {
  const int tid = threadIdx.x;
  const int i = blockIdx.y * win_size;   // first row of this block's window strip
  const int j = blockIdx.x * 256 + tid;  // global column handled by this thread
  const int win_id = j % win_size;       // position of this thread within its window

  __shared__ float smax[256];

  // Step 1: while loading, each thread reduces its own column of the window,
  // so every win_size x win_size window becomes a 1 x win_size vector.
  float tmax = -1e20f;
  if (j < width) {
    for (int tile = 0; tile < win_size; tile++) {
      if (i + tile < height) {
        tmax = max(tmax, input[(i + tile) * width + j]);
      }
    }
  }
  smax[tid] = tmax;
  __syncwarp();  // each window's win_size threads sit in one warp (win_size <= 32),
                 // so warp-level synchronization between steps is sufficient

  // Step 2: tree reduction of each 1 x win_size vector down to a scalar.
  for (int shift = win_size / 2; shift > 0; shift /= 2) {
    if (win_id < shift) {
      smax[tid] = max(smax[tid], smax[tid + shift]);
    }
    __syncwarp();
  }

  // The first thread of each window writes that window's maximum.
  if (win_id == 0 && j < width) {
    output[blockIdx.y * out_width + (j / win_size)] = smax[tid];
  }
}

int main() {
  const int height = 1024;
  const int width = 1024;
  const int m = 4;  // window size
  thrust::device_vector<float> in(height * width);
  // one output element per m x m window
  thrust::device_vector<float> out(
      ((height + m - 1) / m) * ((width + m - 1) / m));

  dim3 blocksize(256);
  dim3 gridsize((width + 255) / 256, (height + m - 1) / m);

  // the demo kernel assumes m is a power of 2 and m <= 32
  assert(m == 2 || m == 4 || m == 8 || m == 16 || m == 32);
  calculate_emax_kernel<<<gridsize, blocksize>>>(
      thrust::raw_pointer_cast(in.data()),
      thrust::raw_pointer_cast(out.data()),
      height, width, m, (width + m - 1) / m);

  return 0;
}
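
As a quick sanity check (not part of the original answer; the helper name and the thrust::host_vector copies are just one way to do it), you could copy the result back and compare a window against a plain CPU loop:

#include <algorithm>
#include <cstdio>
#include <thrust/host_vector.h>

// Hypothetical helper: recompute the maximum of window (wy, wx) on the host
// and compare it with the kernel's result.
bool check_window(const thrust::host_vector<float>& h_in,
                  const thrust::host_vector<float>& h_out,
                  int width, int m, int out_width, int wy, int wx) {
  float ref = -1e20f;
  for (int r = 0; r < m; r++)
    for (int c = 0; c < m; c++)
      ref = std::max(ref, h_in[(wy * m + r) * width + (wx * m + c)]);
  return ref == h_out[wy * out_width + wx];
}

// In main(), after filling `in` with real data (e.g. thrust::sequence) and
// launching the kernel:
//   cudaDeviceSynchronize();
//   thrust::host_vector<float> h_in = in, h_out = out;
//   printf("window (0,0): %s\n",
//          check_window(h_in, h_out, width, m, (width + m - 1) / m, 0, 0)
//              ? "ok" : "mismatch");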