且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

CUDA:瓷砖矩阵的矩阵乘法与共享存储器和矩阵大小是块的大小的非多

更新时间:2022-05-20 03:46:34

在矩阵的各个维度都没有瓷砖尺寸的倍数,那么就可能发生一些瓦片覆盖矩阵只是部分。瓷砖元素属于未完全重叠的瓦片外,应适当零-ED。因此,延长了code到arbitrarly大小的矩阵很容易,但一个简单的索引检查不等于。下面,我复制和粘贴我的版本的平铺矩阵的矩阵乘法内核具有任意大小的矩阵

When the matrix dimensions are not multiples of the tile dimensions, then it can happen that some tiles cover the matrices only partially. The tile elements falling outside the not-fully overlapping tiles should be properly zero-ed. So, extending your code to arbitrarly sized matrices is easy, but does not amount at a simple index check. Below, I'm copying and pasting my version of the tiled matrix-matrix multiplication kernel with arbitrarily sized matrices

__global__ void MatMul(float* A, float* B, float* C, int ARows, int ACols, int BRows, int BCols, int CRows, int CCols) {

    float CValue = 0;

    int Row = blockIdx.y*TILE_DIM + threadIdx.y;
    int Col = blockIdx.x*TILE_DIM + threadIdx.x;

    __shared__ float As[TILE_DIM][TILE_DIM];
    __shared__ float Bs[TILE_DIM][TILE_DIM];

    for (int k = 0; k < (TILE_DIM + ACols - 1)/TILE_DIM; k++) {

         if (k*TILE_DIM + threadIdx.x < ACols && Row < ARows)   As[threadIdx.y][threadIdx.x] = A[Row*ACols + k*TILE_DIM + threadIdx.x];
         else                                                   As[threadIdx.y][threadIdx.x] = 0.0;

         if (k*TILE_DIM + threadIdx.y < BRows && Col < BCols)   Bs[threadIdx.y][threadIdx.x] = B[(k*TILE_DIM + threadIdx.y)*BCols + Col];
         else                                                   Bs[threadIdx.y][threadIdx.x] = 0.0;

         __syncthreads();

         for (int n = 0; n < TILE_DIM; ++n) CValue += As[threadIdx.y][n] * Bs[n][threadIdx.x];

         __syncthreads();
    }

    if (Row < CRows && Col < CCols) C[((blockIdx.y * blockDim.y + threadIdx.y)*CCols)+(blockIdx.x*blockDim.x)+threadIdx.x]=CValue;
}