cuda.jit matrix multiplication crash

Updated: 2022-04-11 21:11:31

Some general comments:

  • I wouldn't declare that things are working unless I had verified numerical correctness.
  • I think it's good practice to do error checking. The cuda-memcheck tool can check for various errors, even with numba Python CUDA code. Your code throws errors, even for the sizes you suggest are working correctly.
  • A naive matrix multiplication has a fairly typical format and is covered in a variety of places such as in the CUDA programming guide. If I were working on this, I would use that as my starting point, if possible.
  • From a performance perspective, arbitrarily limiting a CUDA code to run on a single threadblock of 1024 threads is a really bad idea. I can't imagine why you would do that.
  • In spite of that, if I wanted to use an arbitrary grid arrangement to process a CUDA algorithm, a canonical technique would be a grid-stride loop (a minimal sketch of the pattern follows this list).

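A minimal sketch of the canonical grid-stride loop pattern in numba CUDA, separate from the full matrix-multiply example later in this answer; the kernel name (scale_kernel), the array, and the scaling operation are purely illustrative and not taken from the question:

from numba import cuda
import numpy as np

@cuda.jit
def scale_kernel(arr, factor):
    start = cuda.grid(1)        # this thread's absolute 1D index in the grid
    stride = cuda.gridsize(1)   # total number of threads in the launched grid
    # each thread strides over the array, so any grid size covers every element
    for i in range(start, arr.shape[0], stride):
        arr[i] *= factor

# example launch: the grid shape can be chosen independently of the data size
d_arr = cuda.to_device(np.ones(100000, dtype=np.float32))
scale_kernel[40, 256](d_arr, 2.0)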

With respect to your code, a few concerns presented themselves immediately:

  1. For a canonical matrix multiplication, I would normally expect to derive the computation range from the result (C) matrix, not the A matrix. If you restrict yourself to the X*Xt case, then I think your use of A should be OK. In the general case it is not.

  2. It's quite evident to me that you have indexing problems. I'm not going to try to sort them all out or even identify them all, but I have already indicated one problem. Your tid variable covers a range of 0..1023 due to your grid size choice, and that cannot possibly be correct for this indexing pattern: B[tid, col] (excepting the case where the number of rows of B is equal to 1024).

  3. It appears to me that you have the possibility for multiple threads writing to the same output location in the C matrix. CUDA does not sort this out for you. You should not expect having multiple threads write to the same output location to work correctly unless you have taken steps to make it so, either through atomics or via a classical parallel reduction (a reference-only sketch of the atomic form follows this list). And I wouldn't want to introduce either of those methods into this problem, so I consider the fundamental approach to be troublesome.

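For reference only, since the answer deliberately avoids this approach: an atomic accumulation into the output matrix in numba CUDA would look roughly like the fragment below, assuming each thread has computed a partial value val for some (row, col); these names are illustrative, not from the code under discussion:

# Illustrative fragment, not part of the recommended solution.
# cuda.atomic.add performs the read-modify-write on C[row, col] atomically,
# so concurrent updates from different threads are serialized rather than lost.
cuda.atomic.add(C, (row, col), val)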

There may be other problems as well. But because of consideration 3 above, rather than try to fix your code, I would rather start with the canonical naive matrix multiply and use a grid-stride loop.

Here's an example incorporating those ideas:

$ cat t59.py
import numpy as np
from numba import cuda, jit


@cuda.jit
def matmul_kernel(A, B, C):
    # one flat grid-stride loop over every element of the result matrix C
    num_of_threads = cuda.gridsize(1)
    tid = cuda.grid(1)
    rows_num = C.shape[0]
    cols_num = C.shape[1]
    idx_range = A.shape[1]
    for mid in range(tid, rows_num*cols_num, num_of_threads):
        # map the flat index back to a (row, col) position in C
        row = mid // cols_num
        col = mid - (row*cols_num)
        my_sum = 0.0
        # dot product of a row of A with a column of B
        for idx in range(0, idx_range):
            my_sum += A[row, idx] * B[idx, col]
        C[row, col] = my_sum

def matmul_gpu(X, Y):
    # Allocate the output matrix in GPU memory using cuda.to_device
    #
    # invoke the dot kernel with 1 threadBlock with 1024 threads
    #
    # copy the output matrix from GPU to cpu using copy_to_host()
    gpu_mat1 = cuda.to_device(X)
    gpu_mat2 = cuda.to_device(Y)
    res = np.zeros(shape=(X.shape[0], Y.shape[1]), dtype=np.float32)
    gpu_mult_res = cuda.to_device(res)
    threads_per_block = 1024
    blocks_per_grid = 1
    matmul_kernel[blocks_per_grid, threads_per_block](
        gpu_mat1, gpu_mat2, gpu_mult_res)
    mult_res = gpu_mult_res.copy_to_host()
    return mult_res

wA = 256
hA = 128
wB = hA
hB = wA


mA = np.ones(shape=(hA,wA), dtype=np.float32)
mB = np.ones(shape=(hB,wB), dtype=np.float32)
mC = matmul_gpu(mA,mB)
print(mC)
$ cuda-memcheck python t59.py
========= CUDA-MEMCHECK
[[ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 ...,
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]
 [ 256.  256.  256. ...,  256.  256.  256.]]
========= ERROR SUMMARY: 0 errors
$
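
In keeping with the first general comment above about verifying numerical correctness, a minimal host-side check against numpy could be appended to the script; this check is not part of the original answer and simply reuses the mA, mB, and mC arrays defined above:

# Hypothetical correctness check: compare the GPU result against numpy's
# reference matrix multiplication within floating-point tolerance.
ref = np.matmul(mA, mB)
assert np.allclose(mC, ref), "GPU result does not match the numpy reference"
print("max abs difference:", np.max(np.abs(mC - ref)))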