
Simple CUDA test keeps failing with "an illegal memory access was encountered" error

Updated: 2021-11-19 04:06:15


In general, your method of allocating and copying a doubly-subscripted C array won't work. cudaMemcpy expects flat, contiguously allocated, single-pointer, single-subscript arrays.


As a result of this confusion, the pointers being passed to your kernel (int** a, int** b) cannot be properly (safely) dereferenced twice:

b[0][0]=4;


When you try to do the above in kernel code, you get an illegal memory access, because you have not properly allocated a pointer-to-pointer style allocation on the device.


If you ran your code with cuda-memcheck, you would get another indication of the illegal memory access in the kernel code.


The usual suggestion in these cases is to "flatten" your 2D arrays to single dimension, and use appropriate pointer or index arithmetic to simulate 2D access. It is possible to allocate 2D arrays (i.e. double-subscript, double-pointer), but it is fairly involved (due in part to the need for a "deep copy"). If you'd like to learn more about that just search on the upper right hand corner for CUDA 2D array.


Here's a version of your code that has the array flattening for the device-side array:

$ cat t60.cu
#include <iostream>
#include <cuda.h>
#include <stdio.h>

using namespace std;

#define HANDLE_ERROR( err ) ( HandleError( err, __FILE__, __LINE__ ) )
void printVec(int** a, int n);

static void HandleError( cudaError_t err, const char *file, int line )
{
    if (err != cudaSuccess)
    {
        printf( "%s in %s at line %d\n", cudaGetErrorString( err ),
                file, line );
        exit( EXIT_FAILURE );
    }
}

void checkCUDAError(const char *msg)
{
    cudaError_t err = cudaGetLastError();
    if( cudaSuccess != err)
    {
        fprintf(stderr, "Cuda error: %s: %s.\n", msg,
                              cudaGetErrorString( err) );
        exit(EXIT_FAILURE);
    }
}

__global__ void MatrixMulti(int* b, unsigned n) {
    for (unsigned row = 0; row < n; row++)
        for (unsigned col = 0; col < n; col++)
            b[(row*n)+col] = col;  // simulate 2D access in kernel code
}

int main() {
    int N =10;
    int** array, *devarray;  // flatten device-side array
    array = new int*[N];
    array[0] = new int[N*N]; // host allocation needs to be contiguous
    for (int i = 1; i < N; i++) array[i] = array[i-1]+N; //2D on top of contiguous allocation

    HANDLE_ERROR ( cudaMalloc((void**)&devarray, N*N*sizeof(int) ) );
    HANDLE_ERROR ( cudaMemcpy(devarray, array[0], N*N*sizeof(int), cudaMemcpyHostToDevice) );
    MatrixMulti<<<1,1>>>(devarray, N);
    checkCUDAError("MatrixMulti kernel launch"); // catch launch errors
    HANDLE_ERROR ( cudaMemcpy(array[0], devarray, N*N*sizeof(int), cudaMemcpyDeviceToHost) );
    HANDLE_ERROR ( cudaFree(devarray) );
    printVec(array,N);
    delete[] array[0];
    delete[] array;

    return 0;
}

void printVec(int** a, int n) {
    for (int i = 0; i < n; i++) {
        for (int j = 0; j < n; j++) {
            cout << a[i][j] << " ";
        }
        cout << " " << endl;
    }
}
$ nvcc -arch=sm_20 -o t60 t60.cu
$ ./t60
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
0 1 2 3 4 5 6 7 8 9
$