且构网

分享程序员开发的那些事...
且构网 - 分享程序员编程开发的那些事

异常后重置Cuda上下文

更新时间:2023-02-27 13:34:23

在不可恢复(粘滞)CUDA错误之后恢复设备正常功能的唯一方法是终止发起的主机进程(即,发出导致

The only method to restore proper device functionality after a non-recoverable ("sticky") CUDA error is to terminate the host process that initiated (i.e. issued the CUDA runtime API calls that led to) the error.

因此,对于单进程应用程序,唯一的方法是终止应用程序。

Therefore, for a single-process application, the only method is to terminate the application.

应该有可能设计一个多进程应用程序,其中的初始(父)进程完全不使用CUDA ,并产生一个使用GPU的子进程。

It should be possible to design a multi-process application, where the initial ("parent") process makes no usage of CUDA whatsoever, and spawns a child process that uses the GPU. When the child process encounters an unrecoverable CUDA error, it must terminate.

父进程可以选择监视子进程,如果确定子进程已终止,则它必须终止。可以重新生成该过程并恢复CUDA功能行为。

The parent process can, optionally, monitor the child process. If it determines that the child process has terminated, it can re-spawn the process and restore CUDA functional behavior.

粘滞错误和非粘滞错误在其他地方均已涉及,例如此处

Sticky vs. non-sticky errors are covered elsewhere, such as here.

使用例如CUDA示例代码 simpleIPC 中提供了 fork()生成使用CUDA的子进程的信息。以下是从 simpleIPC 示例(对于Linux)汇编而成的粗略示例:

An example of a proper multi-process app that uses e.g. fork() to spawn a child process that uses CUDA is available in the CUDA sample code simpleIPC. Here is a rough example assembled from the simpleIPC example (for linux):

$ cat t477.cu
/*
 * Copyright 1993-2015 NVIDIA Corporation.  All rights reserved.
 *
 * Please refer to the NVIDIA end user license agreement (EULA) associated
 * with this source code for terms and conditions that govern your use of
 * this software. Any use, reproduction, disclosure, or distribution of
 * this software and related documentation outside the terms of the EULA
 * is strictly prohibited.
 *
 */

// Includes
#include <stdio.h>
#include <assert.h>

// CUDA runtime includes
#include <cuda_runtime_api.h>

// CUDA utilities and system includes
#include <helper_cuda.h>

#define MAX_DEVICES          1
#define PROCESSES_PER_DEVICE 1
#define DATA_BUF_SIZE        4096

#ifdef __linux
#include <unistd.h>
#include <sched.h>
#include <sys/mman.h>
#include <sys/wait.h>
#include <linux/version.h>

typedef struct ipcDevices_st
{
    int count;
    int results[MAX_DEVICES];
} ipcDevices_t;


// CUDA Kernel
__global__ void simpleKernel(int *dst, int *src, int num)
{
    // Dummy kernel
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    dst[idx] = src[idx] / num;
}


void runTest(int index, ipcDevices_t* s_devices)
{
    if (s_devices->results[0] == 0){
        simpleKernel<<<1,1>>>(NULL, NULL, 1);  // make a fault
        cudaDeviceSynchronize();
        s_devices->results[0] = 1;}
    else {
        int *d, *s;
        int n = 1;
        cudaMalloc(&d, n*sizeof(int));
        cudaMalloc(&s, n*sizeof(int));
        simpleKernel<<<1,1>>>(d, s, n);
        cudaError_t err = cudaDeviceSynchronize();
        if (err != cudaSuccess)
          s_devices->results[0] = 0;
        else
          s_devices->results[0] = 2;}
    cudaDeviceReset();
}
#endif

int main(int argc, char **argv)
{

    ipcDevices_t *s_devices = (ipcDevices_t *) mmap(NULL, sizeof(*s_devices),
                                                    PROT_READ | PROT_WRITE, MAP_SHARED | MAP_ANONYMOUS, 0, 0);
    assert(MAP_FAILED != s_devices);

    // We can't initialize CUDA before fork() so we need to spawn a new process
    s_devices->count = 1;
    s_devices->results[0] = 0;

    printf("\nSpawning child process\n");
    int index = 0;

    pid_t pid = fork();

    printf("> Process %3d\n", pid);
    if (pid == 0) { // child process
    // launch our test
      runTest(index, s_devices);
    }
    // Cleanup and shutdown
    else { // parent process
            int status;
            waitpid(pid, &status, 0);
            if (s_devices->results[0] < 2) {
              printf("first process launch reported error: %d\n", s_devices->results[0]);
              printf("respawn\n");
              pid_t newpid = fork();
              if (newpid == 0) { // child process
                    // launch our test
                 runTest(index, s_devices);
                  }
    // Cleanup and shutdown
              else { // parent process
                int status;
                waitpid(newpid, &status, 0);
                if (s_devices->results[0] < 2)
                  printf("second process launch reported error: %d\n", s_devices->results[0]);
                else
                  printf("second process launch successful\n");
                }

            }

    }

    printf("\nShutting down...\n");

    exit(EXIT_SUCCESS);

}
$ nvcc -I/usr/local/cuda/samples/common/inc t477.cu -o t477
$ ./t477

Spawning child process
> Process 10841
> Process   0

Shutting down...
first process launch reported error: 1
respawn

Shutting down...
second process launch successful

Shutting down...
$

Windows,唯一的更改应该是使用Windows IPC机制进行主机进程间通信。

For windows, the only changes need should be to use a windows IPC mechanism for host interprocess communication.