且构网

Memory data status after a CUDA exception

Updated: 2023-02-27 14:37:50

The behavior is undefined in the event of a CUDA error that corrupts the CUDA context.

This type of error is evident because it is "sticky": once it occurs, every single CUDA API call will return that error until the context is destroyed.

Non-sticky errors are cleared automatically after they are returned by a CUDA API call (with the exception of cudaPeekAtLastError). Any "crashed kernel" type of error (invalid access, unspecified launch failure, etc.) will be a sticky error. In your example, step 3 would (always) return an API error from the cudaMemcpy call that transfers variableA from device to host, so the results of the cudaMemcpy operation are undefined and unreliable; it is as if the cudaMemcpy operation also failed in some unspecified way.
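The question's original steps are not reproduced here, so the following is only a minimal sketch of the kind of sequence being discussed, assuming step 1 fills variableA, step 2 launches a faulting kernel, and step 3 copies variableA back. The kernel names `fillKernel` and `crashKernel` and the helper `copyAfterCrash` are hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical well-behaved kernel: writes i into data[i].
__global__ void fillKernel(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = i;
}

// Hypothetical faulting kernel: writes through whatever pointer it is given.
__global__ void crashKernel(int *p) { *p = 42; }

// Runs the assumed three steps and returns the status of the final
// device-to-host copy. n must not exceed the maximum block size.
cudaError_t copyAfterCrash(int *hostOut, int n) {
    int *variableA = nullptr;
    cudaMalloc(&variableA, n * sizeof(int));

    fillKernel<<<1, n>>>(variableA, n);        // step 1: valid work
    cudaDeviceSynchronize();

    crashKernel<<<1, 1>>>(nullptr);            // step 2: null-pointer write
    cudaError_t syncErr = cudaDeviceSynchronize();
    printf("sync after crash: %s\n", cudaGetErrorString(syncErr));

    // Step 3: the copy reports an error too, so hostOut must not be trusted.
    cudaError_t copyErr = cudaMemcpy(hostOut, variableA, n * sizeof(int),
                                     cudaMemcpyDeviceToHost);
    printf("cudaMemcpy:       %s\n", cudaGetErrorString(copyErr));

    // Later calls keep reporting an error while this context exists.
    printf("cudaGetLastError: %s\n", cudaGetErrorString(cudaGetLastError()));
    printf("cudaFree:         %s\n", cudaGetErrorString(cudaFree(variableA)));
    return copyErr;
}

int main() {
    int host[32] = {0};
    copyAfterCrash(host, 32);
    return 0;
}
```

Whatever ends up in `host` after the failed copy should be treated as garbage, even if it happens to look plausible on a particular run.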

Since the behavior of a corrupted CUDA context is undefined, there is no definition for the contents of any allocations or, in general, for the state of the machine after such an error.

An example of a non-sticky error might be an attempt to cudaMalloc more data than is available in device memory. Such an operation will return an out-of-memory error, but that error will be cleared after being returned, and subsequent (valid) CUDA API calls can complete successfully without returning an error. A non-sticky error does not corrupt the CUDA context, and the behavior of the CUDA context is exactly the same as if the invalid operation had never been requested.
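The out-of-memory case can be tried with a short sketch like the one below. It assumes that requesting more than the total device memory makes cudaMalloc fail on the system at hand; the helper name `allocateAfterOom` is hypothetical.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Attempts an oversized allocation, then a small one, and returns the
// status of the small allocation.
cudaError_t allocateAfterOom() {
    size_t freeBytes = 0, totalBytes = 0;
    cudaError_t err = cudaMemGetInfo(&freeBytes, &totalBytes);
    if (err != cudaSuccess) return err;

    // Assumption: total device memory plus 1 GiB cannot be satisfied.
    void *tooBig = nullptr;
    err = cudaMalloc(&tooBig, totalBytes + (size_t(1) << 30));
    printf("oversized cudaMalloc: %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(tooBig);

    // Read and reset the recorded error so that later cudaGetLastError
    // checks (for example after a kernel launch) start from a clean state.
    cudaGetLastError();

    // A valid request afterwards is expected to behave as if the failed
    // request had never been made.
    void *small = nullptr;
    err = cudaMalloc(&small, size_t(1) << 20);
    printf("small cudaMalloc:     %s\n", cudaGetErrorString(err));
    if (err == cudaSuccess) cudaFree(small);
    return err;
}

int main() {
    return allocateAfterOom() == cudaSuccess ? 0 : 1;
}
```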

This distinction between sticky and non-sticky errors is called out in many of the documented error code descriptions, for example:

Non-sticky, non-CUDA-context-corrupting:

cudaErrorMemoryAllocation = 2: The API call failed because it was unable to allocate enough memory to perform the requested operation.

Sticky, CUDA-context-corrupting:

cudaErrorMisalignedAddress = 74: The device encountered a load or store instruction on a memory address which is not aligned. The context cannot be used, so it must be destroyed (and a new one should be created). All existing device memory allocations from this context are invalid and must be reconstructed if the program is to continue using CUDA.
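When the error code alone does not settle which category applies, one way to probe at run time is to read the last error and then see whether a harmless call still fails. This is a rough sketch rather than an official mechanism; `contextLooksCorrupted`, `noopKernel` and `crashKernel` are hypothetical names, and the bad launch configuration is simply one convenient way to provoke a non-sticky error.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void noopKernel() {}
__global__ void crashKernel(int *p) { *p = 42; }

// Heuristic: cudaGetLastError() reads and resets a non-sticky error.
// If a following cudaDeviceSynchronize() still fails, the error did not
// go away and the context should be considered unusable.
bool contextLooksCorrupted() {
    cudaGetLastError();
    return cudaDeviceSynchronize() != cudaSuccess;
}

int main() {
    // Non-sticky case: a block size far above the hardware limit is
    // rejected at launch time, so no kernel ever runs.
    const int tooManyThreads = 1 << 20;
    noopKernel<<<1, tooManyThreads>>>();
    printf("launch error: %s\n", cudaGetErrorString(cudaPeekAtLastError()));
    printf("corrupted after bad launch config? %s\n",
           contextLooksCorrupted() ? "yes" : "no");

    // Sticky case: the kernel runs and faults on a null-pointer write.
    crashKernel<<<1, 1>>>(nullptr);
    printf("corrupted after faulting kernel?   %s\n",
           contextLooksCorrupted() ? "yes" : "no");
    return 0;
}
```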

Note that cudaDeviceReset() by itself is insufficient to restore a GPU to proper functional behavior. In order to accomplish that, the "owning" process must also terminate. See here.
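Because recovery requires the owning process to exit, one possible design is to confine all CUDA work to a child process that can be thrown away. The sketch below assumes a POSIX system (fork and waitpid) and that the parent makes no CUDA calls before forking, since a CUDA context created in the parent cannot be used in a forked child. The names `gpuWork`, `runIsolated` and `writeKernel` are hypothetical.

```cuda
#include <cstdio>
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>
#include <cuda_runtime.h>

__global__ void writeKernel(int *p) { *p = 42; }

// Runs only in the child. Exit code 0 = success, 1 = kernel failure,
// 2 = allocation failure.
int gpuWork(bool shouldCrash) {
    int *d = nullptr;
    if (cudaMalloc(&d, sizeof(int)) != cudaSuccess) return 2;
    writeKernel<<<1, 1>>>(shouldCrash ? nullptr : d);
    cudaError_t err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "child: %s\n", cudaGetErrorString(err));
        return 1;  // the context dies together with this process
    }
    cudaFree(d);
    return 0;
}

// Forks a child, runs gpuWork there, and returns the child's exit code
// (or -1 if the child could not be started or did not exit normally).
int runIsolated(bool shouldCrash) {
    pid_t pid = fork();
    if (pid < 0) return -1;
    if (pid == 0) _exit(gpuWork(shouldCrash));
    int status = 0;
    if (waitpid(pid, &status, 0) < 0) return -1;
    return WIFEXITED(status) ? WEXITSTATUS(status) : -1;
}

int main() {
    // The parent never touches CUDA, so each child gets a fresh context.
    int first = runIsolated(true);    // child hits a sticky error and exits
    int second = runIsolated(false);  // a new process starts clean
    printf("first child exit code:  %d\n", first);
    printf("second child exit code: %d\n", second);
    return 0;
}
```

The cost of this choice is that device allocations cannot be shared with the parent directly; results have to come back through ordinary inter-process channels such as pipes or files.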