Global Memory and Dynamic Global Memory Allocation in CUDA

Updated: 2021-09-16 00:16:16

Global memory can be allocated statically (using __device__), dynamically (using device malloc or new) and via the CUDA runtime (e.g. using cudaMalloc).
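
To make the three routes concrete, here is a minimal sketch of all three side by side, intended for compilation with nvcc; the names d_static, d_runtime, and device_alloc_kernel are placeholders for illustration, not identifiers from the original answer:

```cuda
#include <cuda_runtime.h>
#include <cstdlib>

__device__ int d_static[256];                  // 1. static allocation via __device__

__global__ void device_alloc_kernel()
{
    // 2. dynamic allocation with device-side malloc (new works the same way)
    int *p = static_cast<int *>(malloc(16 * sizeof(int)));
    if (p != nullptr) {
        p[0] = threadIdx.x;
        d_static[threadIdx.x] = p[0];          // the __device__ array is ordinary global memory
        free(p);                               // device-heap memory is also freed on the device
    }
}

int main()
{
    // 3. allocation through the CUDA runtime API
    int *d_runtime = nullptr;
    cudaMalloc(&d_runtime, 256 * sizeof(int));

    device_alloc_kernel<<<1, 32>>>();
    cudaDeviceSynchronize();

    cudaFree(d_runtime);
    return 0;
}
```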

All of the above methods allocate the same physical type of memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
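
As a rough illustration of that point (the names d_buf_static and scale are made up, not from the original), consecutive threads touching consecutive elements coalesce the same way whether the pointer came from cudaMalloc or from a __device__ definition:

```cuda
#include <cuda_runtime.h>

#define N (1 << 20)

__device__ float d_buf_static[N];   // statically allocated global memory

// Consecutive threads access consecutive 4-byte elements, so accesses to both
// buffers coalesce into the same kind of wide DRAM transactions.
__global__ void scale(float *buf_runtime, float factor, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        buf_runtime[i]  *= factor;  // cudaMalloc'ed global memory
        d_buf_static[i] *= factor;  // __device__ global memory: same access rules
    }
}

int main()
{
    float *buf_runtime = nullptr;
    cudaMalloc(&buf_runtime, N * sizeof(float));

    scale<<<(N + 255) / 256, 256>>>(buf_runtime, 2.0f, N);
    cudaDeviceSynchronize();

    cudaFree(buf_runtime);
    return 0;
}
```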

Since dynamic allocations take some non-zero time, you may be able to improve your code's performance by doing the allocations once, at the beginning of your program, either using the static (i.e. __device__) method or via the runtime API (i.e. cudaMalloc, etc.). This avoids spending time dynamically allocating memory in the performance-sensitive areas of your code.
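
A sketch of that allocate-once pattern might look like the following (step, work, and the iteration count are hypothetical; the real per-iteration work is elided):

```cuda
#include <cuda_runtime.h>

__global__ void step(float *work, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) work[i] += 1.0f;     // stand-in for the real per-iteration work
}

int main()
{
    const int n = 1 << 20;
    float *work = nullptr;

    // Allocate once, up front, outside the performance-sensitive loop.
    cudaMalloc(&work, n * sizeof(float));

    for (int iter = 0; iter < 1000; ++iter) {
        // No cudaMalloc or device-side malloc here: the buffer is reused,
        // so no allocation time is spent inside the hot loop.
        step<<<(n + 255) / 256, 256>>>(work, n);
    }
    cudaDeviceSynchronize();

    cudaFree(work);
    return 0;
}
```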

Also note that the three methods I outline, while having similar C/C++-like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime-API-allocated memory is accessed via ordinary cudaMalloc / cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
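
For example, the host-side access paths could be exercised roughly as follows (a sketch; d_symbol, d_runtime, and h_data are illustrative names):

```cuda
#include <cuda_runtime.h>

__device__ int d_symbol[16];   // statically allocated global memory

int main()
{
    int h_data[16] = {0};

    // 1. __device__ memory: host access goes through the symbol API.
    cudaMemcpyToSymbol(d_symbol, h_data, sizeof(h_data));
    cudaMemcpyFromSymbol(h_data, d_symbol, sizeof(h_data));

    // 2. Runtime-API memory: host access uses ordinary cudaMemcpy.
    int *d_runtime = nullptr;
    cudaMalloc(&d_runtime, sizeof(h_data));
    cudaMemcpy(d_runtime, h_data, sizeof(h_data), cudaMemcpyHostToDevice);
    cudaMemcpy(h_data, d_runtime, sizeof(h_data), cudaMemcpyDeviceToHost);
    cudaFree(d_runtime);

    // 3. Memory from device-side malloc/new sits behind a device pointer only;
    //    the runtime cannot copy it directly to or from the host, so a kernel
    //    would have to stage it through one of the buffers above.

    return 0;
}
```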