Updated: 2021-09-16 00:16:16
Global memory can be allocated statically (using __device__), dynamically (using device malloc or new), and via the CUDA runtime API (e.g. using cudaMalloc).
All of the above methods allocate the same type of physical memory, i.e. memory carved out of the on-board (but not on-chip) DRAM subsystem. This memory has the same access, coalescing, and caching rules regardless of how it is allocated (and therefore has the same general performance considerations).
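As a sketch, the three allocation paths might look like the following (the kernel and variable names are illustrative, not from the original answer; device-side malloc/new draw from the device heap, whose size can be raised with cudaDeviceSetLimit(cudaLimitMallocHeapSize, ...)):

```cuda
#include <cstdlib>

// 1. Static allocation: carved out of device DRAM at module load.
__device__ int static_data[256];

// 2. Dynamic in-kernel allocation with device malloc / new.
//    (alloc_kernel is a hypothetical name for illustration.)
__global__ void alloc_kernel()
{
    int *p = (int *)malloc(64 * sizeof(int)); // device malloc
    int *q = new int[64];                     // device new
    if (p && q) { p[0] = 1; q[0] = 2; }
    free(p);
    delete[] q;
}

int main()
{
    // 3. Runtime-API allocation: same physical DRAM as the other two.
    int *d_buf = nullptr;
    cudaMalloc(&d_buf, 256 * sizeof(int));

    alloc_kernel<<<1, 1>>>();
    cudaDeviceSynchronize();

    cudaFree(d_buf);
    return 0;
}
```

All three pointers/arrays above obey the same coalescing and caching rules once device code dereferences them.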
Since dynamic allocations take some non-zero time, there may be a performance improvement for your code from doing the allocations once, at the beginning of your program, either using the static (i.e. __device__) method or via the runtime API (i.e. cudaMalloc, etc.). This avoids taking the time to dynamically allocate memory during performance-sensitive areas of your code.
Also note that the 3 methods I outline, while having similar C/C++-like access methods from device code, have differing access methods from the host. Statically allocated memory is accessed using runtime API functions like cudaMemcpyToSymbol and cudaMemcpyFromSymbol, runtime-API-allocated memory is accessed via ordinary cudaMalloc/cudaMemcpy type functions, and dynamically allocated global memory (device new and malloc) is not directly accessible from the host.
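The host-side differences can be sketched as follows (variable names are illustrative):

```cuda
// Statically allocated: the host refers to it by symbol, not by pointer.
__device__ int sym[4];

int main()
{
    int h[4] = {1, 2, 3, 4};

    // Static allocation: host access goes through the symbol API.
    cudaMemcpyToSymbol(sym, h, sizeof(h));
    cudaMemcpyFromSymbol(h, sym, sizeof(h));

    // Runtime-API allocation: host access via ordinary cudaMemcpy.
    int *d_buf = nullptr;
    cudaMalloc(&d_buf, sizeof(h));
    cudaMemcpy(d_buf, h, sizeof(h), cudaMemcpyHostToDevice);
    cudaMemcpy(h, d_buf, sizeof(h), cudaMemcpyDeviceToHost);
    cudaFree(d_buf);

    // Memory from device-side malloc/new has no host-side handle: a pointer
    // returned by in-kernel malloc cannot be used with cudaMemcpy from the
    // host. To retrieve its contents, device code must first copy it into a
    // cudaMalloc'ed buffer.
    return 0;
}
```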