In CUDA, what is memory coalescing, and how is it achieved?

Updated: 2022-06-02 01:18:49

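This information likely applies only to compute capability 1.x or CUDA 2.0. More recent architectures and CUDA 3.0 have more sophisticated global memory access, and in fact "coalesced global loads" are not even profiled for these chips.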

Also, this logic can be applied to shared memory to avoid bank conflicts.

A coalesced memory transaction is one in which all of the threads in a half-warp access global memory at the same time. This is an oversimplification, but the correct way to do it is to have consecutive threads access consecutive memory addresses.

So, if threads 0, 1, 2, and 3 read global memory 0x0, 0x4, 0x8, and 0xc, it should be a coalesced read.
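As a rough illustration of that rule, here is a minimal sketch of two copy kernels. The names `copy_coalesced`, `copy_strided`, and `run_copy` are made up for this example and are not part of any library. In the first kernel thread i reads element i, which matches the 0x0, 0x4, 0x8, 0xc pattern for 4-byte values; in the second, neighbouring threads read addresses that are `stride` elements apart. Error checking is omitted to keep the sketch short.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Coalesced pattern: thread i reads element i, so neighbouring threads touch
// neighbouring 4-byte words (byte offsets 0x0, 0x4, 0x8, 0xc, ...).
__global__ void copy_coalesced(const float* in, float* out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Strided pattern: thread i reads element i * stride, so neighbouring
// threads read words that are `stride` elements apart.
__global__ void copy_strided(const float* in, float* out, int n, int stride) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i * stride];
}

// Hypothetical host helper: fills the input with 0, 1, 2, ... and returns
// the n values the kernel gathered.
std::vector<float> run_copy(int n, int stride) {
    std::vector<float> h_in(static_cast<size_t>(n) * stride);
    for (size_t k = 0; k < h_in.size(); ++k) h_in[k] = static_cast<float>(k);

    float* d_in = nullptr;
    float* d_out = nullptr;
    cudaMalloc(&d_in, h_in.size() * sizeof(float));
    cudaMalloc(&d_out, n * sizeof(float));
    cudaMemcpy(d_in, h_in.data(), h_in.size() * sizeof(float),
               cudaMemcpyHostToDevice);

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    if (stride == 1) {
        copy_coalesced<<<blocks, threads>>>(d_in, d_out, n);
    } else {
        copy_strided<<<blocks, threads>>>(d_in, d_out, n, stride);
    }

    std::vector<float> h_out(n);
    cudaMemcpy(h_out.data(), d_out, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d_in);
    cudaFree(d_out);
    return h_out;
}
```

Both kernels produce correct results; the difference is only in how the hardware can group the reads, which is why it tends to show up in a profiler rather than in the output.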

In a matrix example, keep in mind that you want your matrix to reside linearly in memory. You can do this however you want, and your memory access should reflect how your matrix is laid out. So, the 3x4 matrix below

0 1 2 3
4 5 6 7
8 9 a b

could be stored row after row, like this, so that (r,c) maps to memory offset (r*4 + c)

0 1 2 3 4 5 6 7 8 9 a b
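The row-by-row mapping can be sketched in code as follows. The helper `row_major_index`, the kernel `tag_elements`, and the host function `tag_matrix` are illustrative names, not anything from the original answer. The sketch assumes the whole matrix fits in a single thread block, with `threadIdx.x` as the column and `threadIdx.y` as the row, so that threads that are adjacent in x write adjacent offsets.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Row-major layout: element (r, c) of a matrix with `cols` columns lives at
// linear offset r * cols + c. For the 3x4 example this is r * 4 + c.
__host__ __device__ inline int row_major_index(int r, int c, int cols) {
    return r * cols + c;
}

// One thread per element. threadIdx.x is the column, so threads that are
// neighbours in x write neighbouring offsets.
__global__ void tag_elements(int* m, int rows, int cols) {
    int c = threadIdx.x;
    int r = threadIdx.y;
    if (r < rows && c < cols) {
        m[row_major_index(r, c, cols)] = r * 10 + c;  // tag with (row, col)
    }
}

// Hypothetical host helper; assumes rows * cols fits in one block.
std::vector<int> tag_matrix(int rows, int cols) {
    int* d_m = nullptr;
    cudaMalloc(&d_m, rows * cols * sizeof(int));

    dim3 block(cols, rows);
    tag_elements<<<1, block>>>(d_m, rows, cols);

    std::vector<int> h_m(rows * cols);
    cudaMemcpy(h_m.data(), d_m, h_m.size() * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_m);
    return h_m;
}
```

For the 3x4 matrix, `row_major_index(2, 3, 4)` is 11, which is the position of `b` in the flattened row above.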

Suppose you need to access each element once, and say you have four threads. Which threads will be used for which element? Probably either

thread 0:  0, 1, 2
thread 1:  3, 4, 5
thread 2:  6, 7, 8
thread 3:  9, a, b

or

thread 0:  0, 4, 8
thread 1:  1, 5, 9
thread 2:  2, 6, a
thread 3:  3, 7, b

Which is better? Which will result in coalesced reads, and which will not?

Either way, each thread makes three accesses. Let's look at the first access and see if the threads access memory consecutively. In the first option, the first access is 0, 3, 6, 9. Not consecutive, not coalesced. In the second option, it's 0, 1, 2, 3. Consecutive! Coalesced! Yay!
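The two assignments can be written as kernels to make the index arithmetic concrete. This is only a sketch: `sum_blocked`, `sum_interleaved`, and `run_sums` are hypothetical names, each thread simply sums the elements it visits, and the launch uses a single block of four threads purely to mirror the example (real hardware groups accesses per half-warp or warp, not per four threads).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Option 1: each thread owns a contiguous chunk. At step k, thread t reads
// element t * per_thread + k, so the first accesses are 0, 3, 6, 9.
__global__ void sum_blocked(const int* data, int* sums, int per_thread) {
    int t = threadIdx.x;
    int acc = 0;
    for (int k = 0; k < per_thread; ++k) acc += data[t * per_thread + k];
    sums[t] = acc;
}

// Option 2: at step k, thread t reads element k * blockDim.x + t, so the
// first accesses are 0, 1, 2, 3 (consecutive threads, consecutive elements).
__global__ void sum_interleaved(const int* data, int* sums, int per_thread) {
    int t = threadIdx.x;
    int acc = 0;
    for (int k = 0; k < per_thread; ++k) acc += data[k * blockDim.x + t];
    sums[t] = acc;
}

// Hypothetical host helper: data holds 0, 1, 2, ... and one block of
// `threads` threads is launched. Returns the per-thread sums.
std::vector<int> run_sums(bool interleaved, int threads, int per_thread) {
    int n = threads * per_thread;
    std::vector<int> h_data(n);
    for (int i = 0; i < n; ++i) h_data[i] = i;

    int* d_data = nullptr;
    int* d_sums = nullptr;
    cudaMalloc(&d_data, n * sizeof(int));
    cudaMalloc(&d_sums, threads * sizeof(int));
    cudaMemcpy(d_data, h_data.data(), n * sizeof(int), cudaMemcpyHostToDevice);

    if (interleaved) {
        sum_interleaved<<<1, threads>>>(d_data, d_sums, per_thread);
    } else {
        sum_blocked<<<1, threads>>>(d_data, d_sums, per_thread);
    }

    std::vector<int> h_sums(threads);
    cudaMemcpy(h_sums.data(), d_sums, threads * sizeof(int),
               cudaMemcpyDeviceToHost);
    cudaFree(d_data);
    cudaFree(d_sums);
    return h_sums;
}
```

With four threads and three elements each, `run_sums(false, 4, 3)` follows the first table (thread 0 sums 0, 1, 2) and `run_sums(true, 4, 3)` follows the second (thread 0 sums 0, 4, 8). Every element is still visited exactly once in both cases; only the order in which neighbouring threads touch memory changes.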

The best way is probably to write your kernel and then profile it to see if you have non-coalesced global loads and stores.