
How do CUDA blocks/warps/threads map onto CUDA cores?


Two good references are:

  1. NVIDIA Fermi Compute Architecture Whitepaper
  2. GF104 review

I'll try to answer each of your questions.

The programmer divides work into threads, threads into thread blocks, and thread blocks into grids. The compute work distributor allocates thread blocks to Streaming Multiprocessors (SMs). Once a thread block is distributed to an SM, the resources for the thread block are allocated (warps and shared memory) and threads are divided into groups of 32 threads called warps. Once a warp is allocated it is called an active warp. The two warp schedulers pick two active warps per cycle and dispatch warps to execution units. For more details on execution units and instruction dispatch see reference 1 (pp. 7-10) and reference 2.
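
As a rough illustration (my sketch, not part of the original answer), here is how that hierarchy shows up in device code; the launch configuration is arbitrary:

    #include <cstdio>

    __global__ void hierarchy()
    {
        int globalId    = blockIdx.x * blockDim.x + threadIdx.x; // thread within the grid
        int warpInBlock = threadIdx.x / warpSize;                // which warp of this block
        int lane        = threadIdx.x % warpSize;                // position within the warp
        if (lane == 0)
            printf("block %d, warp %d starts at global thread %d\n",
                   blockIdx.x, warpInBlock, globalId);
    }

    int main()
    {
        hierarchy<<<4, 96>>>();   // 4 blocks * 96 threads = 3 warps per block
        cudaDeviceSynchronize();
        return 0;
    }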

4'. There is a mapping between laneid (a thread's index within its warp) and a core.
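
For illustration, here are two common ways to read the lane id in device code (my sketch; the arithmetic form assumes a 1-D thread block):

    __device__ unsigned laneFromIndex()
    {
        return threadIdx.x % warpSize;   // lane = thread's position within its warp (1-D block)
    }

    __device__ unsigned laneFromPtx()
    {
        unsigned lane;
        asm volatile("mov.u32 %0, %%laneid;" : "=r"(lane));   // PTX special register %laneid
        return lane;
    }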

5'. If a warp contains fewer than 32 threads it will in most cases be executed the same as if it had 32 threads. Warps can have fewer than 32 active threads for several reasons: the number of threads per block is not divisible by 32, the program executes a divergent branch so threads that did not take the current path are marked inactive, or a thread in the warp exited.
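
A minimal sketch of the divergence case (mine, not from the original answer; __activemask() assumes CUDA 9 or later):

    #include <cstdio>

    __global__ void divergence()
    {
        unsigned lane = threadIdx.x % warpSize;
        if (lane < 16) {
            // Only lanes 0-15 take this path; lanes 16-31 are marked
            // inactive while it executes.
            unsigned mask = __activemask();   // lanes executing here
            if (lane == 0)
                printf("active lanes on this path: %d of %d\n",
                       __popc(mask), warpSize);   // prints 16 of 32
        }
    }

    int main()
    {
        divergence<<<1, 32>>>();
        cudaDeviceSynchronize();
        return 0;
    }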

6'. A thread block will be divided into WarpsPerBlock = (ThreadsPerBlock + WarpSize - 1) / WarpSize warps. There is no requirement for the warp schedulers to select two warps from the same thread block.
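
The same rounding-up division written as host code (WarpSize is 32 on all CUDA hardware to date):

    int warpsPerBlock(int threadsPerBlock, int warpSize = 32)
    {
        return (threadsPerBlock + warpSize - 1) / warpSize;   // ceil(threads / warpSize)
    }
    // warpsPerBlock(48) == 2, warpsPerBlock(64) == 2, warpsPerBlock(6) == 1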

7'. An execution unit will not stall on a memory operation. If a resource is not available when an instruction is ready to be dispatched, the instruction will be dispatched again later, when the resource is available. Warps can stall at barriers, on memory operations, texture operations, data dependencies, and so on. A stalled warp is ineligible to be selected by the warp scheduler. On Fermi it is useful to have at least 2 eligible warps per cycle so that the warp schedulers can issue an instruction.

See reference 2 for differences between a GTX480 and GTX560.

If you read the reference material (a few minutes), I think you will find that your goal does not make sense. I'll try to respond to your points.

1'. If you launch kernel<<<8, 48>>> you will get 8 blocks, each with 2 warps of 32 and 16 threads. There is no guarantee that these 8 blocks will be assigned to different SMs. If 2 blocks are allocated to an SM, then it is possible for each warp scheduler to select a warp and execute it. You will only use 32 of the 48 cores.
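
Here is a small sketch (mine) that makes the 32/16 split visible; it assumes CUDA 9+ for __activemask():

    #include <cstdio>

    __global__ void partialWarps()
    {
        int warpInBlock = threadIdx.x / warpSize;
        unsigned mask = __activemask();
        if (threadIdx.x % warpSize == 0)
            printf("block %d warp %d: %d live lanes\n",
                   blockIdx.x, warpInBlock, __popc(mask));   // 32 then 16 per block
    }

    int main()
    {
        partialWarps<<<8, 48>>>();   // the launch discussed above
        cudaDeviceSynchronize();
        return 0;
    }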

2'. There is a big difference between 8 blocks of 48 threads and 64 blocks of 6 threads. Let's assume that your kernel has no divergence and each thread executes 10 instructions.

  • 8 blocks of 48 threads = 16 warps * 10 instructions = 160 warp instructions
  • 64 blocks of 6 threads = 64 warps * 10 instructions = 640 warp instructions
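
The arithmetic in the two bullets above, written out as a small host program (my sketch):

    #include <cstdio>

    int main()
    {
        const int warpSize = 32, instrPerThread = 10;
        struct Cfg { int blocks, threads; } cfgs[] = { {8, 48}, {64, 6} };
        for (const Cfg &c : cfgs) {
            int warps = c.blocks * ((c.threads + warpSize - 1) / warpSize);
            printf("%2d blocks x %2d threads -> %2d warps -> %3d warp-instructions\n",
                   c.blocks, c.threads, warps, warps * instrPerThread);
        }
        return 0;
    }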

In order to get optimal efficiency the division of work should be in multiples of 32 threads. The hardware will not coalesce threads from different warps.

3'. A GTX560 can have 8 SM * 8 blocks = 64 blocks at a time, or 8 SM * 48 warps = 512 warps, if the kernel does not max out registers or shared memory. At any given time only a portion of the work will be active on the SMs. Each SM has multiple execution units (more than just CUDA cores). Which resources are in use at any given time depends on the warp schedulers and the instruction mix of the application. If you don't do TEX operations then the TEX units will be idle. If you don't do special floating-point operations the SUFU (special function) units will be idle.
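
On newer toolkits you can query these residency limits directly. This sketch uses cudaOccupancyMaxActiveBlocksPerMultiprocessor, which was added in CUDA 6.5 (after this answer was written), with a hypothetical empty kernel:

    #include <cstdio>

    __global__ void myKernel() { }   // hypothetical kernel for illustration

    int main()
    {
        int blocksPerSM = 0;
        cudaOccupancyMaxActiveBlocksPerMultiprocessor(
            &blocksPerSM, myKernel, 48 /* threads per block */, 0 /* dynamic smem bytes */);
        printf("resident blocks per SM: %d\n", blocksPerSM);
        return 0;
    }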

4'. Parallel Nsight and the Visual Profiler show:

a. executed IPC

b. issued IPC

c. active warps per active cycle

d. eligible warps per active cycle (Nsight only)

e. warp stall reasons (Nsight only)

f. active threads per instruction executed

The profilers do not show the utilization percentage of any of the execution units. For the GTX560 a rough estimate would be IssuedIPC / MaxIPC. For MaxIPC assume GF100 (GTX480) is 2 and GF10x (GTX560) is 4, but 3 is a better target.