更新时间:2022-10-26 09:19:59
The latency to the shared/constant/texture memorys is small and depends on which device you have. In general though GPUs are designed as a throughput architecture which means that by creating enough threads the latency to the memorys, including the global memory, is hidden.
The reason the guides talk about the latency to global memory is that the latency is orders of magnitude higher than that of other memories, meaning that it is the dominent latency to be considered for optimization.
You mentioned constant cache in particular. You are quite correct that if all threads within a warp (i.e. group of 32 threads) access the same address then there is no penalty, i.e. the value is read from the cache and broadcast to all threads simultaneously. However, if threads access different addresses then the accesses must serialize since the cache can only provide one value at a time. If you're using the CUDA Profiler, then this will show up under the serialization counter.
Shared memory, unlike constant cache, can provide much higher bandwidth. Check out the CUDA Optimization talk for more details and an explanation of bank conflicts and their impact.