What happens when you run a CUDA kernel?
- AI
- Hardware
- Infrastructure
- Developer Tools
The post is a reverse-engineered tour of the CUDA launch path. It starts at the familiar `<<<>>>` syntax, then follows the kernel through the CUDA runtime, the driver, command buffers, queue metadata descriptors (QMDs), and the final doorbell signal that tells the GPU new work is ready. It also explains how warps get scheduled once execution starts. The appeal was not the usual CUDA basics like grids, blocks, and warps, but the missing middle between “I called a kernel” and “the GPU ran it.” Readers singled out the doorbell and QMD sections in particular because they connect user code to the actual submission protocol on NVIDIA hardware.
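For readers who want to see the first hop of that path, here is a minimal sketch (not taken from the post) of the same kernel launched twice: once with the `<<<>>>` syntax and once through the runtime's `cudaLaunchKernel` call, which is roughly what the chevron syntax corresponds to. The kernel `add_one`, the buffer size, and the grid and block dimensions are hypothetical choices made for illustration. Everything below the runtime call (driver, command buffers, QMDs, doorbell) is not visible at this level.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Hypothetical kernel, for illustration only: adds 1 to each element.
__global__ void add_one(int *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] += 1;
}

int main() {
    const int n = 256;  // assumed size, chosen to match 2 blocks of 128 threads
    int *d = nullptr;
    if (cudaMalloc(&d, n * sizeof(int)) != cudaSuccess) return 1;
    cudaMemset(d, 0, n * sizeof(int));

    dim3 grid(2), block(128);

    // 1) The familiar triple-chevron launch.
    add_one<<<grid, block>>>(d, n);

    // 2) The same launch through the runtime API: kernel arguments are
    //    passed as an array of pointers to each argument value.
    int n_arg = n;
    void *args[] = { &d, &n_arg };
    cudaError_t err = cudaLaunchKernel((const void *)add_one, grid, block,
                                       args, /*sharedMem=*/0, /*stream=*/nullptr);
    if (err != cudaSuccess) {
        std::printf("launch failed: %s\n", cudaGetErrorString(err));
        cudaFree(d);
        return 1;
    }

    // Launches are asynchronous; wait for both before reading results.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        std::printf("kernel failed: %s\n", cudaGetErrorString(err));
        cudaFree(d);
        return 1;
    }

    int h[n];
    cudaMemcpy(h, d, n * sizeof(int), cudaMemcpyDeviceToHost);
    std::printf("h[0] = %d, h[%d] = %d\n", h[0], n - 1, h[n - 1]);

    cudaFree(d);
    return 0;
}
```

Compiled with `nvcc`, both launches run on the default stream in order, so each element should end up incremented twice. The explicit form makes the argument packing visible, which is the part the runtime hands to the driver.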
If your team writes or tunes GPU code, this is the level of mental model that helps with debugging, profiling, and deciding when to drop below CUDA’s runtime API. It also underlines how much performance and reliability work still sits in proprietary driver behavior, which matters if you are building on NVIDIA at scale.
- fergusfinn.com
- Discuss on HN