What happens when you run a CUDA kernel?

The post is a reverse-engineered tour of the CUDA launch path. It starts at the familiar `<<<>>>` syntax, then follows the kernel through the CUDA runtime, driver, command buffers, queue metadata descriptors, and the final signal that tells the GPU new work is ready. It also explains how warps get scheduled once execution starts. The appeal was not the usual CUDA basics like grids, blocks, and warps. It was the missing middle between “I called a kernel” and “the GPU ran it.” People called out the doorbell and QMD sections in particular because they connect user code to the actual submission protocol on NVIDIA hardware.

If your team writes or tunes GPU code, this is the level of mental model that helps with debugging, profiling, and deciding when to drop below CUDA’s runtime API. It also underlines how much performance and reliability work still sits in proprietary driver behavior, which matters if you are building on NVIDIA at scale.

June 29, 2026
fergusfinn.com

Key insights

Driver API is better for real kernel work

Dropping below CUDA’s runtime API removes much of the magic that makes launches feel opaque. Using the driver API with NVRTC lets you compile and load kernels dynamically, treat them more like hot-reloadable shaders, and debug without rebuilding the whole host application. It also matters for library authors: the driver API exposes features the runtime layer does not, at the cost of a much larger surface area.

If you build GPU tooling, libraries, or fast iteration loops, evaluate the driver API instead of defaulting to the runtime API. The extra complexity can pay back quickly in debuggability and control.

Attribution:

einpoklum #1 #2
mschuetz #1

Some low-level details are documented, but not cleanly

The post’s reverse engineering is useful, but it is not operating in a total vacuum. NVIDIA’s open GPU docs include pieces like QMD formats, and one correction notes that control codes are implemented through a lookup table rather than simple control-word bits. The key takeaway is that the information exists only in fragments, so understanding the launch path still requires stitching together partial docs and observed behavior.

Use this post as a map, then cross-check the hardware-specific pieces against NVIDIA’s open GPU docs before relying on them in tooling or research. Expect edge cases and generation-specific wrinkles.

Attribution:

fooblaster #1
saagarjha #1

Kernel optimization is still too workload-specific to automate away

The hard part is not getting a kernel that works on a benchmark happy path. It is sustaining performance across ugly real shapes, odd memory layouts, quantized weights, routing logic, and hardware-specific features that a code-generating model may not have seen. That is why commenters were skeptical that an open-source optimizer or model-driven search will erase specialized kernel engineering soon, even if results on benchmarks like KernelBench keep improving.

Treat automated kernel generation as a productivity aid, not a substitute for performance engineers, if your workloads are messy or business-critical. Benchmark on your exact shapes and data layouts before assuming generated kernels are production-ready.
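
As a rough illustration of benchmarking on your own shapes, here is a minimal Python sketch of a comparison harness. Everything in it is hypothetical scaffolding rather than anything from the post or the discussion: `compare_on_shapes`, `median_seconds`, and `make_inputs` are made-up names, and the two "kernels" are plain Python functions standing in for a trusted baseline and a generated candidate. It checks correctness first, then reports a speedup per shape.

```python
import statistics
import time
from typing import Any, Callable, Dict, List, Sequence, Tuple

Shape = Tuple[int, ...]


def median_seconds(fn: Callable[..., Any], args: Sequence[Any],
                   warmup: int = 2, repeats: int = 5) -> float:
    """Median wall-clock time of fn(*args), measured after warmup runs."""
    for _ in range(warmup):
        fn(*args)
    samples = []
    for _ in range(repeats):
        start = time.perf_counter()
        fn(*args)  # for GPU code, fn must synchronize before returning
        samples.append(time.perf_counter() - start)
    return statistics.median(samples)


def compare_on_shapes(baseline: Callable[..., Any],
                      candidate: Callable[..., Any],
                      make_inputs: Callable[[Shape], Sequence[Any]],
                      shapes: Sequence[Shape]) -> List[Dict[str, Any]]:
    """Check the candidate against the baseline on every shape, then time both."""
    rows = []
    for shape in shapes:
        args = make_inputs(shape)
        correct = baseline(*args) == candidate(*args)
        t_base = median_seconds(baseline, args)
        t_cand = median_seconds(candidate, args)
        speedup = t_base / t_cand if t_cand > 0 else float("inf")
        rows.append({"shape": shape, "correct": correct, "speedup": speedup})
    return rows


# Stand-in "kernels": plain Python functions used only to exercise the harness.
def loop_sum(values):
    total = 0
    for v in values:
        total += v
    return total


def builtin_sum(values):
    return sum(values)


def make_inputs(shape: Shape) -> Sequence[Any]:
    """Build one flat integer list whose length is the product of the dims."""
    count = 1
    for dim in shape:
        count *= dim
    return (list(range(count)),)


if __name__ == "__main__":
    # Include awkward sizes, not just round benchmark-friendly ones.
    shapes = [(1,), (7, 13), (1000,), (3, 5, 17)]
    for row in compare_on_shapes(loop_sum, builtin_sum, make_inputs, shapes):
        print(row)
```

With real GPU kernels, two things would likely need to change: the wrapped callables have to wait for the device to finish before returning, because kernel launches are asynchronous and the timer would otherwise measure only the enqueue, and the exact equality check should become a tolerance-based comparison for floating-point outputs.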

Attribution:

spmurrayzzz #1 #2
einpoklum #1

GPU software bugs are survivable until the driver wedges

People with production GPU experience drew a line between ordinary app bugs and driver fragility. Buggy kernels will happen, but commenters argued unprivileged GPU code should not be able to leave memory allocated after process exit, hang in a running state, crash displays, or require watchdog resets. Others pushed back that developers often blame drivers for their own mistakes, but one practical signal stood out: large customers can get patched NVIDIA builds when the fault really is in NVIDIA’s stack.

Plan for GPU fault isolation as an engineering problem, not just a coding problem. Separate display and compute workloads where possible, build recovery paths, and factor vendor support access into platform risk.
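
One way to read "build recovery paths" is to keep each GPU job in its own child process, so that a hang takes down the job and not the service that scheduled it. The Python sketch below shows that pattern under stated assumptions: `run_isolated` and `JobResult` are hypothetical names, and the child commands are ordinary Python one-liners standing in for a real GPU workload.

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class JobResult:
    status: str                # "ok", "failed", or "timeout"
    returncode: Optional[int]  # None when the job was killed on timeout
    stdout: str


def run_isolated(argv: List[str], timeout_s: float) -> JobResult:
    """Run one job in a child process and bound how long it may take."""
    try:
        done = subprocess.run(argv, capture_output=True, text=True,
                              timeout=timeout_s)
    except subprocess.TimeoutExpired:
        # subprocess.run kills the child and waits for it before raising.
        # Assumption: the driver then releases the job's resources. A wedged
        # driver may still need a device reset, which this sketch does not do.
        return JobResult("timeout", None, "")
    status = "ok" if done.returncode == 0 else "failed"
    return JobResult(status, done.returncode, done.stdout)


if __name__ == "__main__":
    healthy = [sys.executable, "-c", "print('done')"]
    stuck = [sys.executable, "-c", "import time; time.sleep(60)"]
    print(run_isolated(healthy, timeout_s=30))
    print(run_isolated(stuck, timeout_s=1))
```

The parent can then decide whether to retry, move the job to another device, or alert an operator. Process isolation only covers the failure modes the driver cleans up after on process exit, which is exactly the boundary the commenters were arguing about.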

Attribution:

connicpu #1 #2
david-gpu #1
Athas #1

Against the grain

CUDA’s hidden synchronization is a feature

Hiding much of the launch and synchronization complexity is not just hand-wavy abstraction. It gives most developers a sane default stream model where commands are synchronized implicitly, and it makes explicit parallel command management an opt-in choice through streams. Compared with Vulkan’s expose-everything approach, that tradeoff looks pragmatic rather than limiting.

Do not rush to lower-level APIs just because they are more transparent. If your bottleneck is developer time rather than a missing capability, CUDA’s defaults may be the right abstraction boundary.

Attribution:

mschuetz #1

Competing stacks may waste even more time

Complaints about NVIDIA bugs landed, but one blunt point cut through: alternative GPU software stacks often consume even more engineering time. The practical comparison is not against an ideal platform. It is against other immature or fragmented toolchains that may be worse on both features and stability.

When evaluating non-CUDA options, budget migration and debugging cost explicitly instead of assuming openness or portability will reduce operational pain. The incumbent can still be the least bad choice.

Attribution:

saagarjha #1

In plain English

API

Application Programming Interface, the defined way software components call into a library or service.

CUDA

Compute Unified Device Architecture, NVIDIA’s platform and programming model for running general-purpose code on its GPUs.

GPU

Graphics Processing Unit, a processor designed for highly parallel work and now widely used for AI and scientific computing.

HPC

High Performance Computing, the use of powerful computers and parallel processing for large technical workloads.

kernel

A function that runs on the graphics processing unit rather than on the main central processing unit.

NVRTC

NVIDIA Runtime Compilation library, which compiles CUDA source code into GPU code while a program is running.

QMD

Queue Metadata Descriptor, a hardware data structure that describes a unit of work submitted to an NVIDIA GPU.

quantized weights

Model weights stored in reduced numerical precision to save memory and improve speed, often with some loss of accuracy.

Vulkan

A low-level graphics and compute API that gives developers explicit control over GPU work submission and synchronization.

watchdog

A monitoring mechanism that detects a stuck component and resets it to recover service.

Reference links

Hardware documentation and low-level references

NVIDIA open GPU docs QMD header
Cited as public documentation for method and QMD format details referenced by the post.

CUDA driver API and runtime compilation examples

NVIDIA cuda-samples vectorAdd_nvrtc
Shared as a raw example of using the CUDA driver API with runtime compilation for better visibility into kernel loading and launch flow.
cuda-api-wrappers vectorAdd_nvrtc example
Shared as a more readable modern C++ wrapper around the same driver API and NVRTC flow.

Kernel optimization benchmarks

KernelBench
Used to ground the claim that models are improving at kernel optimization but still struggle on harder cases and robust performance.
