NUMA: Cores, memory, and the distance between them
- Infrastructure
- Hardware
- Performance
- Developer Tools
The post is an introductory walkthrough of NUMA, or non-uniform memory access, which shows why a server can look like one big machine while acting like several smaller ones glued together. A core can reach local memory quickly and remote memory slowly, so thread placement and memory placement become part of performance. Readers treated that as old news in theory but very current in practice.
The strongest signal was that NUMA still bites modern software stacks that are supposed to hide hardware details. People described Go services on huge Kubernetes nodes burning CPU and spiking latency until GOMAXPROCS was capped, Redis falling over once its working set spilled past one NUMA domain, and MySQL benchmarks getting skewed by Linux page cache behavior even when the database and client were pinned.
Another important addition was that NUMA is not just about RAM. PCIe devices like NICs have locality too, so a thread on one socket talking to a device hanging off another can cut throughput in half or worse. That matters for networking, storage, and virtualization, which is where several people had seen the nastiest failures.
A smaller but sharp argument was that this is partly a product choice problem. Some people would rather avoid big NUMA boxes entirely and buy single-socket systems or partition machines more deliberately, because once the hardware crosses certain size boundaries the default scheduler and runtime heuristics turn into an invisible tax.
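To make the locality points above concrete, here is a minimal sketch of how topology can be inspected on Linux, assuming the usual sysfs layout (`/sys/devices/system/node/node*/cpulist` for CPUs per node and `/sys/class/net/<dev>/device/numa_node` for a NIC's node). It is not taken from the post or the discussion; the function names are hypothetical, and the parser only handles the plain `a-b,c` form that the kernel normally prints.

```python
from __future__ import annotations

from pathlib import Path


def parse_cpulist(text: str) -> set[int]:
    """Expand a kernel cpulist string such as "0-3,8-11" into CPU ids."""
    cpus: set[int] = set()
    for part in text.strip().split(","):
        if not part:
            continue  # empty string means "no CPUs" (memory-only node)
        if "-" in part:
            lo, hi = part.split("-", 1)
            cpus.update(range(int(lo), int(hi) + 1))
        else:
            cpus.add(int(part))
    return cpus


def numa_nodes(sysfs: Path = Path("/sys/devices/system/node")) -> dict[int, set[int]]:
    """Map NUMA node id -> CPUs on that node; empty if sysfs has no node info."""
    nodes: dict[int, set[int]] = {}
    for entry in sysfs.glob("node[0-9]*"):
        cpulist = entry / "cpulist"
        if cpulist.is_file():
            nodes[int(entry.name[len("node"):])] = parse_cpulist(cpulist.read_text())
    return nodes


def device_node(netdev: str) -> int | None:
    """NUMA node a network device hangs off, or None if unknown.

    The kernel reports -1 when it has no locality information.
    """
    path = Path("/sys/class/net") / netdev / "device" / "numa_node"
    try:
        node = int(path.read_text().strip())
    except (OSError, ValueError):
        return None
    return node if node >= 0 else None


if __name__ == "__main__":
    for node, cpus in sorted(numa_nodes().items()):
        print(f"node {node}: {len(cpus)} CPUs")
    # "eth0" is a placeholder interface name, not something from the post.
    print("eth0 is on node", device_node("eth0"))
```

If the NIC's node differs from the node whose CPUs run the network-heavy threads, that is the cross-socket case the commenters warned about.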
If you run on large multi-socket or many-core boxes, treat topology as a production concern, not a tuning detail. Pin CPU, memory, and PCIe-heavy work together, and be suspicious of any runtime or orchestrator that claims to abstract the hardware away.
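As a rough illustration of "pin work together", the sketch below restricts the current process to the CPUs of one NUMA node using `os.sched_setaffinity`, which is Linux-only. The helper names are hypothetical and the node's CPU set is assumed to come from elsewhere (for example sysfs). This pins CPUs only; it relies on Linux's default behavior of allocating memory on the node where a thread first touches it, so explicit memory binding would need something like `numactl --cpunodebind=0 --membind=0 <command>` instead.

```python
from __future__ import annotations

import os


def choose_cpus(allowed: set[int], node_cpus: set[int]) -> set[int]:
    """CPUs that are both permitted for this process and local to the node.

    Falls back to the full allowed set when there is no overlap, for example
    when a container's cpuset excludes the requested node entirely.
    """
    local = allowed & node_cpus
    return local or set(allowed)


def pin_to_node(node_cpus: set[int]) -> set[int]:
    """Pin the calling process to one node's CPUs and return the set used."""
    allowed = os.sched_getaffinity(0)  # 0 means "this process"
    target = choose_cpus(allowed, node_cpus)
    os.sched_setaffinity(0, target)
    return target


if __name__ == "__main__":
    # Assumed example: node 0 owns CPUs 0-7. Real values depend on the machine.
    used = pin_to_node(set(range(8)))
    # Sizing a worker pool to the pinned CPUs is the same idea as capping
    # GOMAXPROCS in the Go services mentioned above.
    print(f"pinned to {sorted(used)}; suggested worker count: {len(used)}")
```

Intersecting with the already-allowed CPUs matters under orchestrators, where the process may only be permitted a subset of the machine.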
- edera.dev
- Discuss on HN