NUMA: Cores, memory, and the distance between them

Infrastructure
Hardware
Performance
Developer Tools

The post is an introductory walkthrough of NUMA, or non-uniform memory access, which shows why a server can look like one big machine while acting like several smaller ones glued together. A core can reach local memory quickly and remote memory slowly, so thread placement and memory placement become part of performance. Readers treated that as old news in theory but very current in practice.

The strongest signal was that NUMA still bites modern software stacks that are supposed to hide hardware details. People described Go services on huge Kubernetes nodes burning CPU and spiking latency until GOMAXPROCS was capped, Redis falling over once its working set spilled past one NUMA domain, and MySQL benchmarks getting skewed by Linux page cache behavior even when the database and client were pinned.

Another important addition was that NUMA is not just about RAM. PCIe devices like NICs have locality too, so a thread on one socket talking to a device hanging off another can cut throughput in half or worse. That matters for networking, storage, and virtualization, which is where several people had seen the nastiest failures.

A smaller but sharp argument was that this is partly a product choice problem. Some people would rather avoid big NUMA boxes entirely and buy single-socket systems or partition machines more deliberately, because once the hardware crosses certain size boundaries the default scheduler and runtime heuristics turn into an invisible tax.

If you run on large multi-socket or many-core boxes, treat topology as a production concern, not a tuning detail. Pin CPU, memory, and PCIe-heavy work together, and be suspicious of any runtime or orchestrator that claims to abstract the hardware away.

June 29, 2026
edera.dev

Key insights

Runtime defaults break on giant boxes

On large servers, software that scales work across every visible CPU can accidentally spray threads and memory across NUMA domains and spend its time paying remote access and garbage collection costs. The concrete fixes people reached for were blunt ones like lowering Go's GOMAXPROCS or running one Redis instance per NUMA node, which says a lot about how weak the default abstractions still are.

Do not assume more visible cores means more usable cores. Benchmark with explicit CPU limits and consider sharding services by NUMA node before you chase algorithmic optimizations.
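
As a rough illustration of the "one instance per NUMA node" idea, the sketch below reads the per-node CPU lists that Linux exposes under /sys/devices/system/node and builds one launch plan per node, with GOMAXPROCS capped at that node's CPU count. The helper names (parse_cpulist, read_node_cpulists, plan_shards) and the plan layout are hypothetical, chosen for this example rather than taken from the discussion.

```python
import glob
import os


def parse_cpulist(text):
    """Expand a Linux cpulist string such as "0-3,8-11" into sorted CPU ids."""
    cpus = set()
    for part in text.strip().split(","):
        if not part:
            continue  # an empty string means the node has no CPUs
        if "-" in part:
            first, last = part.split("-")
            cpus.update(range(int(first), int(last) + 1))
        else:
            cpus.add(int(part))
    return sorted(cpus)


def read_node_cpulists(root="/sys/devices/system/node"):
    """Read {node id: cpulist string} from Linux sysfs; empty dict if absent."""
    nodes = {}
    for path in glob.glob(os.path.join(root, "node[0-9]*", "cpulist")):
        node_dir = os.path.basename(os.path.dirname(path))
        with open(path) as f:
            nodes[int(node_dir[len("node"):])] = f.read()
    return nodes


def plan_shards(node_cpulists):
    """Build one launch plan per NUMA node that actually has CPUs."""
    plans = []
    for node_id in sorted(node_cpulists):
        cpus = parse_cpulist(node_cpulists[node_id])
        if not cpus:
            continue  # memory-only node: nothing to pin a process to
        plans.append({
            "node": node_id,
            "cpus": cpus,
            # The Go runtime reads GOMAXPROCS from the environment at startup.
            "env": {"GOMAXPROCS": str(len(cpus))},
        })
    return plans


if __name__ == "__main__":
    for plan in plan_shards(read_node_cpulists()):
        print(plan["node"], plan["cpus"], plan["env"])
```

A launcher could apply each plan by calling os.sched_setaffinity(0, plan["cpus"]) in the child process before exec and passing plan["env"] along. CPU affinity alone does not bind memory to the node, so a stricter setup would also need a memory policy, for example through numactl or a cpuset cgroup.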

Attribution:

lukax #1
strifey #1
0x457 #1

The operating system can sabotage clean pinning

Pinning a process to a NUMA node does not guarantee all of its code and data stay local. One report had MySQL benchmarks skewed because libmysql landed in page cache on the node where the client first ran, so later runs reached across sockets for a supposedly local dependency. The follow-on claim that code should be remapped into anonymous hugepage-backed memory was challenged, but the core lesson held up: locality leaks through shared kernel mechanisms you may not be measuring.

When results look inconsistent across identical pinning setups, inspect page cache and shared mappings before blaming the application. Treat binaries, libraries, and file-backed memory as part of your locality model.
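
One way to inspect shared mappings is /proc/<pid>/numa_maps, which lists each mapping with per-node page counts in fields of the form N<node>=<pages>. The sketch below parses that text and reports file-backed mappings that have pages on a node other than the one the process is pinned to. The function names and the sample lines are made up for illustration; only the field format comes from the kernel.

```python
import re
from collections import defaultdict

# Matches per-node page counts such as "N0=20" or "N1=100".
_NODE_PAGES = re.compile(r"^N(\d+)=(\d+)$")


def parse_numa_maps(text):
    """Return {file path: {node id: pages}} for file-backed mappings only."""
    per_file = defaultdict(lambda: defaultdict(int))
    for line in text.splitlines():
        fields = line.split()
        path = next(
            (f[len("file="):] for f in fields if f.startswith("file=")), None
        )
        if path is None:
            continue  # anonymous memory, heap, or stack: not file-backed
        for field in fields:
            match = _NODE_PAGES.match(field)
            if match:
                per_file[path][int(match.group(1))] += int(match.group(2))
    return {path: dict(nodes) for path, nodes in per_file.items()}


def remote_file_pages(per_file, home_node):
    """Return {file path: pages resident on nodes other than home_node}."""
    report = {}
    for path, nodes in per_file.items():
        remote = sum(p for node, p in nodes.items() if node != home_node)
        if remote:
            report[path] = remote
    return report


# Hypothetical sample in the numa_maps layout, not captured from a real host.
SAMPLE = """\
7f1c2a000000 default file=/usr/lib/libexample.so.1 mapped=120 N0=20 N1=100 kernelpagesize_kB=4
7f1c2b000000 default anon=50 dirty=50 N0=50 kernelpagesize_kB=4
"""
```

On a real host the input would come from reading /proc/<pid>/numa_maps for the benchmark process. A library that shows most of its pages on the wrong node after pinning is a hint that page cache placement, not the application, explains the inconsistent runs.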

Attribution:

Twirrim #1
jeffbee #1
nly #1

NUMA hits PCIe and virtual networking too

The performance penalty is not limited to RAM latency. If the CPU doing the work is far from the NIC or other PCIe device, throughput can collapse, and virtual networking layers can make placement even more fragile. The cited examples were dramatic enough to matter operationally, with 10 Gbps links dropping to 5 Gbps and 100 Gbps links falling to 20 Gbps when work ran on the wrong node.

For networking and storage workloads, map threads, queues, interrupts, and devices to the same node during deployment. If performance swings wildly under virtualization, check device locality before tuning the network stack.
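
A minimal locality check, assuming a Linux host with a PCI-attached NIC: sysfs reports the device's NUMA node in /sys/class/net/<iface>/device/numa_node, with -1 meaning the platform gave no locality information. The sketch compares that node with the CPUs a worker is pinned to. The function names and the node-to-CPU mapping passed in are assumptions for the example.

```python
import os


def read_nic_node(iface, root="/sys/class/net"):
    """Return the NUMA node sysfs reports for a NIC, or -1 if it is unknown."""
    path = os.path.join(root, iface, "device", "numa_node")
    try:
        with open(path) as f:
            return int(f.read().strip())
    except OSError:
        return -1  # virtual interface or non-Linux host: no locality info


def misplaced_cpus(device_node, worker_cpus, node_cpus):
    """Return the worker CPUs that are not on the device's NUMA node.

    node_cpus maps node id -> list of CPU ids. A negative device_node means
    locality is unknown, so nothing is flagged.
    """
    if device_node < 0:
        return []
    local = set(node_cpus.get(device_node, []))
    return [cpu for cpu in worker_cpus if cpu not in local]
```

A deployment check might call misplaced_cpus(read_nic_node("eth0"), pinned_cpus, node_cpus) and fail when the list is non-empty, where "eth0" stands in for the real interface name. The same comparison could be applied to interrupt affinity read from /proc/irq/<n>/smp_affinity_list.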

Attribution:

treesknees #1
suprjami #1
alexzenla #1 #2
frollogaston #1

Topology awareness is still an open optimization problem

The complaint that NUMA should have been solved twenty years ago ran into a simpler reality. The machine topology, memory layout, socket count, and I/O attachment create a broad placement problem with no single right answer for every workload. Kernel visibility into topology helps, but it does not turn scheduling into a solved problem for runtimes and orchestrators.

Expect partial automation, not magic. If your product runs on varied server shapes, build topology-aware testing into release and capacity planning instead of trusting generic scheduler behavior.
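
To make "varied server shapes" concrete in test results, one option is to tag each benchmark run with a summary of the node distance table Linux exposes in /sys/devices/system/node/node<N>/distance, where 10 is the conventional value for local access and larger numbers mean farther nodes. The sketch below works on the text of those files; the function names and label format are hypothetical.

```python
def parse_distances(rows):
    """Turn {node id: "10 21"} (one sysfs distance file per node) into ints."""
    return {node: [int(x) for x in text.split()] for node, text in rows.items()}


def max_remote_distance(distances):
    """Largest node-to-node distance, ignoring each node's distance to itself.

    Assumes node ids are contiguous from 0, so column i refers to node i.
    """
    remote = [
        value
        for node, row in distances.items()
        for column, value in enumerate(row)
        if column != node
    ]
    return max(remote) if remote else None


def topology_label(distances):
    """Short label for tagging benchmark results by machine shape."""
    worst = max_remote_distance(distances)
    if worst is None:
        return f"nodes={len(distances)} uniform"
    return f"nodes={len(distances)} max_remote={worst}"
```

Grouping results by such a label makes it easier to notice a regression that only appears on, say, four-node machines, instead of averaging it away across the fleet.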

Attribution:

jpecar #1
mickeyp #1

Against the grain

The article itself may be unreliable

The harshest criticism was not about NUMA at all but about the writeup's wording. One reader called out examples that treated symmetric cases as if they were distinct and took that as evidence of LLM-generated filler. Even if the underlying topic is real, sloppy explanations make it harder for readers to build a correct mental model.

If you plan to share this internally as educational material, sanity check the examples first. On low-level performance topics, vague wording can create more confusion than no primer at all.

Attribution:

senderista #1

Avoiding big NUMA machines is often the better move

A practical counterpoint was that many teams should stop tolerating giant shared-memory boxes in the first place. Single-socket systems, Graviton-style designs with flatter latency, or stricter partitioning can be easier to operate than endlessly teaching every layer of the stack about NUMA. Others pushed back that even single sockets can hide NUMA behavior on modern AMD and Intel servers, so the escape hatch is real but not universal.

When buying infrastructure, include topology simplicity in the decision instead of focusing only on peak core count. In some cases the cheapest performance win is choosing hardware that makes bad placement harder.

Attribution:

jeffbee #1 #2
toast0 #1
frollogaston #1

In plain English

AMD

Advanced Micro Devices, a major CPU vendor.

GOMAXPROCS

A Go runtime setting that limits how many operating system threads can execute Go code at the same time.

Graviton

Amazon Web Services server processors based on the Arm architecture.

Intel

A major CPU vendor best known for x86 processors.

Kubernetes

A system for deploying and managing containers across servers.

libmysql

A client library used by programs to talk to MySQL.

LLM

Large language model, an AI system trained on large amounts of text that can generate and transform language and code.

MySQL

A widely used open source relational database management system.

NIC

Network interface card, the hardware device that connects a server to a network.

NUMA

Non-uniform memory access, a hardware design where different CPU cores reach different parts of memory with different latency and bandwidth.

PCIe

Peripheral Component Interconnect Express, the high-speed hardware bus used to connect devices like network cards and storage controllers to a server.

Redis

An in-memory data store often used as a cache, queue, or fast database.

Reference links

Humor and cultural references

NUMA NUMA video
Posted as a joke about the title rather than as technical context.
Second NUMA NUMA video
A follow-up joke extending the same reference.
