The post is a hands-on writeup of getting Gemma 4 26B-A4B running on an old single-socket Xeon E5-2620 v4 system with 128 GB of RAM and no GPU. The author used an ik_llama.cpp fork, Gemma drafter models, and a pile of low-level flags to squeeze usable speed out of hardware most people would treat as e-waste. After people pushed for real numbers, the author shared a noisy benchmark showing about 11.9 tokens per second while the machine was also doing other server work, with a claim that unloaded performance can reach around 20 tokens per second. That puts it in “read along with the output” territory, not “feels like ChatGPT” territory.
Cheap, secondhand servers are becoming viable edge AI boxes for narrow or offline workloads, which chips away at the assumption that every useful model interaction must flow through expensive cloud APIs.
Commenters were impressed and energized by the hack, but grounded about its limits. They liked the proof that local inference on recycled hardware is viable, while repeatedly stressing that power efficiency, prompt latency, and overall interactivity still make GPUs or hosted models the practical choice for many workloads.
01 Prompt processing and token generation are different bottlenecks, and this demo only looked good because the prompt was tiny.
For real coding or document tasks with hundreds or thousands of input tokens, prefill speed matters a lot more, and that is where old CPU-only systems fall over first even if generation speed feels readable once the model starts talking.
Readable decode speed does not mean good end-to-end UX. Long prompts are where these machines stop feeling viable.
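A rough way to see why long prompts hurt is to add the two phases together: time spent reading the prompt plus time spent writing the reply. The sketch below does that arithmetic. The 11.9 tokens per second decode rate is the loaded figure from the post; the prefill rate and the 300-token reply length are hypothetical placeholders, since the post reported no prefill measurement.

```python
def response_time_s(prompt_tokens, output_tokens, prefill_tps, decode_tps):
    """Split a request into prefill time, decode time, and their sum (seconds)."""
    prefill_s = prompt_tokens / prefill_tps   # model reads the whole prompt first
    decode_s = output_tokens / decode_tps     # then emits the reply token by token
    return prefill_s, decode_s, prefill_s + decode_s


DECODE_TPS = 11.9    # loaded benchmark figure quoted in the post
PREFILL_TPS = 40.0   # ASSUMPTION: hypothetical value, not measured in the post
REPLY_TOKENS = 300   # ASSUMPTION: illustrative reply length

for prompt_tokens in (50, 2000, 8000):
    prefill_s, decode_s, total_s = response_time_s(
        prompt_tokens, REPLY_TOKENS, PREFILL_TPS, DECODE_TPS
    )
    print(
        f"{prompt_tokens:>5} prompt tokens: "
        f"{prefill_s:6.1f}s before the first token, "
        f"{decode_s:5.1f}s generating, {total_s:6.1f}s total"
    )
```

Under these assumed rates the decode time stays fixed while the wait before the first token grows linearly with prompt length, which is the pattern the comment describes. Swap in measured rates to get real numbers.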
02 Memory bandwidth is the whole game here, not compute in the usual desktop sense.
Old Xeon platforms can win surprisingly often because fully populated quad-channel or dual-socket memory gives them more usable bandwidth and capacity than many newer consumer boxes, which is exactly what large quantized models and mixture-of-experts inference want.
For local LLMs, cheap RAM channels can matter more than shiny cores. The best bargain box may look more like an old workstation than a new desktop.
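The bandwidth argument can be turned into a back-of-envelope ceiling: during generation, every active weight has to be read from RAM once per token, so tokens per second cannot exceed memory bandwidth divided by bytes read per token. The sketch below assumes four populated DDR4-2133 channels, roughly 4 billion active parameters (inferred from the "A4B" in the model name), and about 4.5 bits per weight. None of these figures come from the post, and the estimate ignores KV-cache reads, cache effects, and compute limits.

```python
def peak_bandwidth_gbs(channels, mega_transfers_per_s, bus_bytes=8):
    """Theoretical peak DRAM bandwidth in GB/s (each channel is 64 bits wide)."""
    return channels * mega_transfers_per_s * bus_bytes / 1000  # MB/s -> GB/s


def decode_ceiling_tps(bandwidth_gbs, active_params_billions, bits_per_weight):
    """Upper bound on tokens/s if each active weight is read once per token."""
    gb_per_token = active_params_billions * bits_per_weight / 8
    return bandwidth_gbs / gb_per_token


# ASSUMPTIONS: quad-channel DDR4-2133, ~4B active parameters, ~4.5 bits/weight.
bandwidth = peak_bandwidth_gbs(channels=4, mega_transfers_per_s=2133)
ceiling = decode_ceiling_tps(bandwidth, active_params_billions=4.0, bits_per_weight=4.5)

print(f"peak bandwidth: {bandwidth:.1f} GB/s")
print(f"decode ceiling: {ceiling:.1f} tokens/s")
```

With these assumed inputs the ceiling lands near 30 tokens per second, so the reported 12 to 20 is plausible for a bandwidth-bound workload. It also shows why a mixture-of-experts model suits this hardware: only the active parameters are read per token, not all 26B.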
03 The economics only work if you value reuse, privacy, or idle hardware more than efficiency.
Once power draw enters the picture, many of these boxes look bad against a modern mini PC, a GPU workstation, or simply paying for hosted inference, especially if the server idles high or lives in a loud rack chassis.
Old servers are cheap to buy, not cheap to run. Power and noise can erase the bargain fast.
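To put a number on "not cheap to run", electricity per million generated tokens follows directly from wall power and throughput. The wattage and the price per kWh below are hypothetical placeholders, not figures from the thread; only the two throughput values come from the post.

```python
def electricity_per_million_tokens(watts, tokens_per_s, price_per_kwh):
    """Return (kWh, cost) to generate one million tokens at a steady rate."""
    seconds = 1_000_000 / tokens_per_s
    kwh = watts * seconds / 3_600_000  # watt-seconds (joules) -> kWh
    return kwh, kwh * price_per_kwh


WATTS = 150.0         # ASSUMPTION: hypothetical draw under load
PRICE_PER_KWH = 0.30  # ASSUMPTION: hypothetical electricity price

for tokens_per_s in (11.9, 20.0):  # loaded and claimed unloaded rates from the post
    kwh, cost = electricity_per_million_tokens(WATTS, tokens_per_s, PRICE_PER_KWH)
    print(f"{tokens_per_s:>5} tok/s: {kwh:.2f} kWh, cost {cost:.2f} per million tokens")
```

This counts only the time spent generating. A server that idles at high wattage around the clock adds a fixed cost on top, which is the part commenters flagged as the real bargain killer.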
04 There is a real product gap between DIY local inference and cloud APIs.
Rising enterprise AI bills, privacy constraints in fields like medicine and law, and the availability of cheap used workstations point toward turnkey on-prem LLM appliances as a plausible business, even if the underlying hardware is unglamorous.
The opportunity is not selling old Xeons. It is packaging local AI so buyers never have to think about old Xeons.
01 Local coding models are still nowhere near good enough for demanding work.
Small local models can look competent on easy refactors, but on harder tasks they produce plausible junk and waste more time than they save, which makes the cloud price premium rational for anyone who depends on quality.
Cheap local inference is not the same as useful local inference. Capability gaps still dominate many professional workflows.
02 The disruption story is overstated because running the model is only one slice of the value stack.
Hosted AI still bundles convenience, maintenance, integrations, uptime, and operational labor, so local inference may commoditize part of the market without collapsing the providers that package everything around it.
Inference can commoditize before the product does. Cloud vendors still own a lot beyond raw tokens.
03 Twelve tokens per second is a fun demo, not a serious interactive system.
When throughput and latency are this marginal, saying “it works” risks confusing technical possibility with operational usefulness.
Possible is not the same as practical. The benchmark is better viewed as a proof of concept than a replacement.