Ask HN: Has anyone replaced Claude/GPT with a local model for daily coding?

AI
Developer Tools
Open Source
Hardware
Privacy

The post asked a simple question and got a very specific answer: yes, people are replacing Claude or GPT with local models for real coding work, but almost nobody pretends the tradeoff disappears. The setups that kept coming up were llama.cpp plus a coding harness like Pi, OpenCode, or a custom wrapper, usually driving Qwen 3.6 in either the 27B dense model or the 35B A3B mixture-of-experts variant. Hardware ranged from high-memory Macs and Strix Halo laptops to dual-3090 workstations, but the pattern was consistent. Local coding is now fast enough and capable enough to be genuinely useful, especially for privacy-sensitive work, personal projects, repetitive implementation, codebase search, shell tasks, and tightly scoped refactors.

If you care about privacy, predictable cost, or offline use, a local stack is already viable for bounded coding tasks and internal tooling. If your work depends on architecture judgment, long messy contexts, or unattended agents, keep a frontier model in the loop and treat local models as a second lane, not a drop-in swap.

June 15, 2026
news.ycombinator.com

Discussion mood

Cautiously positive. People are impressed that local coding models are now genuinely productive, especially Qwen 3.6 on decent hardware, but the prevailing mood is still that frontier cloud models are smarter, more reliable, and better for hard design work. Privacy, cost control, and ownership are the main drivers for going local.

Key insights

Caching and harness details dominate usability

A lot of the pain people blame on local models is really execution-layer friction. Qwen 3.6 improved because it can preserve reasoning between turns, which reduces full context reprocessing in llama.cpp, but harnesses can still sabotage caching by mutating the system prompt every turn or mishandling tool traces. That means two people can run the same model and have very different experiences depending on chat template settings, cache behavior, and the harness itself.

Treat the harness and inference engine as part of the model choice. Before judging a local stack, verify prompt caching, stable system prompts, and reasoning preservation settings like preserve_thinking.
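One rough way to check whether a harness keeps its prompt stable is to log the full prompt it sends on consecutive turns and measure how much of the earlier prompt survives as an unchanged prefix. The sketch below is a character-level proxy only: real prefix caching works on tokens, and the function names and sample prompts are hypothetical, not taken from any harness or from llama.cpp.

```python
def shared_prefix_len(a: str, b: str) -> int:
    """Count the leading characters two prompts have in common."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_reuse_ratio(prev_prompt: str, next_prompt: str) -> float:
    """Fraction of the previous prompt that the next prompt repeats verbatim as a prefix."""
    if not prev_prompt:
        return 0.0
    return shared_prefix_len(prev_prompt, next_prompt) / len(prev_prompt)


# Hypothetical logged prompts. The stable harness only appends to the prompt.
stable_1 = "SYSTEM: You are a coding agent.\nUSER: list files\n"
stable_2 = stable_1 + "ASSISTANT: ls\nUSER: open main.py\n"

# The unstable harness rewrites a timestamp inside the system prompt every turn,
# so everything after the timestamp would need to be reprocessed.
unstable_1 = "SYSTEM: time=10:00 You are a coding agent.\nUSER: list files\n"
unstable_2 = "SYSTEM: time=10:01 You are a coding agent.\nUSER: list files\nASSISTANT: ls\n"

stable_ratio = prefix_reuse_ratio(stable_1, stable_2)
unstable_ratio = prefix_reuse_ratio(unstable_1, unstable_2)
print(f"stable harness reuse:   {stable_ratio:.2f}")
print(f"unstable harness reuse: {unstable_ratio:.2f}")
```

A ratio near 1.0 across turns suggests the earlier context could be served from cache. A ratio that collapses every turn points at the harness, not the model.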

Attribution:

lambda #1 #2
LoganDark #1

Speed metrics hide the real performance tradeoff

Higher decode speed did not reliably produce faster task completion. Several people found that heavier quantization or faster mixture-of-experts models created more loops and edit mistakes, while slower setups with better KV cache settings or denser models finished real tasks sooner. One user also pointed out that adding GPUs often buys context room, not more tokens per second, so buying hardware for benchmark numbers can miss the point.

Benchmark on end-to-end coding tasks, not tokens per second. When tuning a local setup, spend time on quantization, KV cache precision, and model choice before spending money chasing raw throughput.
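The arithmetic behind that advice can be sketched in a few lines. Every number below is hypothetical and chosen only to illustrate the shape of the tradeoff. The model also ignores prompt processing time, which would add to each attempt.

```python
from dataclasses import dataclass


@dataclass
class Setup:
    name: str
    decode_tok_per_s: float   # raw generation speed
    tokens_per_attempt: int   # tokens generated in one attempt at the task
    attempts: int             # attempts needed, counting loops and failed edits


def task_seconds(s: Setup) -> float:
    """Wall-clock generation time to finish the task, not tokens per second."""
    return s.attempts * s.tokens_per_attempt / s.decode_tok_per_s


# Hypothetical comparison: the faster setup needs more retries.
fast = Setup("fast, heavily quantized", decode_tok_per_s=60.0, tokens_per_attempt=3000, attempts=4)
slow = Setup("slower, higher precision", decode_tok_per_s=25.0, tokens_per_attempt=3000, attempts=1)

for s in (fast, slow):
    print(f"{s.name}: {task_seconds(s):.0f} s to a finished task")
```

With these assumed numbers, the setup that decodes more than twice as fast still finishes later, because retries multiply the work.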

Attribution:

girvo #1
electronsoup #1
horsawlarway #1

Local models reward spec-driven workflows

The people getting strong results are not asking these models to discover the problem for them. They decompose work into small steps, point to exact files, state architecture constraints explicitly, and restart sessions often. That turns the model into a precise code search and transformation tool, which is where it shines. Leave goals vague and it reaches for the quickest hack, not the right design.

If you want local models to pay off, tighten your development process first. Clear specs, smaller tasks, and explicit constraints are not optional overhead here; they are the operating model.
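A minimal sketch of what a small, explicit spec might look like when rendered into a prompt. The TaskSpec class, its field names, and the example file paths are all hypothetical, not part of Pi, OpenCode, or any other harness.

```python
from dataclasses import dataclass, field


@dataclass
class TaskSpec:
    goal: str                                             # one small, concrete step
    files: list[str]                                      # exact files the model may edit
    constraints: list[str] = field(default_factory=list)  # architecture rules stated up front
    done_when: str = ""                                   # how to tell the step is finished

    def to_prompt(self) -> str:
        """Render the spec as a plain-text prompt for a fresh session."""
        lines = [f"Goal: {self.goal}", "Files to edit (touch nothing else):"]
        lines += [f"- {path}" for path in self.files]
        if self.constraints:
            lines.append("Constraints:")
            lines += [f"- {c}" for c in self.constraints]
        if self.done_when:
            lines.append(f"Done when: {self.done_when}")
        return "\n".join(lines)


spec = TaskSpec(
    goal="Add an in-memory LRU cache in front of load_user()",
    files=["src/cache.py", "src/users.py"],
    constraints=["No new dependencies", "Keep load_user()'s signature unchanged"],
    done_when="tests/test_users.py passes",
)
prompt = spec.to_prompt()
print(prompt)
```

Keeping one spec per step also makes it cheap to restart the session, since everything the model needs is in the prompt rather than in accumulated chat history.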

Attribution:

Greenpants #1 #2
amelius #1

Privacy is not a side benefit

For several people, local inference is not a cost optimization experiment. It is the only acceptable way to use these tools on employer code or sensitive work when policies are unclear or trust in vendors is low. That changes the comparison completely, because a weaker model that never leaves the machine can still be the rational choice if the alternative is a policy violation or a data leak risk.

If your organization has unresolved AI governance or customer data constraints, local models can unlock workflows that cloud tools simply cannot. Evaluate them against your compliance boundary, not just against Claude on raw capability.

Attribution:

pierotofy #1
Greenpants #1 #2

Pi emerged as the default local harness

Pi was the most consistently recommended agent layer because it has a usable default experience, works with local servers, and can be extended without fighting the tool. People contrasted that with OpenCode being more manual to configure for local inference and with other harnesses missing basics like context management or MCP support. The repeated recommendation was not that Pi is magical, just that it gets enough right to stop being the bottleneck.

If you are testing local coding seriously, start with a harness that already has mindshare and working recipes. Pi appears to be the shortest path to a representative result.

Attribution:

horsawlarway #1
Insanity #1
coder543 #1

Hybrid planning is the most credible workflow

The strongest practical pattern was to use a frontier model for planning, architecture, or validation, then let a local model execute scoped implementation work. That is not a cop-out. It is a stable division of labor that matches the current capability gap. People doing production C, C++, Python, and web work reported success with exactly this split, especially when they needed privacy or wanted to keep token costs low during long implementation runs.

Do not frame the decision as all-local or all-cloud. A two-tier workflow can cut spend and data exposure without giving up the stronger reasoning of frontier models where it still matters.
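The split can be sketched as a planner that produces steps and an executor that carries them out. Both models are stand-in functions here; run_two_tier, fake_frontier, and fake_local are hypothetical names, and a real setup would replace the stubs with calls to a hosted API and a local server.

```python
from typing import Callable

Model = Callable[[str], str]  # takes a prompt, returns text


def run_two_tier(task: str, planner: Model, executor: Model) -> list[str]:
    """Ask the planner for one step per line, then run each step on the executor."""
    plan = planner(f"Break this task into small, independent steps, one per line:\n{task}")
    steps = [line.strip() for line in plan.splitlines() if line.strip()]
    return [executor(step) for step in steps]


# Stubs standing in for a frontier model and a local model.
def fake_frontier(prompt: str) -> str:
    return "add a retry helper\nuse it in fetch()\n"


def fake_local(prompt: str) -> str:
    return f"[local] done: {prompt}"


results = run_two_tier("make fetch() resilient", fake_frontier, fake_local)
for r in results:
    print(r)
```

In this sketch only the task description reaches the planner, while the executor is the one that would touch source files. How much context to share with the hosted model is a policy decision the code does not make for you.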

Attribution:

horsawlarway #1
mgsram #1
garethsprice #1

Against the grain

The economics still favor cloud for most teams

For people optimizing for developer throughput rather than privacy or tinkering, the hardware and setup cost still looks bad. A multi-thousand-dollar local rig plus tuning time buys a system that remains below Claude Code or top hosted models on difficult work. If the goal is shipping software, subsidized subscriptions and cheap hosted APIs still win on total cost of getting correct work done.

Run the math against engineer time, not just subscription fees. If you do not have a privacy requirement, a cloud model may still be the cheaper tool even when local inference looks free after purchase.
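A back-of-the-envelope version of that math, with engineer time priced in. All figures are hypothetical placeholders, not numbers from the discussion; substitute your own.

```python
def monthly_cost_local(
    hardware_usd: float,
    lifetime_months: int,
    setup_hours: float,
    extra_hours_per_month: float,
    hourly_rate_usd: float,
) -> float:
    """Amortized hardware and setup cost, plus ongoing time lost to tuning and retries."""
    one_time = hardware_usd + setup_hours * hourly_rate_usd
    return one_time / lifetime_months + extra_hours_per_month * hourly_rate_usd


# Hypothetical inputs: a 4000 USD rig used for 3 years, 20 hours of setup,
# 2 extra hours per month of babysitting, engineer time valued at 100 USD/hour.
local = monthly_cost_local(
    hardware_usd=4000,
    lifetime_months=36,
    setup_hours=20,
    extra_hours_per_month=2,
    hourly_rate_usd=100,
)
cloud = 200.0  # hypothetical monthly subscription

print(f"local: {local:.2f} USD/month, cloud: {cloud:.2f} USD/month")
```

Under these assumptions the ongoing engineer time, not the hardware, is the larger share of the local cost. A privacy requirement changes the comparison entirely, since the cloud option may then not be available at any price.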

Attribution:

codinhood #1 #2
sakopov #1

Local capability claims are overstated

Some of the sharpest pushback came from people who use frontier models heavily and think the gap is still enormous. Their complaint was not that local models are useless. It was that enthusiasts blur "good enough for boilerplate" into "replacement for Opus or GPT on hard engineering." On large codebases, subtle debugging, or design-heavy work, they said the drop-off is obvious and costly.

Be skeptical of parity claims unless they come from side-by-side tests on real work. For hard engineering tasks, assume local is still a tier down until your own evaluations prove otherwise.

Attribution:

jwr #1
redox99 #1
user43928 #1

Cloud models may be winning through orchestration

One commenter argued that comparing a single local model run to a hosted frontier endpoint may be the wrong mental model entirely. The claim is that top providers are likely layering hidden orchestration, response shaping, and multi-step internal processing around their models, which makes API behavior stronger than a straightforward one-pass decode. If that is true, then local users may need better orchestration stacks, not just better base weights, to close the gap.

Do not assume the frontier advantage is only in bigger weights. Improving local results may require multi-model workflows, reviewers, and planners rather than waiting for a single local checkpoint to magically match hosted behavior.
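One simple form of orchestration is a draft, review, revise loop. This is a sketch of the general idea only, not a description of what any hosted provider actually does. The function names are hypothetical and the stub models stand in for real ones.

```python
from typing import Callable

Model = Callable[[str], str]  # takes a prompt, returns text


def draft_review_revise(task: str, drafter: Model, reviewer: Model, max_rounds: int = 3) -> str:
    """Let a reviewer critique each draft until it answers "OK" or rounds run out."""
    draft = drafter(task)
    for _ in range(max_rounds):
        feedback = reviewer(draft)
        if feedback == "OK":
            return draft
        draft = drafter(
            f"{task}\nPrevious attempt:\n{draft}\nReviewer feedback:\n{feedback}"
        )
    return draft  # best effort after max_rounds


# Stubs: the drafter only adds tests once a reviewer has asked for them.
def stub_drafter(prompt: str) -> str:
    return "v2 with tests" if "Reviewer feedback" in prompt else "v1"


def stub_reviewer(draft: str) -> str:
    return "OK" if "tests" in draft else "add tests"


final = draft_review_revise("write a parser", stub_drafter, stub_reviewer)
print(final)
```

The drafter and reviewer can be the same local model with different prompts, or two different models. Either way the loop trades extra tokens for a second look at each answer.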

Attribution:

blurbleblurble #1
_bobm #1
XCSme #1

In plain English

27B ↩

A shorthand for a model with about 27 billion parameters, which are the learned numerical values inside the model.

DeepSeek ↩

A Chinese AI lab and model family often cited as a major source of low-cost, capable open or open-weight models.

KV cache ↩

Key-value cache, stored intermediate attention data that lets a language model avoid recomputing the entire prompt on each generation step.

llama.cpp ↩

A popular open source project for running language models efficiently on local hardware.

MCP ↩

Model Context Protocol, a standard for connecting AI models to external tools, data sources, and services.

OpenCode ↩

A coding agent tool that commenters used as a harness for local models, described as more manual to configure for local inference than Pi.

Opus ↩

A higher-end Claude model tier referenced by commenters for coding and planning tasks.

Pi ↩

A coding agent harness that commenters most consistently recommended for driving local models.

quantization ↩

A technique that reduces the precision of model weights to cut memory use and speed up AI inference.

Qwen ↩

A family of language models from Alibaba. Qwen 3.6, in its 27B dense and 35B A3B mixture-of-experts variants, was the local model commenters used most for coding.

Sonnet ↩

A Claude model tier often used for coding and general tasks, typically cheaper than Opus.

Strix Halo ↩

An AMD processor platform with large shared memory that some people use for local AI inference.

Reference links

Setup guides and tooling

How to setup a local coding agent on macOS
A practical recipe commenters used to get a local coding stack running on Apple hardware.
LocalCodingLLM
A repository with setup details for running Qwen locally with llama.cpp and OpenCode.
ds4
A local coding agent project repeatedly recommended as close to the current state of the art for self-hosted coding workflows.
oMLX
A macOS-native MLX server suggested for local model serving on Apple hardware.
llama-cpp-manager
A tool for managing llama.cpp model configurations in local setups.
llama-dash
An ops-style tool for proxying, logging, and managing local LLM inference workflows.

Harness and workflow references

The harness problem
A post about hash-based edit tooling to reduce model editing failures.
Crush
A coding harness one commenter uses with local models, praised for a smaller system prompt and built-in LSP support.
Headroom
A utility used to squeeze more effective use out of a limited context window.
BrowserMCP
An MCP tool for driving Chrome from local coding agents.
firefox-devtools-mcp
A Firefox MCP integration mentioned as part of local browsing workflows.

Model files and model-serving resources

Qwen3.6-35B-A3B-MTP-GGUF on Hugging Face
A concrete quantized local model file recommended for 48GB-class Apple hardware.
Gemma 4 26B A4B QAT GGUF
A quantization-aware trained Gemma release suggested as a better quantized option.
Gemma 4 quantization-aware training announcement
Background on why the QAT Gemma builds can preserve quality while reducing memory use.
LiquidAI LFM2.5-1.2B-Instruct
A tiny local model someone is building around for CPU-based experimentation.

Benchmarks, evaluations, and comparisons

Bijan Bowen Claude 4 Opus browser OS test
A YouTube comparison cited to argue about practical coding output versus local Qwen.
Bijan Bowen Qwen 3.6 35B A3B browser OS test
The paired Qwen test used in the same comparison against Claude 4 Opus.
Artificial Analysis small open-source model comparison
A benchmark page used to discuss tool-calling differences across local models.

Opinion and ecosystem commentary

Stop using Ollama
A critique linked during complaints that Ollama is drifting toward cloud monetization.
My Homelab AI Dev Platform
A related HN post pointed out as another hands-on local dev setup reference.