Show HN: Smart model routing directly in Claude, Codex and Cursor

AI
Developer Tools
Infrastructure
Open Source

Weave released a self-hostable model router for coding agents. It exposes an Anthropic- or OpenAI-compatible endpoint, watches each inference request from tools like Claude Code, Codex, Cursor, or OpenCode, and chooses which model should handle that step. The claim is that coding sessions do not need frontier models all the time, so the router can send planning or gnarly debugging to expensive models like Opus while pushing simpler exploration or implementation work to cheaper models. Weave says it trained the router on tens of thousands of agent traces with reinforcement learning and has seen about 40% token savings internally with no visible hit to quality or velocity.

Most of the useful discussion landed on one operational constraint: prompt caching is the whole game. A naive proxy that flips models freely would destroy cache reuse and wipe out any savings. Weave's answer is that its router is stateful and cache-aware, so once a session builds cache on one model, the bar to switch gets much higher. That pushed the conversation toward a more realistic picture of how this works in practice. It is not a magical many-model blender. The main agent loop often settles into one to three models, and the cleanest wins come from subagents, which start with fresh context windows and can be routed independently without cache baggage.

The other thread that mattered was scope. Several people argued that coding agents already know when they are planning, exploring, implementing, or reviewing, and already do some internal routing. A proxy can hide that state from the harness and break the agent's own retry logic. The more persuasive framing was that the router is less interesting as a universal prompt-level brain and more interesting as a cost-control layer for mixed fleets, especially when agent vendors favor their own models and ignore cheaper open or third-party ones. Even there, people wanted proof. Repeated asks were for published evals on coding-agent benchmarks, cost versus latency curves, and evidence that quality holds up once you include wrong initial routes, recovery steps, and cache misses. Weave said those evals exist internally and should be published.

A smaller but important thread was that model behavior depends heavily on the harness, not just the underlying model. People reported the same model acting differently in Claude Code, Copilot CLI, OpenCode, and other shells. That makes routing harder than simply reading the prompt text, because the wrapper's system prompts, context limits, and tool orchestration shape what the model can actually do. That, more than the reinforcement learning story, is why many people treated this as promising but unproven infrastructure. The idea resonates because token budgets are suddenly painful, but the credible version is narrow, cache-aware, benchmarked on real agent traces, and probably aimed at teams with enough usage to justify another control plane.

If you are spending real money on coding agents, model routing is becoming an infrastructure problem, not just a prompt tweak. But the useful version likely needs cache-aware session management, subagent support, and hard evals against your own workload before it is safer than simply locking a few known-good model choices.
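To make the cache constraint concrete, here is a minimal sketch of a cache-aware switching check. It is not Weave's implementation: the `Price` fields, the `should_switch` function, and every number are hypothetical, and the model ignores output tokens, context growth, and cache-write surcharges.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Price:
    """Hypothetical per-million-token input prices in USD."""
    uncached: float  # input tokens read without a cache hit
    cached: float    # input tokens served from the prompt cache


def should_switch(current: Price, candidate: Price,
                  context_tokens: int, remaining_turns: int) -> bool:
    """Return True only if switching is cheaper over the remaining turns.

    Assumption: the context keeps the same size, and the candidate
    model's cache is warm after its first (uncached) turn.
    """
    millions = context_tokens / 1_000_000
    # Staying: every remaining turn re-reads the context from cache.
    stay = remaining_turns * millions * current.cached
    # Switching: the first turn re-reads the whole context uncached,
    # later turns hit the candidate model's own cache.
    switch = millions * candidate.uncached
    switch += max(remaining_turns - 1, 0) * millions * candidate.cached
    return switch < stay


# Placeholder prices, not any vendor's real rates.
strong = Price(uncached=15.0, cached=1.5)
cheap = Price(uncached=3.0, cached=0.3)
```

With these placeholder prices and a 100,000-token context, `should_switch(strong, cheap, 100_000, 1)` is false and `should_switch(strong, cheap, 100_000, 5)` is true: the cheaper model only pays off when enough turns remain to amortize the cache it throws away.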

June 26, 2026
github.com

Discussion mood

Interested but skeptical. People liked the cost-saving goal and agreed token spend has become painful, but they kept circling back to cache invalidation, loss of agent-level control, weak public evidence, and the risk that routing is far less useful than just picking a small set of known-good models per workflow.

Key insights

Subagents are the easiest routing win

Subagents make the product sound more plausible than the main agent loop does. Because they start with a fresh context window, they do not inherit cache from the parent session, so routing them to cheaper or specialized models avoids the biggest penalty that hurts mid-session switching. That turns routing from a brittle per-turn optimization into a cleaner orchestration problem around task decomposition.

If you want to test model routing, start with subagent tasks like code search, summarization, or isolated implementation steps. You will learn faster there than by trying to dynamically swap the primary model in a long-running cached session.
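A first experiment along these lines might look like the sketch below. The model names, the task labels, and the `is_subagent` flag are all assumptions for illustration; how a real proxy detects a subagent request depends on the harness.

```python
# Hypothetical mapping from subagent task kind to model name.
SUBAGENT_MODELS = {
    "code_search": "cheap-model",
    "summarization": "cheap-model",
    "implementation": "mid-model",
}


def pick_model(session_model: str, is_subagent: bool, task_kind: str) -> str:
    """Pin the main loop to its model; route only fresh-context subagents."""
    if not is_subagent:
        # The main session has built up cache on this model, so keep it.
        return session_model
    # A subagent starts with an empty context, so there is no cache to lose.
    # Unknown task kinds fall back to the session's model.
    return SUBAGENT_MODELS.get(task_kind, session_model)
```

Falling back to the session model for unknown task kinds keeps the experiment conservative: anything the table does not cover behaves exactly as it did before routing was added.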

Attribution:

alansaber #1
adchurch #1 #2

Harness behavior can dominate model choice

The wrapper around a model appears to change outcomes enough that routing on model names alone can miss the real source of performance. People described Opus behaving noticeably differently in Copilot CLI, Claude Code, OpenCode, and Pi, with context limits and tool orchestration likely doing as much work as the base model. That means a router trained on traces from one harness may generalize poorly to another, even when the nominal model is the same.

Treat harness and model as a combined unit in your evals, logging, and routing rules. If you swap shells, context windows, or tool setups, assume your routing policy may need to be retrained or re-tuned.
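One way to do that in eval reporting is to key every result on the (harness, model) pair instead of the model alone. The sketch below assumes a hypothetical record shape of `(harness, model, succeeded)`, and the sample records are placeholders, not measurements.

```python
from collections import defaultdict


def success_rate_by_pair(records):
    """Aggregate pass rates per (harness, model) pair.

    `records` is an iterable of (harness, model, succeeded) tuples.
    """
    totals = defaultdict(lambda: [0, 0])  # pair -> [wins, runs]
    for harness, model, succeeded in records:
        stats = totals[(harness, model)]
        stats[0] += int(succeeded)
        stats[1] += 1
    return {pair: wins / runs for pair, (wins, runs) in totals.items()}


# Placeholder records purely to show the shape of the output.
sample = [
    ("harness-a", "model-x", True),
    ("harness-a", "model-x", True),
    ("harness-b", "model-x", True),
    ("harness-b", "model-x", False),
]
rates = success_rate_by_pair(sample)
```

Here `rates` contains separate entries for `("harness-a", "model-x")` and `("harness-b", "model-x")`, so a gap between harnesses shows up directly instead of being averaged away under one model name.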

Attribution:

devmor #1
ValentineC #1
adchurch #1

Locked workflows may beat live routing

For repeatable classes of work, predefining which model to use at each phase may be more reliable than making decisions on the fly. If you already have evals, holdback data, and stable task shapes, you can tune prompts and model choices offline and avoid paying routing mistakes during production sessions. That reframes live routing as a convenience layer for messy interactive work, not the default best practice for everything.

Separate your workloads before adopting a router. Keep deterministic internal flows on fixed model policies, and reserve dynamic routing for exploratory sessions where the task shape really changes midstream.
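As a sketch of that split, the snippet below keeps a fixed phase-to-model table for deterministic flows and defers to a router only for exploratory sessions. The phase names, workload labels, and model names are hypothetical, and `dynamic_router` stands in for whatever routing call you actually use.

```python
from typing import Callable

# Hypothetical fixed policy, tuned offline against your own evals.
FIXED_POLICY = {
    "plan": "strong-model",
    "implement": "cheap-model",
    "review": "strong-model",
}


def choose_model(workload: str, phase: str,
                 dynamic_router: Callable[[str], str]) -> str:
    """Use the locked table for deterministic flows, the router otherwise."""
    if workload == "deterministic":
        # A missing phase raises KeyError on purpose: fixed flows should
        # never fall through to a guess.
        return FIXED_POLICY[phase]
    return dynamic_router(phase)
```

Raising on an unknown phase is a design choice in this sketch; it forces every step of a locked workflow to have an explicit, reviewed model choice.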

Attribution:

peterbell_nyc #1
gopher_space #1
jpease #1

Recovery logic matters more than first-pass accuracy

The hard part is not guessing the perfect model from the opening prompt. It is detecting when a cheaper model is stuck and escalating quickly enough that the savings survive the mistake. Weave said the system uses prompt and context embeddings plus rescue guardrails, which suggests the router's value depends as much on fallback policy as on the learned classifier itself.

When you evaluate a router, measure cost and latency after retries, stalls, and escalations, not just first-choice accuracy. A weak first route can still be acceptable if failure detection is fast and cheap.
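The arithmetic behind that advice can be sketched as an expected-cost formula for a cheap-first policy with escalation. The function and parameter names are hypothetical, and the model is deliberately simple: one cheap attempt, at most one escalation, and costs in arbitrary units. The same shape works for latency.

```python
def expected_cost(cheap_cost: float, strong_cost: float,
                  failure_rate: float, detect_fraction: float = 1.0) -> float:
    """Expected per-task cost when trying the cheap model first.

    failure_rate: share of tasks where the cheap model gets stuck.
    detect_fraction: share of the cheap attempt's cost already spent
        when the failure is detected (1.0 means detected only at the end).
    """
    # Tasks the cheap model finishes cost only the cheap attempt.
    success = (1.0 - failure_rate) * cheap_cost
    # Failed tasks pay for the wasted part of the cheap attempt,
    # then for a full run on the strong model.
    failure = failure_rate * (detect_fraction * cheap_cost + strong_cost)
    return success + failure
```

With placeholder costs of 1 for the cheap model and 5 for the strong one, a 20% failure rate gives an expected cost of about 2.0, well under always using the strong model. At a 90% failure rate the same policy costs about 5.5, which is worse than not routing at all. Lowering `detect_fraction` shows how much fast failure detection buys back.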

Attribution:

GodelNumbering #1
adchurch #1 #2
mjb #1

Against the grain

Two-model setups may already capture most value

The strongest skeptical case was that cache pressure collapses the fancy routing story into something much simpler. Once switching gets expensive, you mostly end up with one strong planner and one cheaper executor, which many teams already do without adding a proxy. On that view, a learned router is extra moving parts chasing marginal gains.

Before adding a routing layer, benchmark a simple two-model policy against your current setup. If it gets close to the claimed savings, the operational overhead of a smarter router may not be worth it.
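A rough way to frame that benchmark is to replay the same tasks under three policies and ask how much of the router's saving the two-model setup already captures. The helper names and the per-task costs below are placeholders, not results from any real run.

```python
def savings(baseline_costs, policy_costs):
    """Fractional cost reduction of a policy relative to the baseline."""
    return 1.0 - sum(policy_costs) / sum(baseline_costs)


def share_captured(baseline_costs, two_model_costs, router_costs):
    """Share of the router's saving that the simple policy already gets."""
    router_saving = savings(baseline_costs, router_costs)
    if router_saving <= 0:
        raise ValueError("router did not save anything on these tasks")
    return savings(baseline_costs, two_model_costs) / router_saving


# Placeholder per-task costs for the same two tasks under each policy.
baseline = [10.0, 10.0]
two_model = [7.0, 7.0]
router = [6.0, 6.0]
share = share_captured(baseline, two_model, router)
```

In this made-up example the two-model policy saves about 30% and the router about 40%, so `share` comes out near 0.75. A real comparison should use costs measured after retries and cache misses on your own workload.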

Attribution:

GodelNumbering #1 #2

Model vendors may absorb this feature

A credible counterpoint was that the best long-term router may come from the model providers themselves, not an independent proxy. They control pricing, caching, and native agent behavior, so they can route within their own model families more cheaply than a third party can. The limit is that they have little reason to send traffic to competing models.

Do not assume an external router will stay structurally advantaged. If your strategy depends on this layer, watch whether Anthropic, OpenAI, or Cursor add enough native routing to erase the benefit for same-vendor stacks.

Attribution:

asdev #1 #2
adchurch #1

In plain English

cache-aware

Designed to account for the cost or benefit of reusing cached context when making decisions.

Claude Code

Anthropic's command-line coding agent, built around Claude models.

Codex

OpenAI's coding-focused product and model interface for software development tasks.

context window

The amount of prior text and instructions a model can consider at once when generating an answer.

Cursor

An AI-assisted code editor that includes model-selection features such as Auto mode and subagents.

embeddings

Numerical representations of text or other data that let systems compare similarity or cluster related inputs.

evals

Short for evaluations, the tests and benchmarks used to compare model or system performance.

holdback data

A reserved test dataset that is not used during tuning, so it can measure how well a system generalizes.

OpenCode

An open coding-agent harness or interface mentioned as one of the tools this router can sit behind.

prompt caching

A provider feature that reuses computation for repeated prompt context so later requests are cheaper or faster if the earlier context stays the same.

Reference links

Project and demo

workweave/router GitHub repository
The source-available router project being launched.
Weave Router demo video
Short demo showing the router running locally.

Related routing projects and benchmarks

vLLM Semantic Router website
A related model-routing project raised for comparison.
vLLM Semantic Router GitHub repository
Code for the related routing project discussed in comparison.
vLLM Semantic Router paper
Paper describing the routing approach and algorithms mentioned in the comparison.
Sakana Fugu
Referenced as similar in spirit to one of the vLLM routing algorithms.
RouteWorks leaderboard
Leaderboard cited as a place where the project appears to perform well.
RouterArena paper
Open comparison framework for large language model routers mentioned in the benchmark discussion.
RouterArena GitHub repository
Code for the router comparison platform mentioned alongside the paper.

Related tools and references

Murmur GitHub repository
Another tool mentioned for delegating work across coding assistants and subscriptions.
Fortune article on Uber AI token spending
Used as a public example of token costs becoming a management problem.