Weave released a self-hostable model router for coding agents. It exposes an Anthropic- or OpenAI-compatible endpoint, watches each inference request from tools like Claude Code, Codex, Cursor, or OpenCode, and chooses which model should handle that step. The claim is that coding sessions do not need frontier models all the time, so the router can send planning or gnarly debugging to expensive models like Opus while pushing simpler exploration or implementation work to cheaper models. Weave says it trained the router on tens of thousands of agent traces with reinforcement learning and has seen about 40% token savings internally with no visible hit to quality or velocity.
Most of the useful discussion landed on one operational constraint: prompt caching is the whole game. A naive proxy that flips models freely would destroy cache reuse and wipe out any savings. Weave's answer is that its router is stateful and cache-aware, so once a session builds cache on one model, the bar to switch gets much higher. That pushed the conversation toward a more realistic picture of how this works in practice. It is not a magical many-model blender. The main agent loop often settles into one to three models, and the cleanest wins come from subagents, which start with fresh context windows and can be routed independently without cache baggage.
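The cache-aware behavior described above can be sketched as a switching threshold that grows with the amount of cached context. This is only an illustration under stated assumptions, not Weave's implementation: the class names, the per-model `scores` input, and both margin constants are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Dict, Optional


@dataclass
class SessionState:
    """Per-session routing state (hypothetical, for illustration only)."""
    model: Optional[str] = None   # model that currently holds this session's prompt cache
    cached_tokens: int = 0        # prompt tokens assumed to be cached on that model


@dataclass
class CacheAwareRouter:
    base_margin: float = 0.05           # assumed minimum score gap needed to switch
    margin_per_1k_cached: float = 0.02  # assumed extra gap required per 1,000 cached tokens
    sessions: Dict[str, SessionState] = field(default_factory=dict)

    def route(self, session_id: str, scores: Dict[str, float], prompt_tokens: int) -> str:
        """Pick a model for one request; `scores` maps model name to predicted fit."""
        state = self.sessions.setdefault(session_id, SessionState())
        best = max(scores, key=scores.get)
        if state.model is None:
            chosen = best  # fresh session (e.g. a subagent): no cache to protect
        else:
            # The bar to switch rises with the amount of cache that would be thrown away.
            margin = self.base_margin + self.margin_per_1k_cached * state.cached_tokens / 1000
            gap = scores[best] - scores.get(state.model, 0.0)
            chosen = best if gap > margin else state.model
        # After this request, the chosen model holds a cache for roughly this prompt.
        state.model = chosen
        state.cached_tokens = prompt_tokens
        return chosen


router = CacheAwareRouter()
first = router.route("main", {"cheap": 0.60, "frontier": 0.50}, prompt_tokens=2_000)
second = router.route("main", {"cheap": 0.50, "frontier": 0.55}, prompt_tokens=4_000)
subagent = router.route("main/sub-1", {"cheap": 0.50, "frontier": 0.55}, prompt_tokens=500)
print(first, second, subagent)
```

In this toy run the main session stays on the cheaper model for the second request even though the other model scores slightly higher, because the gap does not clear the cache-adjusted margin. The subagent session has no cache yet, so the same scores send it straight to the higher-scoring model.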
The other thread that mattered was scope. Several people argued that coding agents already know when they are planning, exploring, implementing, or reviewing, and already do some internal routing. A proxy that overrides those choices can hide routing state from the harness and break the agent's own retry logic. The more persuasive framing was that the router is less interesting as a universal prompt-level brain and more interesting as a cost-control layer for mixed fleets, especially when agent vendors favor their own models and ignore cheaper open or third-party ones. Even there, people wanted proof. Repeated asks were for published evals on coding-agent benchmarks, cost-versus-latency curves, and evidence that quality holds up once you include wrong initial routes, recovery steps, and cache misses. Weave said those evals exist internally and should be published.
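The accounting commenters asked for can be sketched in a few lines: charge the routed path for its cheap first attempt plus the expected cost of a cold-cache recovery on the expensive model, then compare against always using the expensive model. Every number below is made up for illustration. The prices, token counts, cache hit rate, and misroute rate are assumptions, not Weave's figures or any vendor's published rates.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Pricing:
    """Dollar prices per million tokens (hypothetical values, not real rate cards)."""
    input_per_mtok: float
    cached_input_per_mtok: float
    output_per_mtok: float


def step_cost(p: Pricing, prompt_tokens: int, output_tokens: int, cache_hit_rate: float) -> float:
    """Dollar cost of one request when `cache_hit_rate` of the prompt is read from cache."""
    cached = prompt_tokens * cache_hit_rate
    fresh = prompt_tokens - cached
    total = (fresh * p.input_per_mtok
             + cached * p.cached_input_per_mtok
             + output_tokens * p.output_per_mtok)
    return total / 1_000_000


def expected_routed_cost(cheap: Pricing, frontier: Pricing, prompt_tokens: int,
                         output_tokens: int, cache_hit_rate: float,
                         misroute_rate: float) -> float:
    """Cheap first attempt, plus a frontier retry with a cold cache when the route was wrong."""
    first_try = step_cost(cheap, prompt_tokens, output_tokens, cache_hit_rate)
    recovery = step_cost(frontier, prompt_tokens, output_tokens, 0.0)
    return first_try + misroute_rate * recovery


# Assumed prices and workload, chosen only to make the trade-off visible.
frontier = Pricing(input_per_mtok=15.0, cached_input_per_mtok=1.5, output_per_mtok=75.0)
cheap = Pricing(input_per_mtok=1.0, cached_input_per_mtok=0.1, output_per_mtok=5.0)

baseline = step_cost(frontier, 50_000, 1_000, cache_hit_rate=0.9)
routed_ok = expected_routed_cost(cheap, frontier, 50_000, 1_000, 0.9, misroute_rate=0.10)
routed_bad = expected_routed_cost(cheap, frontier, 50_000, 1_000, 0.9, misroute_rate=0.30)

print(f"always frontier: ${baseline:.4f}")
print(f"routed, 10% misroutes: ${routed_ok:.4f}")
print(f"routed, 30% misroutes: ${routed_bad:.4f}")
```

With these invented numbers, routing wins comfortably at a 10% misroute rate but costs more than the always-frontier baseline at 30%, because each recovery step pays full uncached input prices. That is the effect the requested evals would need to measure on real traces.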
A smaller but important thread was that model behavior depends heavily on the harness, not just the underlying model. People reported the same model acting differently in Claude Code, Copilot CLI, OpenCode, and other shells. That makes routing harder than simply reading the prompt text, because the wrapper's system prompts, context limits, and tool orchestration shape what the model can actually do. That, more than the reinforcement learning story, is why many people treated this as promising but unproven infrastructure. The idea resonates because token budgets are suddenly painful, but the credible version is narrow, cache-aware, benchmarked on real agent traces, and probably aimed at teams with enough usage to justify another control plane.