OpenRouter Fusion is a routing layer that fans a single request out to multiple LLMs, then uses a judge model to produce one final answer. OpenRouter frames it as a way to beat individual frontier models on at least one deep-research benchmark, with a cheaper preset built from smaller models and a pricier preset built from top-end ones. The comments mostly landed on a simpler interpretation: this is ensemble inference sold as an API product. People have been doing versions of it in agent harnesses, code-review flows, and consensus tools for a while.
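To make the fan-out-then-judge shape concrete, here is a minimal sketch. It does not use OpenRouter's actual API: `Model`, `fuse`, `make_stub`, and `stub_judge` are hypothetical names, and the "models" are local stubs so the example runs without network access.

```python
import asyncio
from typing import Awaitable, Callable

# Hypothetical type: any async function that maps a prompt to a completion.
Model = Callable[[str], Awaitable[str]]


async def fuse(prompt: str, panel: dict[str, Model], judge: Model) -> str:
    """Fan one prompt out to every panel model, then ask a judge to synthesize."""
    names = list(panel)
    # Run all panel calls concurrently, then wait for every answer.
    answers = await asyncio.gather(*(panel[name](prompt) for name in names))
    candidates = "\n\n".join(
        f"[{name}]\n{answer}" for name, answer in zip(names, answers)
    )
    judge_prompt = (
        f"Question:\n{prompt}\n\nCandidate answers:\n{candidates}\n\n"
        "Write the single best final answer."
    )
    # The judge is one extra call that reads every candidate.
    return await judge(judge_prompt)


def make_stub(reply: str) -> Model:
    """Build a fake model that always returns the same reply (assumption for the demo)."""
    async def call(prompt: str) -> str:
        return reply
    return call


async def stub_judge(prompt: str) -> str:
    # Toy judge: only reports how many labeled candidates it was shown.
    return f"synthesized from {prompt.count('[model-')} candidates"


if __name__ == "__main__":
    panel = {"model-a": make_stub("4"), "model-b": make_stub("four")}
    print(asyncio.run(fuse("What is 2+2?", panel, stub_judge)))
```

In a real setup each stub would be replaced by an API call. The structure stays the same: N parallel generations followed by one judge call that sees all of them.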
The useful distinction readers kept making is between “more diverse opinions” and “more test-time compute.” Fusing a model with itself and still seeing gains strongly suggests that the main lift comes from sampling multiple candidate answers and selecting among them, not from some magical cross-model wisdom. That also explains why several people reported better results only on tasks where answers are easy to verify, like resume tailoring or code review. On fuzzy work, judge models often just prefer answers that look like what they would have written anyway. Several builders said extra review rounds made outputs slower, more expensive, and sometimes more timid rather than better.
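The "sample several candidates, then select" reading can be sketched without any second model at all. The code below is an illustration under stated assumptions, not a description of how Fusion selects answers: `best_of_n`, `sample`, and `verify` are hypothetical names, and the probability helper assumes each sample passes independently with the same probability, which real model outputs only approximate.

```python
import random
from typing import Callable, Optional


def best_of_n(
    sample: Callable[[random.Random], str],
    verify: Callable[[str], bool],
    n: int,
    seed: int = 0,
) -> Optional[str]:
    """Draw up to n candidates from ONE model; return the first that verifies."""
    rng = random.Random(seed)
    for _ in range(n):
        candidate = sample(rng)
        if verify(candidate):
            return candidate
    return None  # No candidate passed the check.


def pass_at_least_once(p: float, n: int) -> float:
    """Chance that at least one of n independent samples passes, each with probability p."""
    return 1.0 - (1.0 - p) ** n


if __name__ == "__main__":
    # Hypothetical noisy "model": right 30% of the time on a checkable question.
    def noisy_model(rng: random.Random) -> str:
        return "4" if rng.random() < 0.3 else "5"

    print(best_of_n(noisy_model, lambda answer: answer == "4", n=5))
    print(round(pass_at_least_once(0.3, 5), 3))  # 0.832
```

The gain depends entirely on `verify` being reliable. When the check is a judge model's taste rather than a test or a diff, the selection step can simply pick the answer the judge would have written itself.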
Where people sounded genuinely positive was on planning, specs, architecture reviews, and other expensive-to-get-wrong tasks where a human can inspect the final synthesis and pick what matters. A recurring pattern was to use multiple personas or multiple strategies in parallel, then have a cheap arbiter rank or merge the outputs. That was seen as more useful than naive “ask several models, trust the consensus.” There was also skepticism about OpenRouter’s published benchmark story. Results like repeated Opus runs nearly matching stronger models, and some rankings that put DeepSeek ahead of expected frontier leaders, made readers think the benchmark may reward throwing more tokens and retries at a narrow class of tasks rather than proving broad superiority.
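The persona-plus-arbiter pattern differs from plain consensus in that the arbiter only ranks, and a human still reads the result. A rough sketch, assuming hypothetical `generate(system_prompt, task)` and `score(task, draft)` callables and made-up persona prompts:

```python
from typing import Callable

# Hypothetical signatures, not a real library API.
Generate = Callable[[str, str], str]   # (system_prompt, task) -> draft
Score = Callable[[str, str], float]    # (task, draft) -> arbiter score

# Illustrative personas; the wording is an assumption.
PERSONAS = {
    "security reviewer": "Review this plan strictly for security risks.",
    "cost skeptic": "Review this plan strictly for cost and operational burden.",
    "maintainer": "Review this plan strictly for long-term maintainability.",
}


def ranked_reviews(
    task: str,
    generate: Generate,
    score: Score,
    personas: dict[str, str] = PERSONAS,
) -> list[tuple[float, str, str]]:
    """Produce one draft per persona and return them sorted by arbiter score, best first."""
    drafts = [(name, generate(system, task)) for name, system in personas.items()]
    scored = [(score(task, draft), name, draft) for name, draft in drafts]
    return sorted(scored, key=lambda item: item[0], reverse=True)


if __name__ == "__main__":
    # Stubs stand in for the generator model and the cheap arbiter model.
    fake_generate = lambda system, task: f"{system} Task: {task}"
    fake_score = lambda task, draft: float(len(draft))
    for points, name, draft in ranked_reviews("Migrate to Postgres", fake_generate, fake_score):
        print(points, name)
```

Returning the full ranked list, rather than only the winner, keeps the human in the position commenters described: inspecting the synthesis and picking what matters.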
The practical read is that Fusion is convenient infrastructure for teams that do not want to hand-roll a swarm or council system. But it is not a free lunch. Multiple commenters measured or felt 4x to 7x higher costs and big latency hits, some saw failures in the product itself, and one noticed an undisclosed Opus judge call appearing in logs even after choosing other models. The idea seems real. The default use case is much narrower than the marketing.
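A back-of-the-envelope model shows why the cost multiplier tends to exceed the number of panel models: the judge has to read every candidate as input. The function below is a sketch with invented token counts and prices, and it assumes every panel model and the judge share one price, which real presets would not. It is an illustration of the arithmetic, not a reconstruction of what commenters measured.

```python
def fusion_cost_multiplier(
    n_models: int,
    prompt_tokens: int,
    output_tokens: int,
    price_in: float,
    price_out: float,
) -> float:
    """Cost of an n-model panel plus one judge call, relative to a single call.

    Simplifying assumptions: same prices for all models and the judge,
    every answer has the same length, and the judge sees the prompt
    plus all candidate answers.
    """
    single = prompt_tokens * price_in + output_tokens * price_out
    panel = n_models * single
    judge_input = prompt_tokens + n_models * output_tokens
    judge = judge_input * price_in + output_tokens * price_out
    return (panel + judge) / single


if __name__ == "__main__":
    # Hypothetical numbers: 1,000-token prompt, 1,000-token answers,
    # price units of 3 per input token and 15 per output token.
    for n in (3, 4):
        print(n, round(fusion_cost_multiplier(n, 1000, 1000, 3.0, 15.0), 2))
    # 3 4.5
    # 4 5.67
```

Latency follows a similar shape: the panel calls can run in parallel, but the judge call cannot start until the slowest candidate is finished.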