VibeThinker: 3B param model that beats Opus 4.5 on reasoning with novel SFT+GRPO

The paper introduces VibeThinker-3B, a compact model built on Qwen2.5-Coder-3B and trained with supervised fine-tuning (SFT) plus Group Relative Policy Optimization (GRPO) to push performance on verifiable reasoning tasks. The headline benchmark claim is that this 3B model can beat much larger models such as Opus 4.5 on math and coding evaluations. The important context is that it is not a general-purpose assistant. It is a narrow, post-trained reasoning model aimed at closed-world problems, where all the needed information is already in the prompt and the answer is easy to verify after the fact.

People who ran it locally reported the same pattern again and again. It can be shockingly good for its size on math, competitive-programming-style coding, and tightly scoped analysis. It falls over on normal conversation, tool calling, repo-wide bug hunting, factual recall, structured output (unless generation is constrained), and tasks like SVG generation that depend on broad world knowledge or richer interaction loops.

Treat this as a specialized reasoning component, not a drop-in general assistant. If you run local coding or analysis stacks, the practical move is to pair a cheap orchestration or tool-use model with a small verifier like this for bounded tasks you can check automatically.

June 23, 2026
arxiv.org

Discussion mood

Excited but not gullible. People were impressed that a 3B local model can do real math and bounded coding work, but the dominant reaction was to narrow the claim hard: this is a specialist reasoning module, not evidence that tiny models can replace frontier assistants or that reasoning can be cleanly separated from knowledge.

Key insights

Closed-world reasoning is the whole game

The benchmark wins make sense once you read the model as a solver for closed-world, verifiable tasks rather than a shrunken general assistant. It was trained where the needed facts are already in context and the reward is easy to score. That setting suits Group Relative Policy Optimization, which ranks a group of sampled answers against each other and so avoids the separate value model that Proximal Policy Optimization needs. That framing explains both the impressive math and coding numbers and the sharp drop-off on research, agent loops, and factual work.
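To make the "no value model" point concrete, here is a minimal sketch of the group-relative advantage as GRPO is commonly described. It is not taken from the paper, whose exact variant may differ. The function name, the `eps` term, and the use of the population standard deviation are assumptions for illustration.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Score each sampled answer relative to its own group.

    The group mean plays the role of the baseline that a learned value
    model would supply in PPO. Population std is an assumption here;
    implementations differ on this detail.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one prompt, each scored 1.0 if a checker accepts it.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)
```

Correct answers get a positive advantage and wrong ones a negative one, using nothing but the verifier scores. If every sample in a group gets the same reward, all advantages are zero and that prompt contributes no learning signal, which is one reason cheap, discriminating checkers matter so much in this setup.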

Use it where success can be checked automatically, like math, unit-sized code generation, or bounded analysis. Do not expect the same model to discover missing context or manage open-ended workflows.

Attribution:

nsingh2 #1 #2
cold_harbor #1

Best role is subagent or validator

The strongest deployment pattern was not “replace your coding agent” but “slot this in behind one.” Because it lacks tool-calling training and struggles beyond one or two messages, it fits better as a fast reasoning pass, gatekeeper, or validator that reviews another model’s work each turn or each tool call. That turns its small size into an advantage instead of forcing it into orchestration work it was never trained for.

If you run multi-model systems, test this as a cheap second opinion on code patches, math, or constraint checking. Keep planning, tool selection, and long-horizon control in a different model.
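A rough sketch of the gatekeeper pattern described above, assuming both models are exposed as plain callables. `propose`, `validate`, `Verdict`, and `gated_step` are hypothetical names, not part of any harness mentioned in the discussion. The stubs stand in for real model calls.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Verdict:
    accepted: bool
    reason: str


def gated_step(
    propose: Callable[[str], str],            # orchestration model (hypothetical)
    validate: Callable[[str, str], Verdict],  # small reasoning model (hypothetical)
    task: str,
    max_attempts: int = 3,
) -> Tuple[Optional[str], List[Verdict]]:
    """Let the orchestrator propose; accept only what the validator passes."""
    verdicts: List[Verdict] = []
    feedback = ""
    for _ in range(max_attempts):
        candidate = propose(task + feedback)
        verdict = validate(task, candidate)
        verdicts.append(verdict)
        if verdict.accepted:
            return candidate, verdicts
        # Feed the rejection reason back to the orchestrator for the retry.
        feedback = f"\nPrevious attempt rejected: {verdict.reason}"
    return None, verdicts


# Stub models: the first proposal is wrong, the second is right.
_attempts = iter(["return a - b", "return a + b"])


def fake_orchestrator(prompt: str) -> str:
    return next(_attempts)


def fake_validator(task: str, candidate: str) -> Verdict:
    ok = "a + b" in candidate
    return Verdict(ok, "ok" if ok else "does not add the operands")


patch, history = gated_step(fake_orchestrator, fake_validator, "Write add(a, b).")
```

The validator only ever sees one self-contained question per call (task plus candidate), which matches the single-turn, closed-world shape the model was trained on. Planning and retries stay with the orchestrator.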

Attribution:

kristjansson #1
mvitorino #1
troglodytetrain #1

Reasoning without knowledge hits a hard wall

The most useful correction to the hype was that “just train reasoning and fetch facts later” breaks down fast. Choosing what to search, understanding the user’s request, selecting among tools, and connecting terms like “table tennis spin” to a Magnus effect calculator all require stored background knowledge. The point is not that compact specialists are impossible. It is that reasoning depends on a scaffolding of world and domain knowledge, so any claim of a nearly knowledge-free thinker should be treated as marketing shorthand.

When designing small local models, budget for domain priors in the weights or in retrieval that is tightly curated and easy to navigate. Raw internet access is not a substitute for built-in conceptual grounding.

Attribution:

deftio #1
secretslol #1
sigmoid10 #1
XCSme #1

The paper cuts coverage to buy reasoning

Several comments pinned down the tradeoff clearly. This model inherits from an older Qwen2.5-Coder-3B base and seems to preserve a compact reasoning core by shedding broad competence and long-tail knowledge. That is why Python-heavy and math-heavy tests look great while pelican SVG prompts, open conversation, and broad factual tasks look terrible. The claim is less “3B now equals Opus” than “a lot of benchmarkable reasoning was cheaper to compress than many assumed.”

Read benchmark claims through the lens of capability coverage. Before adopting a small model, list the exact task family you care about and probe outside it. The missing capabilities are not edge cases; they are the price paid for the win.
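One way to act on that advice is a small per-category probe that reports pass rates inside and outside the target task family. The sketch below is illustrative only: the category names, prompts, checkers, and `stub_model` are all made up, and a real probe would call the actual model and use many cases per category.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# (category, prompt, checker) where checker decides if the output passes.
Case = Tuple[str, str, Callable[[str], bool]]


def coverage_report(model: Callable[[str], str], cases: List[Case]) -> Dict[str, float]:
    """Return the pass rate per category for a model given as a callable."""
    passed: Dict[str, int] = defaultdict(int)
    total: Dict[str, int] = defaultdict(int)
    for category, prompt, check in cases:
        total[category] += 1
        if check(model(prompt)):
            passed[category] += 1
    return {category: passed[category] / total[category] for category in total}


def stub_model(prompt: str) -> str:
    # Stand-in for a narrow specialist: fine on arithmetic, lost elsewhere.
    return "4" if prompt == "2 + 2 = ?" else "I am not sure."


cases: List[Case] = [
    ("in_scope_math", "2 + 2 = ?", lambda out: out.strip() == "4"),
    ("out_of_scope_facts", "Capital of France?", lambda out: "Paris" in out),
    ("out_of_scope_svg", "Draw a pelican as SVG.", lambda out: out.lstrip().startswith("<svg")),
]

report = coverage_report(stub_model, cases)
```

Keeping in-scope and out-of-scope categories in the same report makes the coverage tradeoff visible in one place instead of leaving it to anecdote.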

Attribution:

gslepak #1
nolist_policy #1
aero2146 #1
fwipsy #1

Structured output can be bolted on

One practical datapoint was that the model’s poor native structured output is not fatal. A user got clean results for security review by letting the model reason freely inside think tags and then forcing JSON only after the closing tag through constrained generation. Another commenter turned that into a minimal multi-tool harness. That does not make the model good at tool use, but it does show some missing product features can be supplied outside the weights.

If a promising small model is weak on formatting, try grammar-constrained decoding before writing it off. You may be able to recover reliable machine-readable output without retraining the model.
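The constrained decoding itself happens inside the inference runtime and is not shown here. What follows is a sketch of the consumer side only: split the free-form reasoning from the payload at the closing think tag, then check that the payload is the JSON object you asked for. The field names (`finding`, `severity`) and the function name are assumptions for illustration, not the format used in the gist or harness mentioned above.

```python
import json
from typing import Any, Dict, Set, Tuple

THINK_OPEN = "<think>"
THINK_CLOSE = "</think>"


def split_reasoning_and_payload(raw: str, required_keys: Set[str]) -> Tuple[str, Dict[str, Any]]:
    """Separate reasoning text from the JSON that follows the closing think tag."""
    reasoning, sep, tail = raw.partition(THINK_CLOSE)
    if not sep:
        raise ValueError("no closing think tag found")
    payload = json.loads(tail.strip())  # raises ValueError on malformed JSON
    if not isinstance(payload, dict):
        raise ValueError("payload is not a JSON object")
    missing = required_keys - set(payload)
    if missing:
        raise ValueError(f"missing keys: {sorted(missing)}")
    return reasoning.replace(THINK_OPEN, "").strip(), payload


raw = (
    "<think>The query concatenates user input.</think>\n"
    '{"finding": "sql_injection", "severity": "high"}'
)
reasoning, payload = split_reasoning_and_payload(raw, {"finding", "severity"})
```

Even with grammar-constrained output, a strict check like this is a cheap safety net: it fails loudly if the runtime, the grammar, or the prompt template changes underneath you.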

Attribution:

noperator #1 #2
nickalaso #1

Repo-wide security review still needs orchestration

Security testing was a useful reality check. One commenter benchmarked it on a corpus of Mythos-discovered bugs and got zero finds, while another argued that this failure is exactly what you would expect from a model trained on self-contained tasks. Vulnerability hunting often means collecting clues across files and stitching together interactions across a codebase. A compact reasoning model can still help once that context has been assembled, but it is not the agent that gathers it.

Do not swap this into code security workflows that depend on cross-file context collection. Pair it with a stronger retrieval or tool-use layer, or keep using larger models for end-to-end review.

Attribution:

SwellJoe #1
nsingh2 #1
scotty79 #1

Against the grain

Local good enough may still feel bad

Even if local specialists become useful, some people think they will remain psychologically and practically unsatisfying as long as a cheap cloud model is plainly better. The issue is not whether a laptop can run a competent agent. It is whether teams will accept the downgrade once the gap is visible in everyday work. For some users, the threshold is not “usable offline” but “close enough to frontier that I stop thinking about the frontier.”

When evaluating local deployments, measure user tolerance for quality gaps, not just technical feasibility. A cheaper on-device stack can still lose if users keep escalating work to cloud models.

Attribution:

yousif_123123 #1
alkonaut #1
vadansky #1

Benchmarks may flatter a brittle model

A few reactions were openly skeptical because the model looks incoherent in ordinary chat and unstable outside its target tasks. That raises the possibility that the benchmark gains are narrower than the headline implies, or partly a consequence of training very directly against the test style. Even supporters conceded it is not a normal conversational model, which makes “beats Opus” easy to overread.

Validate on your own workload before treating benchmark wins as strategic news. If a model feels broken in adjacent tasks, assume the gains are highly localized until proven otherwise.

Attribution:

Catloafdev #1
andai #1
makethembroke #1

In plain English

3B

About 3 billion parameters, a rough measure of model size.

closed-world

A task setting where all the information needed to solve the problem is already provided in the prompt or context.

JSON

JavaScript Object Notation, a common text format for structured data used in APIs.

Opus 4.5

A large proprietary frontier language model used here as a comparison point on benchmarks.

Qwen2.5-Coder-3B

A 3 billion-parameter coding-focused base model from the Qwen family that VibeThinker builds on.

structured output

Model output in a strict machine-readable format such as JSON rather than free-form text.

SVG

Scalable Vector Graphics, a text-based format for vector images used on the web and in design tools.

think tags

Special markers such as <think> and </think> used to separate internal reasoning text from the final answer.

tool calling

A model feature that lets it invoke external functions, APIs, or software tools as part of solving a task.

verifiable reasoning tasks

Problems where a model’s answer may be hard to produce but easy to check automatically, such as math or programming tasks with known outputs.

Reference links

Model pages and artifacts

VibeThinker-3B model card on Hugging Face
Primary model page cited for warnings about weak tool calling and multi-turn behavior
VibeThinker-3B GGUF quant by prithivMLmods
Quantized build used for local testing and speed reports
Qwen3.5-122B-A10B GGUF quant
Referenced as an alternative local model that fits large-memory consumer hardware
Unsloth Qwen3.6-35B-A3B-MTP GGUF
Suggested model artifact for a local coding agent setup

Harnesses and tooling

noperator constrained-generation gist
Example of allowing free reasoning then forcing JSON output after think tags
VibeHarness GitHub repository
Barebones tool harness built around the model with constrained output
Crush coding agent
One of the local coding agents people reported using with Qwen models
nocodo GitHub repository
Example agent framework that uses very small models with heavy prompt and handler logic instead of tool calling

Benchmarks and evaluations

Gertlabs agentic coding rankings
Cited to support claims about Qwen 3.6 27B performance on Kotlin agentic coding
Archived Gertlabs rankings page
Archive of the same coding benchmark results page
Will it Mythos? security benchmark post
Used to argue the model performs poorly on security bug hunting

Prompt examples and demos

Pelican output image
Example showing that careful prompting can improve the model’s SVG-style output somewhat
Pelican prompt text
Prompt used to get the improved pelican result
Yogthos post on coding harness expectations
Referenced for the claim that model performance depends heavily on the surrounding harness

Background references from side discussions

Transport for NSW Wrong Way Go Back sign
Used in a side analogy about driving, literacy, and embedded knowledge
Wikipedia no entry sign page
Cited in the same driving analogy to show many road signs are language-independent
Wikipedia no U-turn signs page
Referenced in the back-and-forth over whether reading is required for safe driving
Dog driving a car video
Shared as a joking but relevant counterexample in the driving analogy