HN Debrief

GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2

  • AI
  • Developer Tools
  • Open Source

The post uses Artificial Analysis’s AA-Omniscience benchmark to claim that very large models are becoming less trustworthy than smaller ones, with GPT-5.5 and DeepSeek V4 Pro allegedly hallucinating far more often than MIT-licensed GLM-5.2. It then stretches that into a broader thesis: that scaling parameter count and data has plateaued, and that smaller models plus better training are now the real path forward.

The strongest reaction was that this overreads the benchmark. Several people pointed out that the AA-Omniscience hallucination rate is measured only on questions a model fails to answer correctly, so it mostly captures whether the model abstains or confidently guesses when it is already in trouble. That makes it useful for measuring refusal policy and calibration, but weak as stand-alone evidence that a smaller model is more truthful overall or that larger models are getting worse in general.

Treat hallucination leaderboards as policy signals, not as a full ranking of model quality. If you ship LLM features, optimize for refusal behavior, retrieval, and task-specific evals instead of assuming a model with a lower benchmark hallucination rate will perform better in production.

Discussion mood

Mostly skeptical of the article’s headline claim and framing. Readers found the benchmark interesting, but thought the post confused abstention behavior with overall model quality, mixed together separate claims about scaling and hallucination, and leaned too hard on one eval to declare a wall in model progress.

Key insights

  1. 01

    Hallucination rate is not overall error rate

    The metric looks only at cases where the model either abstains or gives a wrong answer, so its denominator changes from model to model. A model can look better on hallucination rate simply because it refuses more often, while another can answer far more questions correctly overall and still score worse because it is too willing to guess. That makes the benchmark a measure of calibration under uncertainty, not a clean measure of truthfulness.

    When you compare models, track at least three numbers separately: accuracy, abstention rate, and wrong-answer rate. If a vendor only shows a hallucination score, ask what happened to total correct answers.

      Attribution:
    • aesthesia #1 #2
    • hereme888 #1
  2. 02

    Labs are buying custom expert training data

    Pretraining on web text is no longer the whole story. Commenters with direct experience said labs are paying experts to create novel examples, adversarial prompts, and reusable grading rubrics aimed at specific model failure modes. The claim was that this is already a billion-dollar-scale market, with vendors like Mercor and others supplying specialized human data that functions more like active learning than generic corpus expansion.

    Expect moat-building to shift toward proprietary post-training data pipelines, not just bigger clusters. If you compete in an AI-heavy market, unique evaluation data and domain rubrics may matter more than model choice alone.

  3. 03

    Benchmarks teach models to guess

    Most public evals reward a correct answer and do not meaningfully punish a confident wrong one. Under that incentive, the rational strategy is to answer aggressively, because even a low-probability guess boosts expected score while “I don’t know” often earns nothing. AA-Omniscience is notable precisely because it penalizes wrong answers, which is why it surfaces a different ranking than mainstream accuracy boards.

    Align your internal evals with product risk. If wrong answers are costly in your domain, score abstentions better than plausible nonsense and tune your models against that objective.

      Attribution:
    • wongarsu #1
    • jampekka #1
    • stalfie #1
  4. 04

    Grounding and retrieval still beat pure model bravado

    Several comments converged on the same operational point: many bad answers come from models speaking past their knowledge cutoff or skipping retrieval when they should search. Better harness design, tool use, and evidence-passing can reduce obvious failures, but chaining agents does not magically solve the problem because verification steps can amplify earlier mistakes if they lose the original question or verify the wrong claim.

    Focus engineering effort on search, citations, and passing source context through every step of an agent workflow. Do not assume a second model or longer reasoning trace automatically adds reliability.

      Attribution:
    • embedding-shape #1 #2 #3
    • techpression #1
    • sudosysgen #1
    • reinitctxoffset #1
  5. 05

    This looks like the old knowledge bottleneck again

    The push to hire experts in every domain, create new training examples, and peer review edge cases reminded some readers of expert systems and the classic knowledge acquisition bottleneck. The point is not that the current approach is useless. It is that patching weaknesses one domain at a time can become structurally expensive and may never drive hallucination close to zero across open-ended tasks.

    Be wary of roadmaps that assume reliability will improve uniformly across domains. In regulated or specialized work, plan for a long tail of expensive, domain-specific supervision.

      Attribution:
    • YeGoblynQueenne #1
    • MattRogish #1
    • jmalicki #1
  6. 06

    LLM coding works best as reviewed acceleration

    The coding subthread did not settle on “LLM code is garbage.” It settled on a narrower rule. These systems are useful when a human can review the output, enforce guardrails, and keep the architecture sane. The main risk is not that generated code is uniquely broken. It is that teams can produce technical debt much faster while losing the human mental model that makes large systems maintainable.

    Use LLMs to increase throughput, not to remove ownership. Require reviews, tests, and architecture constraints before you let faster code generation turn into faster debt generation.

      Attribution:
    • gymbeaux #1
    • andybak #1
    • embedding-shape #1
    • xvinci #1
    • ben_w #1
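
To make the first and third insights above concrete, here is a minimal Python sketch of the "track three numbers separately" advice. It assumes the hallucination rate is wrong answers divided by all not-correct responses, as the thread describes it. The counts for the two models are invented for illustration and are not benchmark results. The names EvalCounts and summarize, and the penalized_score formula, are hypothetical and are not the scoring used by Artificial Analysis.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalCounts:
    """Outcome counts for one model on one question set (illustrative only)."""
    correct: int
    wrong: int
    abstained: int

    @property
    def total(self) -> int:
        return self.correct + self.wrong + self.abstained


def summarize(counts: EvalCounts, wrong_penalty: float = 1.0) -> dict[str, float]:
    """Report the rates separately, plus an assumed penalized score.

    hallucination_rate uses only not-correct responses as its denominator,
    so that denominator differs from model to model.
    penalized_score is a hypothetical internal metric: +1 per correct answer,
    -wrong_penalty per wrong answer, 0 per abstention, averaged over all questions.
    """
    if counts.total == 0:
        raise ValueError("no questions were scored")
    not_correct = counts.wrong + counts.abstained
    return {
        "accuracy": counts.correct / counts.total,
        "abstention_rate": counts.abstained / counts.total,
        "wrong_answer_rate": counts.wrong / counts.total,
        "hallucination_rate": counts.wrong / not_correct if not_correct else 0.0,
        "penalized_score": (counts.correct - wrong_penalty * counts.wrong) / counts.total,
    }


# Made-up counts: one model that abstains often, one that answers almost everything.
cautious = EvalCounts(correct=40, wrong=12, abstained=48)
eager = EvalCounts(correct=70, wrong=24, abstained=6)

for name, counts in (("cautious", cautious), ("eager", eager)):
    for penalty in (1.0, 3.0):
        print(name, "penalty", penalty, summarize(counts, wrong_penalty=penalty))
```

With these made-up counts, the eager model has four times the hallucination rate of the cautious one while also having much higher accuracy. Which one scores better on the penalized metric depends on wrong_penalty: the eager model leads at 1.0 and the cautious model leads at 3.0. That is the reason to set the penalty from the cost of a wrong answer in your own domain rather than from a single leaderboard number.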

Against the grain

  1. 01

    Fable showed real capability jumps in practice

    A few firsthand reports pushed back on the idea that intelligence has plateaued. One detailed example described Fable catching subtle physics and relativity errors across a novella-length manuscript that Gemini, Claude, and ChatGPT all missed. Another reported that Fable stayed technically deep even in fiction-writing mode, where many models become glib and shallow.

    Do not let one benchmark collapse your model strategy. For high-value workflows, keep live bake-offs running because some capability jumps only show up in realistic tasks, not leaderboard slices.

      Attribution:
    • gcanyon #1
    • Bolwin #1
  2. 02

    User experience can diverge from the benchmark story

    Several anecdotes said GPT-5.5 felt worse than older Codex variants for coding, while others reported the opposite and found GPT-5.5 stronger on long autonomous tasks. That disagreement cuts against any simple reading of the benchmark. Real performance is being shaped by harnesses, prompting, fine-tuned variants, and workload type as much as by the base model itself.

    Evaluate the full stack you actually use, including model variant, agent harness, and prompts. A benchmark result on the base model will not tell you which setup your team will prefer day to day.

      Attribution:
    • wiether #1 #2
    • oshrimpton #1
  3. 03

    Big models still dominate outside narrow evals

    Some commenters rejected the whole premise that smaller models are catching up in any broad sense. Their view was that open models can look excellent on selected benchmarks, but still fall off faster on messy, general tasks where frontier models retain the unmistakable “big model smell” of wider competence. They also noted that policy events like the Fable restriction do not say anything useful about whether scale is still producing capability gains.

    If your product depends on broad reasoning across unpredictable tasks, do not infer parity from one open-model win on one reliability metric. Test on messy real workloads before you trade down for cost or openness.

      Attribution:
    • czk #1
    • bilater #1
    • hyperpape #1

In plain English

AA-Omniscience
An Artificial Analysis benchmark that tests whether a model answers correctly, abstains, or hallucinates when asked knowledge questions.
Active learning
A training approach where you deliberately collect the most informative new examples, often near the model’s decision boundary.
GLM-5.2
A large language model from the GLM family that commenters describe as open-weight and MIT-licensed.
Knowledge cutoff
The latest point in time covered by a model’s training data, after which it may not know newer facts unless given external information.
LLM
Large language model, a machine learning system trained to generate and analyze text, including source code.
MIT-licensed
Released under the MIT License, a permissive open source software license that allows broad reuse.
Retrieval
A system that fetches relevant documents or facts from outside the model and supplies them as context before answering.

Reference links

Training methods and papers

  • OpenAI scaling laws paper
    Cited to support the claim that scaling model size and data yields diminishing returns rather than linear gains.
  • Rubrics as rewards paper
    Shared as an example of reusable human-written rubrics used to judge and train model responses during post-training.
  • MS-MARCO dataset
    Referenced in an experiment that tried to train a model to say 'No answer present' and instead made hallucination worse.

Companies and projects

  • VibeThinker
    Shared as an example of a model trained densely on math that reportedly performs above what its size would suggest.

Interviews and media

  • Sam Altman interview clip
    Cited to argue that model vendors market these systems as far more than plausible-text generators.