GPT-5.5 hallucinates 3x more than MIT-licensed GLM-5.2
- AI
- Developer Tools
- Open Source
The post uses Artificial Analysis’s AA-Omniscience benchmark to argue that very large models are becoming less trustworthy than smaller ones, with GPT-5.5 and DeepSeek V4 Pro allegedly hallucinating far more often than MIT-licensed GLM-5.2. It then stretches that into a broader thesis: scaling parameter count and data has plateaued, and smaller models with better training are now the real path forward. The strongest reaction was that this overreads the benchmark. Several people pointed out that AA-Omniscience’s hallucination rate is measured only on questions a model fails to answer correctly, so it mostly captures whether the model abstains or confidently guesses when it is already in trouble. That makes it useful for measuring refusal policy and calibration, but weak evidence on its own that a smaller model is more truthful overall or that larger models are getting worse in general.
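To make the objection concrete, here is a minimal sketch of how such a metric can diverge from overall accuracy. It assumes, following the description above, that the hallucination rate is the share of wrong answers among all questions the model did not answer correctly (wrong plus abstained). The `Tally` class, the two model labels, and every count are hypothetical illustrations, not AA-Omniscience results.

```python
from dataclasses import dataclass


@dataclass
class Tally:
    """Hypothetical per-model counts on a 100-question knowledge quiz."""
    correct: int
    incorrect: int   # confident wrong answers
    abstained: int   # "I don't know" style responses

    @property
    def total(self) -> int:
        return self.correct + self.incorrect + self.abstained

    def accuracy(self) -> float:
        # Share of all questions answered correctly.
        return self.correct / self.total

    def hallucination_rate(self) -> float:
        # Assumed definition: wrong answers divided by every question
        # the model failed to answer correctly (wrong + abstained).
        missed = self.incorrect + self.abstained
        return self.incorrect / missed if missed else 0.0

    def wrong_per_question(self) -> float:
        # Share of all questions that received a wrong answer.
        return self.incorrect / self.total


# Made-up numbers: a large model that rarely abstains and a small
# model that abstains often.
large = Tally(correct=80, incorrect=16, abstained=4)
small = Tally(correct=30, incorrect=21, abstained=49)

for name, t in (("large", large), ("small", small)):
    print(
        f"{name}: accuracy={t.accuracy():.2f} "
        f"hallucination_rate={t.hallucination_rate():.2f} "
        f"wrong_per_question={t.wrong_per_question():.2f}"
    )
```

With these invented counts the small model has the lower hallucination rate (0.30 against 0.80) while also being less accurate and giving more wrong answers per question (0.21 against 0.16). The rate alone cannot say which model is more truthful overall.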
Treat hallucination leaderboards as signals of refusal policy and calibration, not as a full ranking of model quality. If you ship LLM features, invest in refusal behavior, retrieval, and task-specific evals instead of assuming that a model with a lower benchmark hallucination rate will perform better in production.
- arrowtsx.dev
- Discuss on HN