GLM-5.2 is the new leading open weights model on Artificial Analysis

AI
Open Source
Developer Tools
Infrastructure

The linked post is a benchmark report from Artificial Analysis stating that GLM-5.2, a new open-weights model from Z.ai, now leads the site’s open-weights rankings and sits unusually close to top closed models on coding and general intelligence charts. In practice, people who had already tried it said the headline is directionally right. GLM-5.2 looks like a meaningful step up from prior Chinese open models and lands somewhere around older Opus-tier performance for many coding tasks. Several people said it is the first open model they would seriously compare to frontier closed models instead of treating it as a budget compromise.

The excitement came with hard caveats. The biggest one was reasoning efficiency. Users repeatedly reported very long thinking traces, slow starts, and expensive token burn, especially on max effort. A common view was that GLM-5.2 is smart enough to matter, but not yet disciplined enough to be a clean everyday workhorse. Many said the better comparison is not raw benchmark score but intelligence per dollar, per second, and per completed task in a real coding harness. On those dimensions, GPT-5.5 still got the most respect, DeepSeek kept coming up as the value play, and several people argued that benchmark leaders often feel worse in actual multi-turn agent workflows than cheaper models with better tool use and shorter loops.

That fed into a broader pattern. People trusted the result more than the usual benchmark hype, but not enough to stop checking other evals. DeepSWE, bespoke codebase tests, bug-finding benchmarks, 3D modeling tasks, and personal coding harnesses all painted a more mixed picture. GLM-5.2 looks strong on coding and notably cautious about hallucinating, but it is not clearly dominant once you care about long-horizon work, agent behavior, or multimodal tasks. The missing vision support was one of the most repeated product gaps. For teams using screenshots, UI iteration, design reproduction, or mixed media workflows, text-only is a real handicap even if a separate vision model can be bolted on.

The other major theme was distribution versus capability. Because the weights are open, people saw a path to cheaper subscriptions, alternate providers, self-hosting, and more privacy-sensitive deployments. But the actual product layer around GLM was widely described as messy. API outages, timeout issues, confusing quotas, provider quality differences, stealth quantization, and setup friction all blunted the “open model victory” story.

The resulting consensus was sharp: GLM-5.2 is a real advance and a sign that open-weight coding models are now only months behind the frontier, not years. The bottleneck has shifted from raw model quality to reliability, multimodality, harness fit, and whether the surrounding service is good enough to trust in production.

Treat GLM-5.2 as a serious new option for coding workflows, especially if you want open weights or lower-cost access near the frontier. Do not assume the benchmark win translates into the best day-to-day product yet. Test it in your own harness for long tasks, multimodal work, and quota behavior before standardizing on it.
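
One way to make the "per dollar, per second, and per completed task" comparison concrete is to normalize harness results by the number of tasks that actually finished. The following is a minimal sketch, assuming you already log attempts, completions, spend, and wall-clock time per model. The `RunStats` type, the field names, and the numbers are hypothetical placeholders, not measurements of GLM-5.2 or any other model.

```python
from dataclasses import dataclass


@dataclass
class RunStats:
    """Aggregate results for one model over the same task set (hypothetical schema)."""
    name: str
    tasks_attempted: int
    tasks_completed: int
    total_cost_usd: float
    total_seconds: float


def per_completed_task(stats: RunStats) -> dict:
    """Return success rate plus cost and wall-clock time per completed task."""
    if stats.tasks_completed == 0:
        # Nothing finished, so per-task cost is undefined rather than zero.
        return {"success_rate": 0.0, "usd_per_task": None, "seconds_per_task": None}
    return {
        "success_rate": stats.tasks_completed / stats.tasks_attempted,
        "usd_per_task": stats.total_cost_usd / stats.tasks_completed,
        "seconds_per_task": stats.total_seconds / stats.tasks_completed,
    }


# Placeholder numbers for illustration only, not real benchmark results.
runs = [
    RunStats("model_a", tasks_attempted=20, tasks_completed=16,
             total_cost_usd=8.0, total_seconds=4800.0),
    RunStats("model_b", tasks_attempted=20, tasks_completed=12,
             total_cost_usd=3.0, total_seconds=1800.0),
]

for run in runs:
    print(run.name, per_completed_task(run))
```

Dividing by completed tasks rather than attempted tasks means a model that fails often pays for its failures in the per-task figures, which is closer to how the cost shows up in practice.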

June 17, 2026
artificialanalysis.ai
Discuss on HN

Discussion mood

Excited but skeptical. People were impressed that an open-weights model is now credibly near frontier closed models for coding, but the mood stayed grounded because real use exposed slow reasoning, high token burn, no vision, API instability, and benchmark results that do not cleanly map to multi-turn agent workflows.

Key insights

Reasoning verbosity is the main drag

What held GLM-5.2 back in actual use was not raw ability but how wastefully it spends tokens getting there. People described 15-minute waits, 40k-plus reasoning traces, and repeated self-doubt loops before writing code. Several said the high setting preserves most of the quality while cutting cost and latency sharply, which makes the max setting feel more like benchmark mode than a default for real work.

Benchmark wins on max effort can hide a product problem. If you evaluate this model, test the lower reasoning settings first and track time to first useful output, not just final answer quality.

Attribution:

Tiberium #1
benjiro29 #1
h14h #1
esafak #1
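
For the advice above about tracking time to first useful output, a small sketch of the bookkeeping might look like the following. It assumes your harness can record each streamed chunk with a timestamp, a channel label, and a token count. The `Chunk` type and the `"reasoning"` and `"answer"` labels are hypothetical and do not correspond to any specific provider's API.

```python
from typing import Iterable, NamedTuple, Optional


class Chunk(NamedTuple):
    """One streamed chunk (hypothetical shape for illustration)."""
    t: float        # seconds since the request was sent
    channel: str    # "reasoning" or "answer"
    tokens: int     # tokens contained in this chunk


def summarize_stream(chunks: Iterable[Chunk]) -> dict:
    """Report how long the user waited for answer text and how many tokens went to reasoning."""
    reasoning_tokens = 0
    answer_tokens = 0
    first_answer_at: Optional[float] = None
    for chunk in chunks:
        if chunk.channel == "reasoning":
            reasoning_tokens += chunk.tokens
        elif chunk.channel == "answer":
            answer_tokens += chunk.tokens
            if first_answer_at is None:
                first_answer_at = chunk.t
    total = reasoning_tokens + answer_tokens
    return {
        "seconds_to_first_answer": first_answer_at,
        "reasoning_tokens": reasoning_tokens,
        "answer_tokens": answer_tokens,
        "reasoning_share": reasoning_tokens / total if total else 0.0,
    }


# Made-up trace: two reasoning chunks, then the answer starts at 42 seconds.
trace = [
    Chunk(0.5, "reasoning", 1200),
    Chunk(30.0, "reasoning", 2800),
    Chunk(42.0, "answer", 150),
    Chunk(45.0, "answer", 350),
]
print(summarize_stream(trace))
```

Running the same prompts at each reasoning setting and comparing these summaries alongside answer quality gives a rough view of what the extra reasoning effort costs.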

Coding benchmarks are measuring a narrow slice

Artificial Analysis got pushback because its coding index draws on only two benchmarks, and several people said that misses the parts of coding work that matter most in production. DeepSWE, harness-specific results, and personal codebase tests suggest tool use, long-horizon planning, and agent loop behavior can reshuffle rankings a lot. The same model can look excellent in a benchmark and still feel mediocre inside Cursor CLI, Codex, Claude Code, or a custom harness.

Do not buy into a benchmark label like "best coding model" without testing it in your exact agent stack. Harness choice and workflow shape can move a model from top-tier to frustrating very quickly.

Attribution:

sosodev #1
ttul #1
lukewarm707 #1
cmrdporcupine #1

No vision support limits practical coding use

The lack of image input kept coming up because coding work is no longer just text. Rebuilding a UI from a screenshot, checking layout regressions, reviewing generated documents, and iterating on visual assets are now standard tasks. People said you can patch around this with a separate vision subagent or a model like Gemma 4 or Kimi, but that adds orchestration complexity and loses the tight feedback loop multimodal models provide.

If your team works from screenshots, mockups, PDFs, or rendered outputs, treat GLM-5.2 as incomplete on its own. Plan for a multimodal companion model or skip it for those workflows.

Attribution:

simonw #1
_pdp_ #1
x3cca #1
adrian_b #1

Provider quality is now part of model quality

Once weights are open, the model name stops being the whole story. People warned that third-party hosts may run quantized variants, cut corners on KV cache precision, or expose buggy APIs that change perceived quality by 20 to 40 percent. Moonshot’s vendor verifiers were cited as the kind of infrastructure open models now need, because a cheap endpoint can quietly become a different product than the benchmarked model.

When comparing open-model providers, verify the exact deployment before drawing conclusions about the model itself. Bad hosting can erase most of the advantage and make benchmark results meaningless.

Attribution:

CuriouslyC #1
thehamkercat #1
stanac #1
scrlk #1
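
As a rough illustration of checking a deployment before judging the model, the sketch below compares a provider's outputs against outputs from a deployment you trust, using the same prompts. It is not how the Moonshot verifiers work. It assumes both sides were run with deterministic (greedy) decoding, and even then small numeric differences across hardware can change outputs, so treat low agreement as a prompt to investigate rather than proof of quantization. The function name and prompt IDs are hypothetical.

```python
from difflib import SequenceMatcher


def agreement_report(reference: dict[str, str], candidate: dict[str, str]) -> dict:
    """Compare candidate outputs with reference outputs for the shared prompt IDs."""
    shared = sorted(set(reference) & set(candidate))
    if not shared:
        raise ValueError("no prompt IDs in common")
    exact = 0
    similarities = []
    for prompt_id in shared:
        ref, cand = reference[prompt_id], candidate[prompt_id]
        if ref == cand:
            exact += 1
        # Character-level similarity in [0, 1]; 1.0 means identical text.
        similarities.append(SequenceMatcher(None, ref, cand).ratio())
    return {
        "prompts_compared": len(shared),
        "exact_match_rate": exact / len(shared),
        "mean_similarity": sum(similarities) / len(similarities),
    }


# Toy outputs for illustration only.
reference_outputs = {"p1": "def add(a, b):\n    return a + b", "p2": "42"}
candidate_outputs = {"p1": "def add(a, b):\n    return a + b", "p2": "41"}
print(agreement_report(reference_outputs, candidate_outputs))
```

A larger, fixed prompt set and a task-level score (for example, pass rate on the same tests) would give a more meaningful signal than text similarity alone.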

Self-hosting demand is real but still enterprise-only

Several people said medium and large businesses are already buying hardware for local inference, especially in Europe and in regulated environments where sending code or documents to OpenAI or Anthropic is a non-starter. The catch is that a near-lossless deployment of models in this class still means serious hardware budgets, uneven utilization, and operational overhead. The thread treated self-hosting less as a hobbyist path than as a privacy and procurement decision for organizations with specific constraints.

Open weights now create a credible procurement alternative for regulated or privacy-sensitive teams, but this is still a budget and ops project. For most companies, hosted open models are the practical bridge before full self-hosting.

Attribution:

wongarsu #1 #2
MikhailTal #1
petesergeant #1

It may be stronger on epistemic caution

One notable bright spot was GLM-5.2's performance on non-hallucination and "I don't know" style behavior. People read that as a sign the model is more willing than peers to avoid bluffing when uncertain. That trait fit anecdotes describing it as cautious and stable, even from users who still preferred other models for overall coding speed or breadth.

If your workflow punishes confident wrong answers more than slower answers, GLM-5.2 may deserve extra attention. Its value may be highest in review, research, and risk-sensitive coding tasks rather than pure speed runs.

Attribution:

wongarsu #1
creamyhorror #1
ashenke #1

Against the grain

The leap may be overstated

Some people argued the celebration is getting ahead of the evidence. On stronger agent-style evals like DeepSWE, GLM-5.2 still appears meaningfully behind GPT-5.5 and likely Fable, which makes "frontier-level" sound bigger than the gap really is. Separate bug-finding tests also placed it closer to Qwen 3.7 Max than to the very top closed models.

Frame GLM-5.2 as a strong open model, not a clean replacement for the best closed systems. If your business depends on the last stretch of long-horizon coding performance, keep frontier closed models in the loop.

Attribution:

maxdo #1
mrngld #1
redbell #1

Code quality still favors Anthropic or GPT

Not everyone accepted that benchmark proximity means equal output quality. Some said GLM can draft well but still needs stronger reviewers, while others kept preferring Claude for readability and UI work or GPT for diligence around tests, race conditions, and failure cases. The split was less about one-shot correctness and more about whether the model writes software you actually want to maintain.

Judge with your code review standards, not just pass rates. If maintainability, testing habits, or UI polish matter, run side-by-side reviews before swapping out your main coding model.

Attribution:

CuriouslyC #1
andai #1
elwebmaster #1
nwienert #1

Cheap model, weak service layer

A recurring objection was that the price-performance story collapses when the official service is slow, rate-limited, or opaque. Several subscribers said they burned through quota faster than expected or hit timeouts often enough that the model became hard to rely on. In that view, the open weights are promising, but Z.ai’s current product experience still trails Claude and OpenAI enough to blunt adoption.

Separate the model from the vendor. You may want the weights without wanting the official API or subscription, especially for team-wide deployment.

Attribution:

nh43215rgb #1
davidwritesbugs #1
Havoc #1
robertwt7 #1

In plain English

Artificial Analysis

A benchmarking and analysis site that compares AI models across tasks like coding, reasoning, and cost.

CLI

Command-line interface, a text-based way to control software by typing commands.

DeepSWE

A benchmark mentioned in comments for measuring AI performance on software engineering tasks.

GLM-5.2

A large language model from the GLM family that commenters describe as open-weight and MIT-licensed.

GPT-5.5

An OpenAI model family mentioned as a top competitor in coding and reasoning efficiency.

harness

The software layer around a model that manages prompts, tools, memory, files, system instructions, and agent behavior.

KV cache

Key-value cache, the stored attention state from previous tokens that lets a model generate long outputs or continue long contexts more efficiently.

multimodal

Able to work across more than one kind of input or output, such as text and images together.

open weights

A model release that includes the trained parameters, allowing others to run or fine-tune it themselves.

Opus

A high-end Claude model line from Anthropic that commenters use as a reference point for top cloud coding performance.

quantization

A technique that reduces the numerical precision of model weights to cut memory use and often speed up inference, usually at some quality cost.

tool use

A model’s ability to call external tools like search, shell commands, test runners, or browsers while solving a task.

Reference links

Benchmarks and rankings

Artificial Analysis article on GLM-5.2
The main submitted benchmark report claiming GLM-5.2 leads open-weight models on the site’s intelligence index
Artificial Analysis coding agents chart
Used to argue that cost per task and coding-agent performance tell a more mixed story than the main index
DeepSWE benchmark
Cited as a more realistic benchmark for long-horizon software engineering work
Gertlabs rankings
Independent ranking referenced as placing GLM-5.2 around or above Opus 4.6
Will It Mythos benchmark
A small bug-finding benchmark that placed GLM-5.2 behind several other models on that specific task
AI Benchy comparison for DeepSeek V4 Flash vs GLM-5.2
Shared as evidence that DeepSeek V4 Flash may be more cost-effective on some workloads
AI Benchy comparison for Opus 4.7 vs GLM-5.2
Referenced to support the claim that GLM-5.2 is near Opus 4.7 level

Model docs and tooling

Z.ai Claude Code compatibility guide
Shows how to point Claude Code at Z.ai-hosted models
DeepSeek Anthropic API compatibility guide
Mentioned as a similar trick for using DeepSeek through Anthropic-compatible tooling
ZCode app
Presented as an easier installer-like path for using Z.ai models
Z.ai usage instructions
Cited to explain quota multipliers and why some subscribers hit limits quickly

Provider verification and hosting

Moonshot K2 Vendor Verifier
Example of a tool for checking whether third-party providers are serving an unmodified model
Moonshot Kimi Vendor Verifier
A broader verifier for Kimi-family deployments, cited as a model for open-provider transparency
OpenRouter GLM-5.2 page
Referenced while comparing official and third-party pricing and deployment details
NahCrof pricing
Used as an example of a cheaper third-party GLM-5.2 provider

Research papers

Paper on reducing waffling reasoning tokens
Shared to support the idea that suppressing indecisive reasoning tokens can improve performance and speed
Intermediate tokens may not have semantic import
Cited in a subthread about whether hidden reasoning traces are necessary for distillation
Inverted Reasoning Traces
Referenced as a method for recovering useful reasoning supervision without direct chain-of-thought access

Related projects and discussions

aa-eval-email repository
A script that pulls and sorts Artificial Analysis benchmark data
Kasra blog on LLMs hacking an app
Quoted as an example of frustrating outages with MiniMax and GLM APIs
GrandpaCAD
The 3D modeling software used in a custom benchmark comparing GLM against Gemini models