GLM 5.2 beats Claude in our benchmarks

AI
Security
Developer Tools
Open Source
Infrastructure

Semgrep’s post says its internal benchmark for IDOR bugs found GLM 5.2 ahead of Claude Opus 4.8 and GPT-5.5, with a much lower cost per vulnerability. The setup matters. This is not a general intelligence claim and not even a broad security claim. It is a narrow test around finding one relatively approachable class of web app bug in known open-source projects, using Semgrep’s own harness and scoring. That framing shaped most of the reaction.

The dominant read was that GLM 5.2 is clearly a serious model, especially on price-performance. People with hands-on use said it feels strong for day-to-day coding, fast, cheap, and less refusal-prone than Anthropic’s public offerings. Several people also placed it near the top of the current open-model pack rather than at the absolute frontier, with DeepSeek V4 Pro, Kimi, and others still competitive depending on the task. A recurring caveat was that Chinese labs often look better on public benchmarks than they do on private evals, so the main signal here is not “GLM beats the best closed model everywhere.” It is “GLM is good enough to matter, and cheap enough to change workflow decisions.”

The sharpest criticism was about comparison hygiene. Semgrep’s headline says “beats Claude,” but the article is really comparing against Claude Code or public Opus variants under safety constraints, not some pure model capability. Several commenters argued that this likely measures product-layer refusals and harness choices as much as raw model skill. Others pointed out that Anthropic’s own Mythos messaging emphasized exploit generation more than vuln discovery, so a benchmark that only measures finding bugs does not establish a true open-weight replacement for the withheld cyber systems. People were also skeptical of odd version results like Opus 4.6 scoring above newer Opus releases, and of any benchmark built by a company that sells into the same problem.

A separate practical thread landed on deployment economics. Running a 753B model locally at useful speed means heavy quantization or a six-figure multi-GPU box. For almost everyone, hosted inference wins on cost unless you need air-gapped deployment, stronger privacy, or access to uncensored models.

That made the useful takeaway pretty concrete. Open-weight frontier-adjacent models are becoming operationally relevant long before they are convenient to self-host, and access policy may matter as much as benchmark rank. Many readers, especially outside the US, care less about who is nominally best than about which model they can actually rely on tomorrow.

Treat GLM 5.2 as a real option for coding and vuln-finding workflows today, especially if price, access, or refusal behavior are your constraints. Do not read this result as proof that open models have closed the overall frontier gap, and do not buy a local serving stack unless privacy or policy requirements justify a very expensive hobby or compliance project.

June 28, 2026
semgrep.dev

Key insights

Cheap frontier-adjacent, not outright best

GLM 5.2 looks strongest when you price it as a workhorse, not when you crown it the best model available. Private evaluations and bug-hunting tests put it close to top closed models and among the best open ones, but still behind in some settings. That changes the reading of Semgrep’s post. The practical win is cost-adjusted usefulness, not a clean capability lead.

Benchmark your actual workflow against GLM 5.2 before defaulting to premium closed models. If it is within striking distance on quality, the price gap alone can justify switching routine coding and triage tasks.

Attribution:

gertlabs #1
SwellJoe #1
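
One way to make "cost-adjusted usefulness" concrete is to divide what a run cost by the number of findings a human confirmed. The sketch below assumes you log token counts per run and know your provider's per-million-token prices. The `RunTotals` type, its field names, and every number in it are hypothetical placeholders, not figures from Semgrep's post or any price list.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTotals:
    """Totals from one benchmark run. Every field is a hypothetical input."""
    input_tokens: int
    output_tokens: int
    usd_per_m_input: float   # assumed price per million input tokens
    usd_per_m_output: float  # assumed price per million output tokens
    confirmed_findings: int  # findings a reviewer verified as real bugs


def total_cost_usd(run: RunTotals) -> float:
    """Token spend for the whole run, in US dollars."""
    return (run.input_tokens * run.usd_per_m_input
            + run.output_tokens * run.usd_per_m_output) / 1_000_000


def cost_per_finding(run: RunTotals) -> float:
    """Spend divided by confirmed findings; infinite if nothing was confirmed."""
    if run.confirmed_findings == 0:
        return float("inf")
    return total_cost_usd(run) / run.confirmed_findings


# Placeholder numbers chosen only to illustrate the arithmetic.
cheap = RunTotals(40_000_000, 5_000_000, 1.0, 4.0, confirmed_findings=20)
premium = RunTotals(40_000_000, 5_000_000, 10.0, 40.0, confirmed_findings=25)

print(cost_per_finding(cheap), cost_per_finding(premium))
```

In this made-up case the premium model confirms more bugs in total but costs several times more per confirmed bug, which is the shape of trade-off the discussion describes.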

Local GLM is mostly a compliance play

Serving a 753B model locally at decent throughput is a data center project, not a laptop experiment. People quoted 8 to 16 RTX 6000-class GPUs, heavy quantization, painful PCIe constraints, and total system costs well above $100,000. The economics only start to make sense when privacy rules, air-gapped environments, or uncensored use matter more than token cost.

Keep hosted inference as the default. Only budget for self-hosting if legal, privacy, or reliability requirements are strong enough to survive a six-figure procurement review.

Attribution:

bArray #1
dakolli #1
Aurornis #1
CamperBob2 #1
wonnage #1
rekttrader #1
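
The hardware figures in this card can be sanity-checked with back-of-the-envelope arithmetic. The sketch below counts memory for weights alone at a given precision and divides by per-card VRAM. The 96 GB card size is an assumption (the thread did not say which card or precision people had in mind), and `overhead` is a hypothetical allowance for KV cache and activations.

```python
import math

PARAMS = 753e9  # parameter count quoted in the discussion


def weight_gb(params: float, bits_per_weight: int) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


def gpus_needed(params: float, bits_per_weight: int,
                gpu_vram_gb: float, overhead: float = 0.0) -> int:
    """Smallest GPU count whose combined VRAM holds weights plus overhead.

    `overhead` is a hypothetical fraction reserved for KV cache and
    activations; 0.0 means a weights-only lower bound.
    """
    needed_gb = weight_gb(params, bits_per_weight) * (1.0 + overhead)
    return math.ceil(needed_gb / gpu_vram_gb)


ASSUMED_VRAM_GB = 96  # assumption: substitute your own card's VRAM

for bits in (16, 8, 4):
    print(bits, weight_gb(PARAMS, bits), gpus_needed(PARAMS, bits, ASSUMED_VRAM_GB))
```

Under these assumptions the weights-only lower bound is 16 cards at 16-bit, 8 at 8-bit, and 4 at 4-bit. Any real deployment needs headroom on top of that, so treat the output as a floor, not a quote.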

Giving the model Semgrep as a tool can hurt

One hands-on evaluator found that wiring the open-source Semgrep scanner into model workflows did not improve results and sometimes made them worse. The explanation is straightforward. The model now has to learn the scanner interface while also doing bug hunting, and many models do both jobs badly at once. Better harnesses may hide that complexity, but the tool itself is not an automatic gain.

Do not assume adding more security tools to an agent improves outcomes. Measure whether the tool output is actually helping the model, or whether you are just adding interface burden and extra tokens.

Attribution:

SwellJoe #1
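
Measuring whether a tool helps comes down to an ablation: run the same model and prompt with and without the tool against a repository whose bugs are already labeled, then compare precision and recall. The sketch below assumes findings can be reduced to comparable IDs. The bug IDs and both result sets are hypothetical.

```python
def precision_recall(reported: set[str], truth: set[str]) -> tuple[float, float]:
    """Return (precision, recall) of reported findings against labeled bugs."""
    true_pos = len(reported & truth)
    precision = true_pos / len(reported) if reported else 0.0
    recall = true_pos / len(truth) if truth else 0.0
    return precision, recall


# Hypothetical labels: IDs of known bugs in a test repository.
truth = {"bug-1", "bug-2", "bug-3", "bug-4"}

# Hypothetical findings from the same model and prompt, with the scanner
# tool disabled and then enabled. "fp-*" entries are false positives.
without_tool = {"bug-1", "bug-2", "bug-3", "fp-a"}
with_tool = {"bug-1", "bug-2", "fp-a", "fp-b", "fp-c"}

base = precision_recall(without_tool, truth)
tool = precision_recall(with_tool, truth)

# The tool "helped" only if neither metric got worse and at least one improved.
tool_helped = tool[0] >= base[0] and tool[1] >= base[1] and tool != base
print(base, tool, tool_helped)
```

A single pair of runs is noisy, so in practice you would repeat each arm several times before trusting the comparison.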

Provider quality changes the verdict

Several people who liked GLM 5.2 stressed that the good experience depended on using an unquantized or well-served version. Others reported nonsense outputs or weaker performance from specific providers and suspected quantized deployments. That means “GLM 5.2 is good” is partly a statement about the serving stack, not just the weights.

Test providers, not just model names. Pin the exact endpoint and quantization level, and record observed latency, before rolling a model into production or making a cost-performance judgment.

Attribution:

pimeys #1
jackdawed #1
jeffnash #1
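
To keep provider differences visible, eval results can be keyed by the full serving configuration instead of the model name alone. The sketch below is one possible shape for that bookkeeping. `ServingConfig`, the provider labels, and the pass/fail records are all hypothetical.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class ServingConfig:
    """Identifies what was actually tested, not just which weights."""
    model: str
    provider: str      # hypothetical provider label
    quantization: str  # e.g. "bf16" or "int4", as reported by the provider


def pass_rates(runs: list[tuple[ServingConfig, bool]]) -> dict[ServingConfig, float]:
    """Pass rate per serving configuration, never merged across providers."""
    counts: dict[ServingConfig, list[int]] = defaultdict(lambda: [0, 0])
    for config, passed in runs:
        counts[config][0] += int(passed)  # passes
        counts[config][1] += 1            # total runs
    return {config: ok / total for config, (ok, total) in counts.items()}


# Hypothetical records: same model name, two different serving stacks.
full = ServingConfig("glm-5.2", "provider-a", "bf16")
quant = ServingConfig("glm-5.2", "provider-b", "int4")
runs = [(full, True), (full, True), (full, True), (full, False),
        (quant, True), (quant, False), (quant, False), (quant, False)]

rates = pass_rates(runs)
print(rates[full], rates[quant])
```

Averaging these two rows under the single label "glm-5.2" would hide exactly the difference commenters were reporting.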

Finding bugs is not the same as weaponizing them

Anthropic’s withheld Mythos system was described as notable for turning vulnerabilities into working exploits, not merely spotting likely bugs. Semgrep’s benchmark only measures detection. That is useful for secure development, but it does not answer the scarier question that drove the original Mythos debate.

If your use case includes exploitability analysis, proof-of-concept generation, or offensive testing, do not infer capability from vuln-detection scores alone. You need a separate eval for the part that actually raises risk.

Attribution:

dist-epoch #1
igregoryca #1

Access model shapes real-world cost

Public API pricing is only part of the story because subscription plans and tool access rules can distort what developers actually pay. Claude’s command-line and agent usage has been in flux, and commenters called out harness lock-in as a real constraint. That makes comparisons between hosted open models and flagship closed products messy. The cheapest path depends on whether you are chatting, automating, or embedding the model in your own tooling.

Price the whole workflow, not just tokens. Include subscription limits, CLI access, automation restrictions, and how much control you have over the harness before choosing a vendor.

Attribution:

horsawlarway #1
cortesoft #1
Onavo #1
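
A quick way to compare a flat subscription with metered API billing is to find the monthly token volume at which they cost the same. The sketch below assumes a single blended per-million-token price and an optional usage cap on the subscription. All function names, prices, and caps are hypothetical and do not describe any vendor's actual plan.

```python
from typing import Optional


def metered_cost_usd(tokens: int, usd_per_m_tokens: float) -> float:
    """Cost of a month's tokens on pay-as-you-go billing."""
    return tokens * usd_per_m_tokens / 1_000_000


def breakeven_tokens(subscription_usd: float, usd_per_m_tokens: float) -> float:
    """Monthly token volume at which metered billing equals the flat fee."""
    return subscription_usd / usd_per_m_tokens * 1_000_000


def cheaper_option(tokens: int, subscription_usd: float, usd_per_m_tokens: float,
                   subscription_token_cap: Optional[int] = None) -> str:
    """Pick the cheaper plan, treating a capped subscription as unusable past its cap."""
    if subscription_token_cap is not None and tokens > subscription_token_cap:
        return "metered"
    if metered_cost_usd(tokens, usd_per_m_tokens) < subscription_usd:
        return "metered"
    return "subscription"


# Placeholder plan: 100 USD per month flat versus 5 USD per million tokens.
print(breakeven_tokens(100.0, 5.0))
print(cheaper_option(10_000_000, 100.0, 5.0))
print(cheaper_option(50_000_000, 100.0, 5.0))
print(cheaper_option(50_000_000, 100.0, 5.0, subscription_token_cap=30_000_000))
```

The cap parameter is where access rules enter the calculation: a subscription that looks cheaper on paper stops being an option once automation pushes usage past what the plan allows.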

Against the grain

The benchmark mostly measures public guardrails

A credible pushback is that public Claude endpoints are intentionally constrained, especially for cyber tasks, and that a less restricted enterprise or special-access service would likely widen the gap again. On this view, GLM 5.2 is beating a product wrapper, not the underlying frontier capability. That weakens any headline about open weights catching closed models outright.

Separate model capability from product policy in your evals. If refusals are central to your use case, test the exact SKU and access tier you would actually buy, not the vendor brand name.

Attribution:

rode1974 #1
Art9681 #1 #2

IDOR results are too narrow

The harshest critics argued the whole claim is overblown because IDOR is an easier bug class and the benchmark says little about broader security work or general coding quality. They also noted that a small edge over a much larger or more expensive competitor can still leave the closed model ahead in the bigger picture. That reframes the post as a narrow product pitch, not a major state-of-the-art update.

Use this result as a signal about one task family, not as a procurement shortcut. Ask whether your security team mostly needs bug discovery in familiar web stacks or harder work like exploit generation, root-cause analysis, and remediation.

Attribution:

danslo #1
vlian2088 #1

Some users still find GLM unreliable

Not everyone saw the breakthrough. A few people reported GLM spiraling into nonsense or simply performing badly for their use cases, with uncertainty over whether the fault was the model or the provider. That does not erase the positive reports, but it does mean the model is not universally impressive out of the box.

Run a short bake-off on your own repositories before switching. A model that looks great on a benchmark can still fail on your stack, your prompts, or your provider’s deployment.

Attribution:

_s_a_m_ #1
csjh #1

In plain English

Air-gapped

Physically isolated from external networks, usually for security or compliance reasons.

API

Application programming interface, a way for software to access another service or model programmatically.

Claude

A family of AI chat models made by Anthropic.

Claude Code

Anthropic’s coding-focused agent product and command-line workflow built around Claude models.

DeepSeek V4 Pro

A model from DeepSeek that commenters cited as a strong open or open-access competitor on security tasks.

GLM 5.2

A large open-weight language model from the GLM family and the subject of Semgrep’s benchmark. Commenters put it at 753B parameters, which makes it demanding to serve locally.

GPT-5.5

A version of OpenAI’s GPT model family referenced in the benchmark comparison.

GPU

Graphics Processing Unit, a processor commonly used to train and run AI models because it can handle many calculations in parallel.

Harness

The surrounding software that prompts a model, gives it tools, manages subagents, and evaluates results.

IDOR

Insecure Direct Object Reference, a web security flaw where an app exposes data or actions by trusting user-controlled object identifiers without proper authorization checks.
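
As a minimal illustration of the pattern, the sketch below shows a lookup that trusts a caller-supplied ID next to one that checks ownership first. The `Invoice` type, the in-memory `INVOICES` store, and both function names are hypothetical and stand in for a real web handler and database.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    invoice_id: int
    owner_id: int
    total_usd: float


# Hypothetical in-memory store standing in for a database table.
INVOICES = {
    1: Invoice(1, owner_id=100, total_usd=25.0),
    2: Invoice(2, owner_id=200, total_usd=990.0),
}


def get_invoice_vulnerable(current_user_id: int, invoice_id: int) -> Invoice:
    # IDOR: the caller-supplied ID is trusted and ownership is never checked,
    # so any logged-in user can read any invoice by guessing its ID.
    return INVOICES[invoice_id]


def get_invoice_checked(current_user_id: int, invoice_id: int) -> Invoice:
    invoice = INVOICES.get(invoice_id)
    # Raise the same error for "missing" and "not yours" so that valid IDs
    # cannot be enumerated from the response.
    if invoice is None or invoice.owner_id != current_user_id:
        raise PermissionError("invoice not found")
    return invoice
```

Calling `get_invoice_vulnerable(100, 2)` hands user 100 an invoice owned by user 200, while `get_invoice_checked(100, 2)` raises `PermissionError`.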

Kimi

A family of models from Moonshot AI that commenters named as another strong coding and reasoning option.

Mythos

A withheld, cyber-focused Anthropic system discussed in relation to vulnerability discovery and exploit generation.

Open-weight

A model whose learned parameters are released so others can run or fine-tune it, even if the full original training code and data are not public.

PCIe

Peripheral Component Interconnect Express, the standard expansion bus used to connect devices like GPUs and NICs to a computer.

Quantized

Compressed to use fewer bits per weight so a model can run on cheaper hardware, usually with some tradeoff in quality or speed.

RTX 6000

A high-end Nvidia workstation GPU often used for professional AI workloads.

Semgrep

A code analysis tool used to scan source code for bugs and security issues, and also the company that published the benchmark.

Reference links

Benchmarks and evaluations

Semgrep blog post on GLM 5.2 vs Claude in cyber benchmarks
The submitted article and benchmark claim under discussion.
Gertlabs model rankings
Cited as a private evaluation showing GLM 5.2 close to Opus but not clearly the best overall.
Will it Mythos? security bug hunting benchmark
An independent benchmark used to compare GLM 5.2, DeepSeek, and other models on vulnerability finding.

Model weights and local deployment

GLM-5.2 on Hugging Face
Referenced for the model’s reported size and availability.
Unsloth GLM-5.2 docs
Shared as guidance for running quantized GLM 5.2 versions locally.
Antirez post about running GLM 5.2
Referenced as an example of someone running the model locally, with the caveat that it was quantized.
XCancel mirror of the Antirez post
Alternative link to the same post without using X directly.

Tools and providers

Nemesis8
Shared as a way to launch GLM-5.2 inside OpenCode using a containerized workflow.
Zed editor
Mentioned to clarify that Zed is the editor one commenter used with GLM 5.2.
Z.ai
Listed as a place to sign up for hosted GLM-5.2 access.
Neuralwatt
Mentioned as another provider for trying open-weight models like GLM 5.2.
