MTG Bench: Testing how well LLMs can play Magic

AI
Benchmarks
Games
Developer Tools

The post introduces MTG Bench, an experiment that uses Magic: The Gathering as a benchmark for language models. The appeal is obvious. Magic has dense rules, hidden state, sequencing, and lots of edge cases, so it looks like a rich test of whether an LLM can operate inside a real system instead of just answering questions. But the benchmark, as built, mostly checks whether a model can produce legal turns in a goldfish-style setup and keep its own tool calls consistent. It does not really test head-to-head play or deep strategy.

If you are evaluating agents in complex domains, separate legal action execution from game skill and score both explicitly. Also assume that harness design will dominate the result, especially when LLMs handle the tool calling, the turn structure, and the judging.

June 12, 2026
mtgautodeck.com

Key insights

Tool calling is the real bottleneck

What usually breaks here is not knowledge of Magic's rules. It is the ability to sequence actions through tools without contradicting prior state. In the examples described, models initiate actions they already know are wrong, then sometimes repeat the same malformed call with filler reasons like "placeholder" or "noop". That shifts the interpretation of results away from game intelligence and toward agent-loop stability under structured constraints.

If you use this kind of benchmark, track tool-call validity and self-contradiction as first-class metrics. A model that knows the rules but cannot drive the interface will still fail in production.
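
One way to make these metrics concrete is sketched below. It assumes a hypothetical trace format in which each tool call is logged with its name, its arguments, and whether the harness accepted it. None of these field names come from MTG Bench, and counting repeated identical rejected calls is only a rough proxy for self-contradiction.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    # Hypothetical trace record; field names are illustrative only.
    name: str
    args: tuple      # hashable arguments, e.g. (("card", "Forest"),)
    accepted: bool   # True if the harness accepted the call


def trace_metrics(calls: list[ToolCall]) -> dict[str, float]:
    """Summarize one episode: share of rejected calls, and share of
    calls that repeat an earlier rejected call verbatim."""
    total = len(calls)
    if total == 0:
        return {"invalid_rate": 0.0, "repeated_invalid_rate": 0.0}

    invalid = 0
    repeated_invalid = 0
    rejected_seen = set()
    for call in calls:
        if call.accepted:
            continue
        invalid += 1
        key = (call.name, call.args)
        if key in rejected_seen:
            # Same malformed call issued again after a rejection.
            repeated_invalid += 1
        rejected_seen.add(key)

    return {
        "invalid_rate": invalid / total,
        "repeated_invalid_rate": repeated_invalid / total,
    }
```

Reporting both numbers next to any gameplay score keeps interface failures from being read as rules ignorance.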

Attribution:

CallumFerg #1 #2

A rules engine would clean up scoring

Several people pushed for putting the model behind a real game engine like Forge or XMage and treating each move as a proposal that gets accepted or rejected. That would turn illegal actions into measurable events instead of relying on another LLM to judge legality after the fact. It also opens the door to model-vs-model matches and repeatable tournaments, though one builder noted that this gets expensive fast once you run full games at scale.

Use a deterministic environment when you want benchmark results you can compare over time. Save LLM judging for soft qualities, not for basic rule enforcement.
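
The propose-then-validate pattern can be sketched with a toy validator. This is not the Forge or XMage API. The state, the two cards, and the three rules below are simplified assumptions chosen only to show how a rejected move becomes a recorded event with a reason attached.

```python
from dataclasses import dataclass, field

# Hypothetical card data: name -> mana cost (lands cost 0).
COSTS = {"Forest": 0, "Grizzly Bears": 2}
LANDS = {"Forest"}


@dataclass
class ToyState:
    hand: list[str]
    mana_available: int = 0
    land_played: bool = False
    battlefield: list[str] = field(default_factory=list)


def propose(state: ToyState, action: str, card: str) -> tuple[bool, str]:
    """Apply the move if it is legal; otherwise leave the state
    untouched and return the reason for rejection."""
    if card not in COSTS:
        return False, "unknown card"
    if card not in state.hand:
        return False, "card not in hand"

    if action == "play_land":
        if card not in LANDS:
            return False, "not a land"
        if state.land_played:
            return False, "land already played this turn"
        state.land_played = True
        state.mana_available += 1  # toy simplification: one mana per land
    elif action == "cast":
        if card in LANDS:
            return False, "lands are played, not cast"
        if COSTS[card] > state.mana_available:
            return False, "not enough mana"
        state.mana_available -= COSTS[card]
    else:
        return False, "unknown action"

    state.hand.remove(card)
    state.battlefield.append(card)
    return True, "ok"
```

Because the validator is deterministic, the same sequence of proposals always yields the same accept and reject log, which is what makes runs comparable over time.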

Attribution:

derac #1
fc417fc802 #1 #2
josh_p #1
jdmoreira #1

Obscure game benchmarks have a short shelf life

People liked the benchmark partly because it is unusual enough that frontier models were probably not trained directly against it. That is the same reason RuneBench feels informative right now. The warning is that once a benchmark becomes visible, it starts attracting optimization and loses value as a proxy for general capability. Magic's combinatorial mess buys more runway, but not immunity.

Treat niche benchmarks as perishable signal. Rotate tasks or keep some evals private if you want them to stay diagnostic.

Attribution:

OsrsNeedsf2P #1
purple-leafy #1
TZubiri #1

Reasoning models can do it, but not cheaply

The strongest models with extra thinking time and rule lookup appear able to avoid most legality mistakes. The problem is economics, not just capability. Full simulations consume enough tokens and latency that you either get a sluggish user experience or a very expensive batch job. Another builder reported similar constraints and said they could only afford experiments with cheaper DeepSeek models.

Before turning a benchmark into a product feature, price the full loop with retries and repeated runs. Capability can be there long before the unit economics are.
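
A back-of-the-envelope version of that pricing exercise might look like the sketch below. Every parameter is a placeholder to be filled in from your own logs and your provider's price list. It also assumes that each attempt fails independently with the same probability and that every attempt, including failed ones, is billed in full.

```python
def expected_attempts(failure_rate: float, max_retries: int) -> float:
    """Expected number of billed attempts per call when a failed call
    is retried up to max_retries times (independent failures)."""
    return sum(failure_rate ** k for k in range(max_retries + 1))


def cost_per_game(
    turns: int,
    calls_per_turn: int,
    tokens_in: int,
    tokens_out: int,
    price_in_per_mtok: float,
    price_out_per_mtok: float,
    failure_rate: float,
    max_retries: int,
) -> float:
    """Rough expected cost of one simulated game, in the price's currency."""
    per_call = (
        tokens_in / 1_000_000 * price_in_per_mtok
        + tokens_out / 1_000_000 * price_out_per_mtok
    )
    attempts = expected_attempts(failure_rate, max_retries)
    return turns * calls_per_turn * attempts * per_call
```

Multiply the result by the number of games and repeated runs you need for a stable score before deciding whether the loop is affordable.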

Attribution:

CallumFerg #1
jdmoreira #1

Magic is hard in a deeper way

One commenter described building a Rust rules engine plus reinforcement learning and Monte Carlo Tree Search, and said simple aggressive decks were manageable while combo decks were much harder without expert demonstrations or reward shaping. Another pointed out that Magic has been shown to be Turing-complete. That is not just trivia. It explains why edge cases and strange interactions keep dominating both handcrafted engines and model-based agents.

Do not assume success on a narrow subset of decks means a model or planner has learned the game. Expand test suites across archetypes, especially combo and rules-bending decks, before drawing broad conclusions.
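
A per-archetype breakdown is one simple guard against that mistake. The sketch below assumes results arrive as (archetype, success) pairs. The archetype labels are hypothetical and the definition of success is left to the harness.

```python
from collections import defaultdict


def success_by_archetype(results: list[tuple[str, bool]]) -> dict[str, float]:
    """Success rate per deck archetype, so that one strong archetype
    cannot hide failures on another in a single aggregate number."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for archetype, ok in results:
        totals[archetype] += 1
        successes[archetype] += int(ok)
    return {name: successes[name] / totals[name] for name in totals}
```

An aggregate rate over three aggro successes and one combo failure looks respectable, while the breakdown shows that combo was never solved.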

Attribution:

alasdair_ #1
akoboldfrying #1

Against the grain

This is not really playing Magic

The sharpest pushback is that without opponents, interactive timing, and meaningful mulligan decisions, you are not measuring gameplay in the sense most players care about. You are measuring whether a model can execute legal solitaire turns. The author effectively confirmed that framing by saying the benchmark score is mostly about completing legal turns rather than making strong ones.

Label agent evals by the capability they actually test. Calling a legality harness a gameplay benchmark will confuse both readers and model comparisons.

Attribution:

OwenCR #1
CallumFerg #1
thurn #1

In plain English

Forge

An open source implementation of Magic: The Gathering with its own rules engine, also called Card Forge.

goldfish

A way of testing a deck by playing it alone without an opponent, to see how its own draws and sequences work.

MTG

Magic: The Gathering, a complex trading card game with detailed rules and many card interactions.

mulligan

The rule that lets a player redraw an opening hand, usually at some cost, if the first hand is poor.

priority

The turn-taking rule in Magic that determines which player is allowed to act at each moment.

reinforcement learning

A training method where a model learns by trying actions and receiving rewards or penalties.

Turing-complete

Capable of expressing any computation that a general-purpose computer can perform, given enough time and memory.

XMage

An open source platform for playing Magic: The Gathering online with a formal rules engine.

Reference links

Game benchmark projects

RuneBench
An example of another niche benchmark for testing how well LLMs can act inside a game world.
mage-bench
A separate project mentioned as another benchmark for model tournaments in Magic.
Card Forge
Suggested as a rules engine that could validate proposed moves instead of relying on LLM judging.

Rules engines and infrastructure

A Tour of CLIPS
Shared as background on CLIPS, the expert system language said to be used in MTG Arena's rules engine.
A simple TCP server written in Go and CLIPS
An example of declarative server logic in CLIPS, mentioned while discussing rules-engine approaches.

Magic complexity references

Judge Tower on MTG Wiki
Referenced as an even harsher format for testing rules handling under bizarre interactions.
Magic: The Gathering is Turing Complete
Cited to support the claim that Magic can encode arbitrary computation and is about as complicated as a game can get.
Matt Parker video on a three-card MTG combo
Given as an accessible example of how small card combinations can explode into huge and unintuitive game states.
Video explanation of the Turing-completeness paper
Shared as a more visual explanation of the paper showing how specific cards implement computation.
