HN Debrief

MTG Bench: Testing how well LLMs can play Magic

  • AI
  • Benchmarks
  • Games
  • Developer Tools

The post introduces MTG Bench, an experiment that uses Magic: The Gathering as a benchmark for language models. The appeal is obvious. Magic has dense rules, hidden state, sequencing, and lots of edge cases, so it looks like a rich test of whether an LLM can operate inside a real system instead of just answering questions. But the benchmark, as built, mostly checks whether a model can produce legal turns in a goldfish-style setup and keep its own tool calls consistent. It does not really test head-to-head play or deep strategy.

If you are evaluating agents in complex domains, separate legal action execution from game skill and score both explicitly. Also assume your harness design will dominate the result, especially when tool calling, turn structure, and judging are all done by LLMs.
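
A minimal sketch of what scoring the two separately might look like, in Python. The record fields, the 0 to 1 `quality` scale, and the `score` function are hypothetical and are not taken from MTG Bench. The only point is that legality and skill are reported as two numbers, and that skill is averaged over legal turns only, so an illegal turn cannot earn skill credit.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TurnRecord:
    legal: bool                # did a deterministic check accept the turn?
    quality: Optional[float]   # hypothetical 0..1 skill rating; None if unrated


def score(turns):
    """Return (legality_rate, mean_quality_over_legal_turns)."""
    if not turns:
        return 0.0, None
    legality = sum(t.legal for t in turns) / len(turns)
    # Skill is only meaningful for turns that were legal and actually rated.
    rated = [t.quality for t in turns if t.legal and t.quality is not None]
    skill = sum(rated) / len(rated) if rated else None
    return legality, skill


turns = [
    TurnRecord(True, 0.5),
    TurnRecord(True, 1.0),
    TurnRecord(False, None),   # illegal turn: counts against legality only
    TurnRecord(True, None),    # legal but unrated: excluded from skill
]
legality, skill = score(turns)
print(legality, skill)
```

Keeping the two numbers apart means a model that completes every turn legally but plays poorly does not look the same as one that plays well on the few turns it gets through.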

Discussion mood

Interested and positive about the idea, but skeptical of the current methodology. Most comments treated Magic as a genuinely hard and useful testbed, while arguing that the benchmark currently measures legal move generation and tool orchestration more than actual gameplay quality.

Key insights

  1. 01

    Tool calling is the real bottleneck

    What breaks here is not usually Magic rules knowledge. It is the ability to sequence actions through tools without contradicting prior state. The examples described show models initiating actions they already know are wrong, then sometimes repeating the same malformed call with filler reasons like "placeholder" or "noop". That shifts the interpretation of results away from game intelligence and toward agent loop stability under structured constraints.

    If you use this kind of benchmark, track tool-call validity and self-contradiction as first-class metrics. A model that knows the rules but cannot drive the interface will still fail in production.

      Attribution:
    • CallumFerg #1 #2
  2. 02

    A rules engine would clean up scoring

    Several people pushed for putting the model behind a real game engine like Forge or XMage and treating each move as a proposal that gets accepted or rejected. That would turn illegal actions into measurable events instead of relying on another LLM to judge legality after the fact. It also opens the door to model-vs-model matches and repeatable tournaments, though one builder noted that this gets expensive fast once you run full games at scale.

    Use a deterministic environment when you want benchmark results you can compare over time. Save LLM judging for soft qualities, not for basic rule enforcement.

      Attribution:
    • derac #1
    • fc417fc802 #1 #2
    • josh_p #1
    • jdmoreira #1
  3. 03

    Obscure game benchmarks have a short shelf life

    People liked this partly because it is unusual enough that frontier models were probably not directly trained against it. That is the same reason RuneBench feels informative right now. The warning is that once a benchmark becomes visible, it starts attracting optimization and loses value as a proxy for general capability. Magic gives you more runway because of its combinatorial mess, but not immunity.

    Treat niche benchmarks as perishable signal. Rotate tasks or keep some evals private if you want them to stay diagnostic.

      Attribution:
    • OsrsNeedsf2P #1
    • purple-leafy #1
    • TZubiri #1
  4. 04

    Reasoning models can do it, but not cheaply

    The strongest models with extra thinking time and rule lookup appear able to avoid most legality mistakes. The problem is economics, not just capability. Full simulations consume enough tokens and latency that you either get a sluggish user experience or a very expensive batch job. Another builder reported similar constraints and said they could only afford experiments with cheaper DeepSeek models.

    Before turning a benchmark into a product feature, price the full loop with retries and repeated runs. Capability can be there long before the unit economics are.

      Attribution:
    • CallumFerg #1
    • jdmoreira #1
  5. 05

    Magic is hard in a deeper way

    One commenter described building a Rust rules engine plus reinforcement learning and Monte Carlo Tree Search, and said simple aggressive decks were manageable while combo decks were much harder without expert demonstrations or reward shaping. Another pointed out that Magic has been shown to be Turing-complete. That is not just trivia. It explains why edge cases and strange interactions keep dominating both handcrafted engines and model-based agents.

    Do not assume success on a narrow subset of decks means a model or planner has learned the game. Expand test suites across archetypes, especially combo and rules-bending decks, before drawing broad conclusions.

      Attribution:
    • alasdair_ #1
    • akoboldfrying #1
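
The engine-as-referee idea and the tool-call metrics from the first two insights can be sketched together. This is a toy, not Forge or XMage: `ToyEngine`, its single one-land-per-turn rule, and the scripted proposals are hypothetical stand-ins. It shows each proposal being accepted or rejected by deterministic code, with rejections and verbatim repeats of an already rejected call counted as separate metrics.

```python
from collections import Counter


class ToyEngine:
    """Hypothetical stand-in for a real rules engine; enforces one rule only."""

    def __init__(self, hand):
        self.hand = list(hand)
        self.land_played = False

    def apply(self, action):
        """Return (accepted, reason); state changes only when accepted."""
        kind, card = action
        if kind != "play_land":
            return False, "unknown action"
        if card not in self.hand:
            return False, "card not in hand"
        if self.land_played:
            return False, "land already played this turn"
        self.hand.remove(card)
        self.land_played = True
        return True, "ok"


def run_turn(engine, proposals):
    """Feed proposals to the engine and count accept/reject outcomes."""
    stats = Counter()
    rejected = set()
    for action in proposals:
        accepted, _reason = engine.apply(action)
        if accepted:
            stats["accepted"] += 1
        else:
            stats["rejected"] += 1
            # A verbatim repeat of a rejected call signals loop instability.
            if action in rejected:
                stats["repeated_rejected"] += 1
            rejected.add(action)
    return stats


proposals = [
    ("play_land", "Forest"),
    ("play_land", "Island"),    # illegal: second land this turn
    ("play_land", "Island"),    # the same rejected call, repeated
    ("noop", "placeholder"),    # filler call the engine does not recognize
]
stats = run_turn(ToyEngine(["Forest", "Island"]), proposals)
print(dict(stats))
```

In a real harness the proposals would come from the model one at a time, with the rejection reason fed back before the next call. The counters stay the same either way.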

Against the grain

  1. 01

    This is not really playing Magic

    The sharpest pushback is that without opponents, interactive timing, and meaningful mulligan decisions, you are not measuring gameplay in the sense most players care about. You are measuring whether a model can execute legal solitaire turns. The author effectively confirmed that framing by saying the benchmark score is mostly about completing legal turns rather than making strong ones.

    Label agent evals by the capability they actually test. Calling a legality harness a gameplay benchmark will confuse both readers and model comparisons.

      Attribution:
    • OwenCR #1
    • CallumFerg #1
    • thurn #1

In plain English

Forge
An open source digital implementation of Magic: The Gathering that includes rules enforcement and AI play.
goldfish
A way of testing a deck by playing it alone without an opponent, to see how its own draws and sequences work.
MTG
Magic: The Gathering, a complex trading card game with detailed rules and many card interactions.
mulligan
The rule that lets a player redraw an opening hand, usually at some cost, if the first hand is poor.
priority
The rule in Magic that determines which player may cast spells, activate abilities, or take other actions at a given moment.
reinforcement learning
A machine learning method where an agent learns by taking actions and receiving rewards or penalties.
Turing-complete
Capable of expressing any computation that a general-purpose computer can perform, given enough time and memory.
XMage
An open source platform for playing Magic: The Gathering online with automated rules enforcement.

Reference links

Game benchmark projects

  • RuneBench
    An example of another niche benchmark for testing how well LLMs can act inside a game world.
  • mage-bench
    A separate project mentioned as another benchmark for model tournaments in Magic.

Rules engines and infrastructure

  • Card Forge
    Suggested as a rules engine that could validate proposed moves instead of relying on LLM judging.

Magic complexity references