MTG Bench: Testing how well LLMs can play Magic
- AI
- Benchmarks
- Games
- Developer Tools
The post introduces MTG Bench, an experiment that uses Magic: The Gathering as a benchmark for language models. The appeal is obvious: Magic has dense rules, hidden state, sequencing, and lots of edge cases, so it looks like a rich test of whether an LLM can operate inside a real system instead of just answering questions. But the benchmark, as built, mostly checks whether a model can produce legal turns in a goldfish-style setup (solo play against an opponent who does nothing) and keep its own tool calls consistent. It does not test head-to-head play or deep strategy.
If you are evaluating agents in complex domains, separate legal action execution from game skill and score both explicitly. Also assume your harness design will dominate the result, especially when tool calling, turn structure, and judging are all done by LLMs.
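One way to keep the two scores apart is sketched below. This is a minimal illustration, not MTG Bench's actual scoring: `TurnRecord`, `GameScore`, `score_game`, and the use of damage dealt as a skill proxy are all hypothetical names and assumptions chosen for the example.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TurnRecord:
    """One turn of a goldfish game (hypothetical record shape)."""
    legal: bool        # did a rules engine accept every action this turn?
    damage_dealt: int  # assumed skill proxy; a real harness might use something else


@dataclass
class GameScore:
    legality_rate: float                         # fraction of turns that were fully legal
    avg_damage_per_legal_turn: Optional[float]   # None when no turn was legal


def score_game(turns: List[TurnRecord]) -> GameScore:
    """Report legality and skill as two separate numbers."""
    if not turns:
        return GameScore(0.0, None)

    legal_turns = [t for t in turns if t.legal]
    legality_rate = len(legal_turns) / len(turns)

    # Skill is measured only over legal turns, so an illegal turn
    # lowers the legality score without distorting the skill score.
    if not legal_turns:
        return GameScore(legality_rate, None)

    avg_damage = sum(t.damage_dealt for t in legal_turns) / len(legal_turns)
    return GameScore(legality_rate, avg_damage)


turns = [
    TurnRecord(legal=True, damage_dealt=2),
    TurnRecord(legal=False, damage_dealt=0),
    TurnRecord(legal=True, damage_dealt=4),
    TurnRecord(legal=True, damage_dealt=0),
]
result = score_game(turns)
print(result.legality_rate, result.avg_damage_per_legal_turn)
```

Reporting the pair rather than a single blended number makes it visible when a model scores well only because it plays legally, or plays strongly on the few turns it gets right.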
- mtgautodeck.com