The paper introduces “Benchmarks in Leipzig,” a set of 100 mathematics questions written by 49 researchers. These are not unsolved problems and not publication-worthy research prompts. They are questions with known answers that require understanding and applying existing literature, often at roughly the level of a specialist PhD student working in that area. The headline result is that after a multi-stage evaluation, only 2 questions remained unsolved by every model configuration tested. Readers took that as another sign that the space for hard but answerable benchmark questions is shrinking. When experts try to write hard, answerable questions based on known math, top models now clear almost all of them.
The strongest reaction was not “AI can do math research now.” It was more specific. People accepted that solving never-before-seen questions built from existing theory is impressive, especially because the exact problems were newly written and not lifted from standard sources. But they also kept drawing a hard line between “can derive an answer from the literature” and “can create new mathematics.” The author made that distinction repeatedly and bluntly. These questions test whether a model can digest existing work and apply it. They do not test whether it can produce publishable ideas. Several readers said that framing is the useful one. Exercise-style benchmarks over public research are nearing saturation, so future evaluations need to target reliability, proof quality, and genuinely open-ended discovery rather than just ever-harder closed-form answers.
A second theme was benchmark validity. One commenter pushed hard on the paper’s “research-level” language and argued that some items looked more like computational or brute-force exercises than the kind of conceptual problems mathematicians actually care about. That criticism did not overturn the overall result, but it did sharpen the main caveat. The benchmark is best read as a test of advanced mathematical problem solving over known material, not as a clean measure of abstract theorem-proving or original research taste. Related complaints focused on uneven model settings and sponsorship. OpenAI models looked much stronger, but readers noted that effort settings and retry behavior were not fully matched across vendors, and the runs were subsidized by Surge AI because the evaluation cost was too high for the researchers to absorb directly. Another practical point landed well: aggregate “problems solved” can hide ugly error rates. For real use, one-shot accuracy and answer calibration matter more than best-of-many success.
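
To make the gap between one-shot accuracy and best-of-many success concrete, here is a minimal Python sketch. It uses the standard unbiased pass@k estimator; the per-question numbers are invented for illustration and are not taken from the paper, and nothing here is claimed to be the scoring method the authors used.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Chance that at least one of k attempts, drawn without replacement
    from n recorded attempts of which c were correct, is correct."""
    if n - c < k:
        # Fewer than k wrong attempts exist, so any k draws include a correct one.
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


# Hypothetical results, NOT from the paper: (attempts, correct attempts) per question.
results = [(8, 8), (8, 1), (8, 2), (8, 0)]

# Average over questions of the chance a single attempt is right.
one_shot = sum(pass_at_k(n, c, 1) for n, c in results) / len(results)

# Share of questions counted as solved if any of the 8 attempts is right.
best_of_8 = sum(pass_at_k(n, c, 8) for n, c in results) / len(results)

print(f"one-shot accuracy: {one_shot:.3f}")   # 0.344
print(f"best-of-8 solved:  {best_of_8:.3f}")  # 0.750
```

In this made-up case, three of four questions count as solved under best-of-8, yet a single attempt is right only about a third of the time. That is the kind of error rate an aggregate solved count can hide.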