New Study Reveals AI Struggles to Grade Human Math Reasoning Like Teachers

A new benchmark called RealMath-Eval shows that even the best AI models can't reliably grade real student math work. The result highlights a gap between how well AI solves problems itself and how well it understands human reasoning.

Researchers introduced RealMath-Eval, a collection of 224 real high-school math exam responses. While AI models excel at solving math problems, they struggle to evaluate the diverse reasoning found in actual student work. The study found that the scores assigned by state-of-the-art AI judges diverged substantially from those of expert human graders, with a Mean Squared Error of around 2.96.
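To make the metric concrete, here is a minimal sketch of how a Mean Squared Error between an AI judge's scores and expert scores can be computed. The function name and the scores below are hypothetical, made up for illustration; they are not data from the study, and the article does not state the grading scale the study used.

```python
def mean_squared_error(judge_scores, human_scores):
    """Average of the squared differences between paired scores."""
    if len(judge_scores) != len(human_scores):
        raise ValueError("score lists must have the same length")
    if not judge_scores:
        raise ValueError("score lists must not be empty")
    squared_gaps = [(j - h) ** 2 for j, h in zip(judge_scores, human_scores)]
    return sum(squared_gaps) / len(squared_gaps)


# Hypothetical scores for four student responses (not from the study).
human = [4, 2, 5, 0]
judge = [5, 4, 5, 2]

print(mean_squared_error(judge, human))  # prints 2.25
```

Because the errors are squared, large disagreements count much more than small ones. Taking the square root puts the figure back in score units: an MSE of 2.96 corresponds to a root-mean-square gap of about 1.72 points on whatever scale the graders used.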

This matters because it suggests AI can't yet match human teachers at understanding the full spectrum of human reasoning. For example, an AI might solve a math problem correctly yet fail to recognize a creative or unconventional student approach that a teacher would value. This gap could affect how AI is used in education, from grading tools to tutoring systems.

If you're curious, you can read the study on arXiv at https://arxiv.org/abs/2606.10254. While you can't directly test the benchmark, understanding this research can help you evaluate AI tools in education more critically.