Research via arXiv cs.CL

GeoRepEval: Testing LLMs' Geometry Reasoning Across Representations

Researchers introduce GeoRepEval, a framework for evaluating large language models' robustness to different representations of the same geometric problem. The study argues that current benchmarks implicitly assume representation invariance, which can mask failures triggered by nothing more than a change of representation.

A new paper on arXiv introduces GeoRepEval, a framework designed to test large language models' (LLMs) ability to handle geometric problems presented in different forms. The study notes that while LLMs are often evaluated on mathematical reasoning, existing benchmarks implicitly assume representation invariance: each problem is tested in a single fixed form, so nothing checks whether a model that solves it in Euclidean terms can also solve the same problem restated in coordinate or vector terms.
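To make the distinction concrete, here is one geometric fact stated three ways. The specific example is illustrative and does not come from the paper:

```latex
% One fact, three representations (illustrative example, not from the paper).
\textbf{Euclidean:} In parallelogram $ABCD$, the diagonals $AC$ and $BD$
bisect each other.

\textbf{Coordinate:} With $A=(0,0)$, $B=(p,0)$, $D=(q,r)$, $C=(p+q,r)$,
\[
  \text{midpoint}(AC) = \left(\tfrac{p+q}{2}, \tfrac{r}{2}\right)
                      = \text{midpoint}(BD).
\]

\textbf{Vector:} Taking $A$ as the origin, $\vec{c} = \vec{b} + \vec{d}$, so
\[
  \tfrac{1}{2}\vec{c} = \tfrac{1}{2}\bigl(\vec{b} + \vec{d}\bigr),
\]
i.e.\ the midpoints of both diagonals coincide.
```

A model that handles only one of these phrasings would score full marks on a fixed-format benchmark while failing on the others.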

The significance of this research lies in its potential to uncover hidden vulnerabilities in LLMs. Current benchmarks report accuracy on fixed formats, which can mask failures caused by representational changes. GeoRepEval aims to measure correctness, invariance, and consistency, providing a more comprehensive evaluation of LLMs' geometric reasoning capabilities.
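The summary does not spell out how the paper defines its metrics, but a minimal sketch of how correctness, invariance, and consistency could be computed over multiple representations might look like the following. The `ask` callable and the dataset layout are assumptions for illustration, not the paper's actual API:

```python
# Hypothetical sketch of representation-level metrics in the spirit of
# GeoRepEval; metric definitions and the ask() helper are assumptions,
# not the paper's actual framework.
from collections import defaultdict

def evaluate(problems, ask):
    """problems: list of dicts pairing a gold answer with the same problem
    phrased in several representations, e.g.
      {"gold": "5", "reps": {"euclidean": "...", "coordinate": "...",
                             "vector": "..."}}
    ask: callable prompt -> model answer (the LLM under test)."""
    per_rep_correct = defaultdict(list)
    invariant, consistent = [], []
    for p in problems:
        # Query the model once per representation of the same problem.
        answers = {name: ask(text) for name, text in p["reps"].items()}
        for name, ans in answers.items():
            per_rep_correct[name].append(ans == p["gold"])
        # Invariance: correct under *every* representation.
        invariant.append(all(a == p["gold"] for a in answers.values()))
        # Consistency: same answer under every representation, right or wrong.
        consistent.append(len(set(answers.values())) == 1)
    return {
        "accuracy_by_rep": {k: sum(v) / len(v)
                            for k, v in per_rep_correct.items()},
        "invariance": sum(invariant) / len(invariant),
        "consistency": sum(consistent) / len(consistent),
    }
```

Under this reading, invariance is strictly stronger than per-representation accuracy, and consistency can be high even when accuracy is low, since a model can be reliably wrong in the same way across all forms.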

The introduction of GeoRepEval could lead to more robust and reliable LLMs, particularly in fields requiring precise mathematical reasoning. Future research might explore how this framework can be applied to other domains, such as algebra or calculus, to ensure that LLMs are truly representation-invariant across various mathematical disciplines.

#llms #geometry #mathematical-reasoning #evaluation-framework #ai-research #benchmarking