Why AI Benchmarks Need More Than Just Accuracy: The CORE-Bench Case Study

Researchers argue that once a benchmark's accuracy saturates, evaluation should shift to six other dimensions of agent performance, ranging from construct validity to human–AI collaboration uplift.

A new paper on CORE-Bench Hard, a benchmark for computational reproducibility of scientific code, argues that retiring a benchmark for a harder version as soon as its accuracy saturates misses the full picture. The authors propose that evaluations should also consider six dimensions of agent performance beyond accuracy: construct validity (for example, whether agents take shortcuts instead of solving the task as intended), out-of-distribution generalizability, efficiency, reliability, the relative importance of the model versus the scaffold (the agent framework built around the model), and uplift from human–agent collaboration.

This matters because a high accuracy score can conceal serious weaknesses. For example, an agent might be efficient but unreliable, perform well on in-distribution tasks but fail out-of-distribution, or owe most of its performance to its scaffold rather than its underlying model. Measuring these dimensions gives researchers the information they need to build more robust and trustworthy AI systems.
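To make the point concrete, here is a minimal sketch of how two of these dimensions, reliability and efficiency, might be reported alongside accuracy when each task is attempted several times. The `Run` record, the `summarize` function, and the metric definitions (a task counts as reliably solved only if every repeated run succeeds; efficiency is total cost divided by successful runs) are illustrative assumptions, not the definitions used in the paper.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    """One attempt at one task (hypothetical record format)."""
    task_id: str
    success: bool
    cost_usd: float


def summarize(runs):
    """Report accuracy plus two extra dimensions from repeated runs.

    Assumed definitions (not taken from the paper):
      accuracy         = successful runs / all runs
      reliability      = fraction of tasks solved in every one of their runs
      cost_per_success = total cost / number of successful runs
    """
    if not runs:
        raise ValueError("need at least one run")

    # Group the repeated attempts by task.
    by_task = defaultdict(list)
    for run in runs:
        by_task[run.task_id].append(run)

    n_success = sum(run.success for run in runs)
    total_cost = sum(run.cost_usd for run in runs)
    reliably_solved = sum(
        all(run.success for run in attempts) for attempts in by_task.values()
    )

    return {
        "accuracy": n_success / len(runs),
        "reliability": reliably_solved / len(by_task),
        "cost_per_success": total_cost / n_success if n_success else float("inf"),
    }


# Two tasks, three attempts each: task "a" always succeeds, task "b" fails once.
example_runs = [
    Run("a", True, 1.0), Run("a", True, 1.0), Run("a", True, 1.0),
    Run("b", True, 2.0), Run("b", True, 2.0), Run("b", False, 2.0),
]
report = summarize(example_runs)
print(report)
```

In this toy data the per-run accuracy is 5/6, yet only one of the two tasks is solved consistently, which is the kind of gap a single accuracy number hides.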

The paper, titled 'Life After Benchmark Saturation: A Case Study of CORE-Bench,' is available on arXiv (cs.AI).