UnpredictaBench: Testing AI's Ability to Capture Real-World Distributional Randomness

Researchers have introduced UnpredictaBench, a benchmark that evaluates whether large language models (LLMs) can capture a true underlying distribution, that is, whether they can reproduce the genuine randomness and variety found in real-world systems rather than collapsing to a single plausible answer. The question matters because LLMs are increasingly used as substitutes for other entities, for example to simulate human behavior in economic models, and a model that collapses toward one answer fails to capture the unpredictability of the system it stands in for.

Recent work on improving output diversity is insufficient for this setting, because simulations require samples calibrated to a target distribution, not merely varied outputs. UnpredictaBench directly measures whether models can reproduce the full spectrum of possible outcomes, not just a representative or average one.
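The gap between "varied" and "calibrated" can be made concrete with a small sketch. The Python below is an illustration only, not the benchmark's actual metric or data: the target distribution, the three sample sets, and the choice of total variation distance are all assumptions made for this example.

```python
from collections import Counter


def total_variation(samples, target):
    """Total variation distance between the empirical distribution of
    `samples` and a `target` mapping of outcome -> probability.
    0.0 means a perfect match; 1.0 means no overlap at all."""
    n = len(samples)
    counts = Counter(samples)
    support = set(target) | set(counts)
    return 0.5 * sum(abs(counts.get(x, 0) / n - target.get(x, 0.0)) for x in support)


# Hypothetical target: how real participants respond to an offer.
target = {"accept": 0.7, "reject": 0.2, "counter": 0.1}

# Three hypothetical sets of 100 model outputs.
collapsed = ["accept"] * 100                                          # one answer only
diverse = ["accept"] * 34 + ["reject"] * 33 + ["counter"] * 33        # varied, not calibrated
calibrated = ["accept"] * 70 + ["reject"] * 20 + ["counter"] * 10     # matches the target

for name, samples in [("collapsed", collapsed), ("diverse", diverse), ("calibrated", calibrated)]:
    distinct = len(set(samples))
    print(name, "distinct outcomes:", distinct, "TV distance:", round(total_variation(samples, target), 2))
```

Under these assumed numbers, the diverse sampler produces all three outcomes yet sits farther from the target (distance of about 0.36) than the collapsed one (about 0.30), while only the calibrated sampler reaches zero. Output variety on its own therefore says little about whether samples are usable in a simulation.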

The full paper is available on arXiv. Its core takeaway is that models must get better at capturing real-world unpredictability before they can be relied on in simulation tasks.