New Benchmark Reveals Why Some AI Forecasters Outperform Others

Researchers introduce BTF-2, a benchmark with 1,417 pastcasting questions, to analyze AI forecasters' reasoning. The study identifies accuracy differences as small as 0.004 Brier score and builds a more accurate composite forecaster.

Researchers have developed Bench to the Future 2 (BTF-2), a benchmark for evaluating how AI forecasting agents reason, not only how accurate they are. BTF-2 comprises 1,417 pastcasting questions (questions whose outcomes are already known, posed as if they were still open) and a frozen research corpus of 15 million documents, so agents can research and forecast offline while producing full reasoning traces. The benchmark can detect accuracy differences as small as 0.004 Brier score and can separate an agent's strength in research from its strength in judgment.
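For readers unfamiliar with the metric, the Brier score for binary questions is the mean squared difference between the forecast probability and the outcome (1 if the event happened, 0 if not), so lower is better. The sketch below is a minimal illustration of that standard definition; the function name and the toy numbers are assumptions made for this example and are not taken from BTF-2.

```python
def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and binary outcomes.

    probabilities: floats in [0, 1], one per question.
    outcomes: 0 or 1 per question, in the same order.
    Lower is better; a constant forecast of 0.5 always scores 0.25.
    """
    if len(probabilities) != len(outcomes):
        raise ValueError("probabilities and outcomes must have the same length")
    if not probabilities:
        raise ValueError("at least one forecast is required")
    squared_errors = [(p - o) ** 2 for p, o in zip(probabilities, outcomes)]
    return sum(squared_errors) / len(squared_errors)


# Toy data (hypothetical, not from the benchmark): three resolved questions.
toy_forecasts = [0.9, 0.2, 0.6]
toy_outcomes = [1, 0, 1]
toy_score = brier_score(toy_forecasts, toy_outcomes)
print(f"Brier score: {toy_score:.3f}")
```

On this scale, a gap of 0.004 between two agents is small relative to the 0.25 scored by an uninformed 50% forecast, which is why resolving it takes a large question set.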

BTF-2 matters because it helps explain why some AI forecasters are more accurate than others. Traditional forecasting benchmarks often report accuracy leaderboards without explaining what drives the differences in performance. By analyzing reasoning traces, BTF-2 can pinpoint where an agent excels or falters, for example in the quality of its research or in the judgment it applies to what it finds. That granularity makes it possible to build more effective forecasters.

Using BTF-2, researchers built a composite forecaster whose Brier score is 0.011 better (lower) than that of any single frontier agent. The result suggests that combining the strengths of multiple agents can produce a more accurate forecasting system than any one agent alone. Future research could explore how these insights apply to real-world forecasting tasks, such as economic predictions or policy planning, to improve decision-making.
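The article does not say how the composite forecaster was constructed. Purely as an illustration of the general idea of combining agents, the sketch below averages the probabilities of two hypothetical agents question by question and compares Brier scores. Simple averaging is an assumption chosen for this example, not the researchers' method, and the agent names and numbers are made up.

```python
from statistics import mean


def brier_score(probabilities, outcomes):
    """Mean squared error between forecast probabilities and 0/1 outcomes."""
    if len(probabilities) != len(outcomes) or not probabilities:
        raise ValueError("need equally sized, non-empty inputs")
    return sum((p - o) ** 2 for p, o in zip(probabilities, outcomes)) / len(outcomes)


def average_ensemble(forecasts_by_agent):
    """Average each question's probability across agents.

    forecasts_by_agent: dict mapping a (hypothetical) agent name to its list
    of probabilities, with every list in the same question order.
    """
    lengths = {len(forecasts) for forecasts in forecasts_by_agent.values()}
    if len(lengths) != 1:
        raise ValueError("every agent must forecast the same set of questions")
    # zip(*...) groups the forecasts for question 1, question 2, and so on.
    return [mean(per_question) for per_question in zip(*forecasts_by_agent.values())]


# Toy data (hypothetical): two agents, three resolved questions.
toy_agents = {
    "agent_a": [0.9, 0.4, 0.7],
    "agent_b": [0.6, 0.1, 0.9],
}
toy_outcomes = [1, 0, 1]

combined = average_ensemble(toy_agents)
for name, forecasts in toy_agents.items():
    print(name, round(brier_score(forecasts, toy_outcomes), 4))
print("ensemble", round(brier_score(combined, toy_outcomes), 4))
```

On this toy data the averaged forecast scores better than either agent, but that is not guaranteed in general. Because squared error is convex, a plain average is never worse than the mean of the individual Brier scores, yet it can still lose to the single best agent. A composite informed by where each agent is strong, as the article describes, could plausibly do better than uniform averaging.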