Research via arXiv cs.AI

ThermoQA Benchmark Reveals Gaps in LLMs' Thermodynamic Reasoning

ThermoQA is a new benchmark for evaluating LLMs on engineering thermodynamics problems. Performance drops significantly as problem complexity increases, even for frontier models; Claude Opus 4.6 leads the composite leaderboard at 94.1%.


Researchers have introduced ThermoQA, a benchmark comprising 293 open-ended thermodynamics problems divided into three tiers: property lookups, component analysis, and full cycle analysis. The benchmark uses CoolProp 7.2.0 for ground truth computations, covering water, R-134a, and variable-cp air. Six frontier LLMs were evaluated across three runs each, with Claude Opus 4.6 leading the composite leaderboard at 94.1%, followed by GPT-5.4 at 93.1% and Gemini 3.1 Pro at 92.5%.
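The article does not spell out how open-ended numeric answers are scored against the CoolProp ground truth. A common approach for numeric-answer benchmarks is a relative-error tolerance check; the sketch below illustrates that idea. The 1% tolerance and the sample values are illustrative assumptions, not taken from the paper (the enthalpy figure is the well-known saturated-vapor value for water at 1 atm, used here only as a stand-in for a tier-1 property lookup).

```python
def within_tolerance(answer: float, truth: float, rel_tol: float = 0.01) -> bool:
    """Score a numeric answer against a ground-truth value by relative error.

    rel_tol is a hypothetical 1% tolerance; the paper's actual rubric
    may differ.
    """
    if truth == 0.0:
        return abs(answer) <= rel_tol
    return abs(answer - truth) / abs(truth) <= rel_tol

# Illustrative tier-1 item: saturated-vapor enthalpy of water at 1 atm.
truth = 2_675_500.0  # J/kg, illustrative ground-truth value

print(within_tolerance(2_670_000.0, truth))  # True  (~0.2% error)
print(within_tolerance(2_500_000.0, truth))  # False (~6.6% error)
```

In practice the ground-truth side would come from a property library call (the paper uses CoolProp), while the answer side is parsed from the model's free-text response.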

The results show notable performance degradation across the three tiers: the smallest drop is 2.8 percentage points (Claude Opus 4.6) and the largest is 32.5 percentage points (MiniMax). Even models that excel at simpler tasks lose substantial accuracy on multi-step cycle analysis, underscoring the need for stronger reasoning capabilities in specialized engineering domains.

Moving forward, the ThermoQA benchmark will likely spur further research into enhancing LLMs' thermodynamic reasoning abilities. The significant performance drops suggest that current models may struggle with real-world engineering applications requiring multi-step, complex analyses. Future work could focus on developing more robust training methodologies and fine-tuning techniques to bridge these gaps and improve model performance across all tiers.

#thermodynamics #benchmark #llms #engineering #coolprop #ai-evaluation