AI Judges Flip Decisions 13.6% of the Time – Here's What That Means

Researchers found that AI judges used to rank other AI models often change their verdicts when given the same question repeatedly. This inconsistency could affect how AI performance is measured and how far public leaderboards can be trusted. The study tested two OpenAI judge models across 29 tasks and found that pairwise preferences flipped 13.6% of the time on average, with 28% of questions exceeding a 20% flip rate. The findings highlight the need for more reliable evaluation methods.

A study posted on arXiv examines LLM-as-a-Judge systems, in which AI models are used to rank the outputs of other AI models. It found that these judges flip their decisions 13.6% of the time on average when given the same questions multiple times.

This inconsistency matters because these AI judges are used to train reward models and create public leaderboards. If the judges aren't reliable, the rankings and training data might be flawed, affecting everything from AI development to consumer trust.

The study examined 29 tasks across 10 categories using two OpenAI judge models (GPT-4o-mini and GPT-4.1-mini), running 50 pairwise trials and 50 pointwise trials per question. The researchers also tested how temperature settings and prompt wording affected the results. Across all judges, pairwise preferences flipped 13.6% of the time on average, and 28% of questions had a flip rate exceeding 20%.
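
To make the two headline numbers concrete, here is a minimal Python sketch of how a per-question flip rate, its average, and the share of questions above a 20% threshold could be computed from repeated pairwise verdicts. The article does not give the study's exact definition of a flip, so this sketch assumes a flip is any trial whose verdict differs from the most common verdict for that question. The function names and the sample verdicts are hypothetical and are not taken from the study.

```python
from collections import Counter


def flip_rate(verdicts):
    """Share of trials that disagree with the most common verdict.

    ASSUMPTION: a "flip" is a trial whose verdict differs from the
    majority verdict for the question. The study may define it differently.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    majority_count = Counter(verdicts).most_common(1)[0][1]
    return (len(verdicts) - majority_count) / len(verdicts)


def summarize(trials_by_question, threshold=0.20):
    """Return (mean flip rate, share of questions with flip rate > threshold)."""
    rates = [flip_rate(v) for v in trials_by_question.values()]
    mean_rate = sum(rates) / len(rates)
    share_above = sum(r > threshold for r in rates) / len(rates)
    return mean_rate, share_above


# HYPOTHETICAL data: 50 pairwise verdicts ("A" or "B") per question.
trials = {
    "q1": ["A"] * 50,               # judge never changes its verdict
    "q2": ["A"] * 45 + ["B"] * 5,   # 5 of 50 trials disagree
    "q3": ["A"] * 35 + ["B"] * 15,  # 15 of 50 trials disagree
}

mean_rate, share_above = summarize(trials)
print(f"mean flip rate: {mean_rate:.1%}")
print(f"questions above 20%: {share_above:.1%}")
```

Under this definition a two-option question can never score above 50%, so a 20% flip rate already means the judge contradicts its own usual answer in one trial out of five.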