AI Judges Can Be Manipulated After Making Decisions, Study Finds

Researchers discovered that AI judges used to rank model performance can be swayed by follow-up conversations after they have already made a decision. This vulnerability, called 'post-decision manipulability,' challenges the reliability of current AI evaluation methods.

A new study posted to arXiv reveals that AI judges (large language models, or LLMs, used to evaluate and rank other AI models) can be manipulated after making their initial decisions. These judges, widely used in benchmarking pipelines like MT-Bench and AlpacaEval, are assumed to produce stable and consistent judgments from fixed inputs. However, the researchers demonstrate that this assumption does not hold once the judge is drawn into further conversation.

The study introduces the concept of 'post-decision manipulability': the extent to which an evaluation outcome can be altered through subsequent conversation with the judge after an initial decision has been made. Through controlled experiments, the researchers found that LLM judges can shift their earlier assessments when engaging in follow-up dialogue, undermining the stability of automated evaluation rankings.
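
One simple way to picture this kind of measurement is a "flip rate": the share of verdicts that differ before and after a follow-up exchange. The Python sketch below is illustrative only and is not the metric defined in the study. The names JudgedPair and flip_rate, the verdict labels, and the sample records are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class JudgedPair:
    """One comparison, judged before and after a follow-up turn (hypothetical schema)."""
    initial_verdict: str  # e.g. "A", "B", or "tie"
    final_verdict: str    # verdict after the follow-up conversation


def flip_rate(records):
    """Return the fraction of records whose verdict changed after the follow-up."""
    if not records:
        raise ValueError("flip_rate needs at least one record")
    flips = sum(1 for r in records if r.initial_verdict != r.final_verdict)
    return flips / len(records)


# Made-up sample data, for illustration only (not results from the study).
records = [
    JudgedPair("A", "A"),
    JudgedPair("A", "B"),
    JudgedPair("B", "B"),
    JudgedPair("tie", "A"),
]
rate = flip_rate(records)
print(f"Verdicts changed in {rate:.0%} of comparisons")
```

A flip rate of zero would mean the judge never changed its verdict under follow-up; higher values would indicate less stable verdicts.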

This finding calls into question the reliability of current AI evaluation methods. If AI judges can be swayed by conversation after giving a rating, the rankings and comparisons that developers and users rely on may not be as trustworthy as assumed. The work highlights a critical vulnerability in how AI systems are benchmarked, with potential implications for how models are developed, deployed, and chosen.