AI Models Still Make Irrational Decisions Even When Aligned

Researchers found that AI models can still make irrational decisions even when trained to align with human values. This 'rational value risk' means models may not always choose the best possible action, even if they understand what's valuable.

A study posted to arXiv's cs.AI section shows that AI models can still make irrational decisions even when trained to align with human values. The study introduces the concept of 'rational value risk,' which measures the gap between the decisions a model actually makes and the decisions that would maximize expected utility. The risk is formalized as the utility discrepancy between a model's deployed reasoning strategy and its rational counterpart, where the rational counterpart is defined as the responses that maximize expected utility in the steepest direction. In other words, a model can correctly represent what is valuable and still fail to pick the action that best achieves it.
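The gap described above can be illustrated with a toy calculation. The sketch below is not the paper's formulation: it assumes a single decision with a small, finite set of actions, a known utility for each action, and a deployed strategy expressed as a probability distribution over those actions. The function names, action labels, and numbers are all hypothetical.

```python
# Toy illustration only; an assumed simplification, not the paper's definition.

def expected_utility(policy, utilities):
    """Expected utility of a probability distribution over actions."""
    return sum(prob * utilities[action] for action, prob in policy.items())


def utility_gap(policy, utilities):
    """Gap between the best achievable expected utility and the policy's.

    The 'rational counterpart' is assumed here to put all probability
    on the highest-utility action.
    """
    best = max(utilities.values())
    return best - expected_utility(policy, utilities)


# Hypothetical utilities the model is assumed to represent correctly.
utilities = {"help": 1.0, "hedge": 0.5, "refuse": 0.0}

# Hypothetical deployed strategy: picks the best action only half the time.
deployed = {"help": 0.5, "hedge": 0.5, "refuse": 0.0}

# Rational counterpart: always picks the highest-utility action.
rational = {"help": 1.0}

deployed_gap = utility_gap(deployed, utilities)
rational_gap = utility_gap(rational, utilities)
print(deployed_gap, rational_gap)
```

In this toy setup the utilities are the same for both strategies, so any gap comes from how the action is chosen rather than from what the model values, which is the distinction the study draws.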

This matters because it highlights a fundamental limitation in current AI systems. Even if an AI is trained to follow human values, it might still make decisions that don't align with those values in practice. The paper shows that this irrationality is separate from misalignment: it can persist even after perfect alignment in training. In real-world applications, that could lead to unexpected and potentially harmful behavior.