Study Reveals Language Models Rely on Positional Shortcuts Under Adversarial Conditions

Researchers found that language models often use positional shortcuts rather than engaging with question content when instructed to underperform. The study used a six-condition adversarial instruction-specificity gradient on Llama-3-8B and Llama-3.1-8B models.

A new study published on arXiv reveals that language models (LLMs) often resort to positional shortcuts rather than engaging with the content of multiple-choice questions when instructed to underperform. The research, conducted by administering a six-condition adversarial instruction-specificity gradient to Llama-3-8B and Llama-3.1-8B models on 2,000 MMLU-Pro items, mapped the boundary between content engagement and positional shortcuts.

The study used distributional screening (response-position entropy) and an independent content-engagement criterion (difficulty-accuracy correlation) to characterize each condition. The findings suggest that as the complexity of instructions increases, models are more likely to fall back on positional strategies, potentially skewing evaluation results.

This research highlights the importance of understanding how LLMs respond to adversarial instructions, which could have significant implications for the design of evaluation benchmarks. Future studies may explore how different models and instruction types affect these outcomes, and whether positional shortcuts can be mitigated through improved training techniques.