CoSToM: Steering LLMs Toward Intrinsic Theory-of-Mind Alignment
Researchers introduce CoSToM, a method to improve LLMs' intrinsic Theory-of-Mind capabilities. The study highlights gaps in current models' ability to generalize social reasoning beyond prompt scaffolding.

A new arXiv paper introduces CoSToM (Causal-oriented Steering for Intrinsic Theory-of-Mind Alignment), a framework designed to enhance the ability of large language models (LLMs) to attribute mental states to others, a key aspect of social intelligence. The researchers argue that while LLMs perform well on standard Theory-of-Mind (ToM) benchmarks, they often struggle with complex, real-world scenarios and rely heavily on carefully crafted prompts to simulate reasoning.
The critical issue, according to the authors, is the misalignment between the internal knowledge of LLMs and their external behavior. This raises questions about whether these models truly possess intrinsic cognition or merely mimic reasoning through external scaffolding. CoSToM aims to bridge this gap by steering models toward more intrinsic ToM capabilities, potentially improving their performance in socially complex tasks.
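The article describes CoSToM only at a high level; in the broader LLM literature, "steering" usually means intervening on a model's internal activations with a direction that promotes the desired behavior. The sketch below illustrates that general technique, not CoSToM's actual procedure: the model (`gpt2`), the intervention layer, the contrast prompts, and the scaling factor `alpha` are all assumptions introduced for illustration.

```python
# Illustrative activation-steering sketch (NOT the paper's CoSToM method).
# A "ToM direction" is estimated from a contrast between prompts that require
# mental-state attribution and matched controls, then added to the residual
# stream of one transformer block during generation.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder model, assumption for illustration
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def mean_hidden(prompts, layer):
    """Mean last-token hidden state at the output of block `layer`."""
    states = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        # hidden_states[layer + 1] is the output of model.transformer.h[layer]
        states.append(out.hidden_states[layer + 1][0, -1])
    return torch.stack(states).mean(dim=0)

layer = 6   # assumed intervention layer, chosen arbitrarily
alpha = 4.0  # assumed steering strength
tom_prompts = ["Sally thinks the ball is in the basket because she did not see it move."]
ctrl_prompts = ["The ball is in the box."]
steer = mean_hidden(tom_prompts, layer) - mean_hidden(ctrl_prompts, layer)
steer = steer / steer.norm()

def hook(module, inputs, output):
    # Add the steering direction to this block's output hidden states.
    hidden = output[0] if isinstance(output, tuple) else output
    hidden = hidden + alpha * steer.to(hidden.dtype)
    if isinstance(output, tuple):
        return (hidden,) + output[1:]
    return hidden

handle = model.transformer.h[layer].register_forward_hook(hook)
ids = tok("Where will Sally look for the ball?", return_tensors="pt")
out_ids = model.generate(**ids, max_new_tokens=20)
print(tok.decode(out_ids[0], skip_special_tokens=True))
handle.remove()
```

In practice a single contrastive pair and a hand-picked layer would be far too crude; a causal-oriented method would presumably select the direction, layer, and strength through systematic intervention analysis, which is where CoSToM likely differs from this toy example.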
The implications of this research are significant for fields like human-AI interaction, education, and mental health, where understanding and responding to human mental states is crucial. Future work will likely explore how CoSToM can be integrated into existing models and whether it can generalize across diverse social contexts. The paper also opens up broader questions about the nature of cognition in AI and how to measure it.