Research via arXiv cs.AI

Rethinking Generalization in Reasoning SFT

Researchers challenge the notion that supervised fine-tuning memorizes while reinforcement learning generalizes. They find that cross-domain generalization under SFT is conditional, shaped by optimization dynamics, training data, and base-model capability. This challenges prevailing narratives in LLM post-training.

A recent study published on arXiv revisits the claim that supervised fine-tuning (SFT) of large language models (LLMs) primarily memorizes patterns, whereas reinforcement learning (RL) facilitates generalization. The researchers focused on reasoning SFT with long chain-of-thought (CoT) supervision, aiming to identify the conditions under which SFT can achieve cross-domain generalization.

The study finds that cross-domain generalization is not inherently absent in SFT but is instead conditional, depending on the interplay of optimization dynamics, the quality and diversity of training data, and the capability of the base model. Notably, some reported failures of SFT to generalize are attributed to under-optimization: cross-domain performance initially degrades before recovering and improving with extended training, a phenomenon the authors describe as a 'dip-and-recovery' pattern.
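One practical consequence of the dip-and-recovery pattern is that evaluating a single mid-training checkpoint can misclassify SFT as "failing to generalize." A minimal sketch of detecting the pattern from a sequence of cross-domain evaluation scores (the helper and the example trajectory are illustrative, not from the paper):

```python
def classify_trajectory(scores, tol=0.0):
    """Classify cross-domain eval scores recorded at successive SFT
    checkpoints (a hypothetical helper for illustration).

    Returns 'dip-and-recovery' when performance first drops below the
    starting score and later climbs back above it; otherwise reports
    whether any dip occurred at all.
    """
    start = scores[0]
    dipped = False
    for s in scores[1:]:
        if s < start - tol:
            dipped = True  # cross-domain performance degraded
        elif dipped and s > start + tol:
            return "dip-and-recovery"  # degraded, then surpassed start
    return "no-recovery" if dipped else "no-dip"

# Illustrative trajectory: accuracy degrades early in training, then
# recovers and improves with extended optimization.
trajectory = [0.42, 0.35, 0.31, 0.38, 0.47, 0.51]
print(classify_trajectory(trajectory))  # → dip-and-recovery
```

Stopping this run at the third checkpoint (0.31) would report a generalization failure that longer training reverses, which is the under-optimization confound the study highlights.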

The implications are significant: with appropriate optimization strategies, diverse and representative training data, and sufficiently capable base models, SFT can achieve meaningful cross-domain generalization. This challenges the prevailing memorization-versus-generalization narrative and opens new avenues for improving the generalizability of LLMs through SFT alone.

#llm #sft #generalization #optimization #ai-models