🤖 AI Summary
This work investigates how outcome-driven reinforcement learning (RL) can induce chain-of-thought reasoning capabilities in Transformer models. By designing synthetic graph traversal tasks and combining gradient-flow analysis, theoretical proofs, and language-model experiments, the study provides the first theoretical guarantee that, under appropriate data distributions, reward signals based solely on final-answer correctness can guide a single-layer Transformer to converge to an interpretable iterative reasoning algorithm. A key finding is that the presence of sufficient "easy samples" (instances requiring fewer reasoning steps) in the training data is critical for generalization to longer reasoning chains; without such samples, learning fails. This research uncovers the intrinsic mechanism linking the data distribution in outcome-driven RL to the emergence of reasoning abilities.
📝 Abstract
Transformers trained via Reinforcement Learning (RL) with outcome-based supervision can spontaneously develop the ability to generate intermediate reasoning steps (Chain-of-Thought). Yet the mechanism by which sparse rewards drive policy gradient to discover such systematic reasoning remains poorly understood. We address this by analyzing the policy gradient dynamics of single-layer Transformers on a synthetic graph traversal task that cannot be solved without Chain-of-Thought but admits a simple iterative solution. We prove that despite training solely on final-answer correctness, policy gradient drives the Transformer to converge to a structured, interpretable algorithm that iteratively traverses the graph vertex-by-vertex. We characterize the distributional properties required for this emergence, identifying the critical role of "simple examples": instances requiring fewer reasoning steps. When the training distribution places sufficient mass on these simpler examples, the Transformer learns a generalizable traversal strategy that extrapolates to longer chains; when this mass vanishes, policy gradient learning becomes infeasible. We corroborate these results with experiments on synthetic data and with real-world language models on mathematical reasoning tasks, confirming that our theoretical findings carry over to practical settings.
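To make the setup concrete, the following is a minimal sketch (not the paper's actual implementation; all function names and the successor-function encoding of the graph are illustrative assumptions) of the kind of synthetic task described: a directed graph where each vertex has one outgoing edge, an interpretable vertex-by-vertex traversal that serves as the Chain-of-Thought, and a reward that depends only on whether the final answer is correct.

```python
import random

def make_instance(n_vertices, n_steps, rng):
    """Sample a toy instance: a graph given as a successor function,
    a start vertex, and the target vertex reached after n_steps hops.
    (Illustrative encoding, not the paper's exact construction.)"""
    succ = [rng.randrange(n_vertices) for _ in range(n_vertices)]
    start = rng.randrange(n_vertices)
    target = start
    for _ in range(n_steps):
        target = succ[target]
    return succ, start, target

def cot_traversal(succ, start, n_steps):
    """The interpretable iterative algorithm the Transformer is shown to
    converge to: emit one intermediate vertex per reasoning step."""
    chain = [start]
    for _ in range(n_steps):
        chain.append(succ[chain[-1]])
    return chain  # chain[-1] is the model's final answer

def outcome_reward(chain, target):
    """Outcome-based supervision: the reward inspects only the final
    answer, never the intermediate chain."""
    return 1.0 if chain[-1] == target else 0.0
```

Under this reward, a training distribution with sufficient mass on small `n_steps` values supplies the "simple examples" the analysis identifies as necessary; the learned traversal then extrapolates to larger `n_steps` at test time.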