🤖 AI Summary
This paper investigates the root cause of inconsistent length generalization in generative Transformers on arithmetic reasoning tasks, particularly the performance gap between addition and multiplication, and the abrupt breakdown of generalization observed under modular arithmetic (e.g., mod 100 vs. mod 101).
Method: The paper proposes the first structure-aware generalization framework that formalizes the coupling among a task's algebraic structure, the relative positional encoding mechanism, and the training distribution. It proves theoretically that length extrapolation depends critically on the synergy between addition's translation invariance and relative positional encoding, and that modulus mismatch disrupts this synergy, causing generalization to collapse.
Contribution/Results: Through cross-architecture experiments on GPT-family models and systematic ablation studies, the framework accurately predicts sharp generalization transitions across moduli. It offers both explanatory power, unifying disparate empirical phenomena, and predictive capability, guiding structure-informed, sample-efficient training strategies for arithmetic reasoning.
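The mod 100 vs. mod 101 contrast rests on a base-alignment fact worth making explicit: because 100 divides 10², the residue n mod 100 is determined by the last two decimal digits alone, whereas n mod 101 depends on every digit of n. A minimal illustrative sketch (not the paper's experimental code):

```python
# Why mod 100 is "local" in base-10 digit strings while mod 101 is not.
# Illustrative only; not the paper's code.

def mod_from_last_k_digits(n: int, modulus: int, k: int) -> int:
    """Attempt to compute n % modulus using only the last k decimal digits."""
    return (n % 10**k) % modulus

n = 987_654_321

# mod 100: the last two digits fully determine the residue,
# so a fixed-width window over the input suffices.
assert mod_from_last_k_digits(n, 100, 2) == n % 100

# mod 101: two trailing digits are NOT enough; the residue depends on
# every digit, so no fixed-width window of the input can compute it.
assert mod_from_last_k_digits(n, 101, 2) != n % 101
```

This is the "base mismatch" in concrete form: a modulus dividing a power of the base admits a length-independent local rule, while a coprime modulus forces the model to aggregate information from the entire input.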
📝 Abstract
Transformer-based models excel at a wide range of tasks, but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. Arithmetic tasks provide a controlled setting in which to study these capabilities, yet performance anomalies persist, such as inconsistent effectiveness on multiplication and erratic generalization in modular addition (e.g., modulo 100 vs. 101). This paper develops a unified theoretical framework for understanding the generalization behavior of transformers on arithmetic tasks, focusing on length generalization. Through detailed analysis of addition, multiplication, and modular operations, we show that addition's translation invariance aligns with relative positional encoding to yield robust generalization, whereas base mismatch in modular operations disrupts this alignment. Experiments across GPT-family models validate the framework, confirming its ability to predict generalization behavior. Our work highlights the importance of task structure and training-data distribution for data-efficient, structure-aware training, providing a systematic approach to understanding length generalization in transformers.
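The translation-invariance claim for addition can be made concrete: schoolbook addition applies one and the same local rule (digit + digit + incoming carry) at every position, so the rule is indexed by relative offset rather than absolute position, which is exactly what relative positional encoding can express at unseen lengths. A hypothetical sketch, not taken from the paper:

```python
# Sketch of why addition is translation-invariant: the identical local
# step (two digits plus an incoming carry) is applied at every position,
# so a model keyed to *relative* offsets can reuse it at any length.
# Illustrative code, not from the paper.

def add_digits(a: list[int], b: list[int]) -> list[int]:
    """Schoolbook addition on little-endian digit lists (least significant digit first)."""
    out, carry = [], 0
    for i in range(max(len(a), len(b))):
        da = a[i] if i < len(a) else 0
        db = b[i] if i < len(b) else 0
        # The same position-independent rule at every step:
        carry, digit = divmod(da + db + carry, 10)
        out.append(digit)
    if carry:
        out.append(carry)
    return out

# The rule works unchanged on short and long inputs:
assert add_digits([7, 9, 9], [5, 0, 0]) == [2, 0, 0, 1]  # 997 + 5 = 1002
```

Multiplication, by contrast, mixes digits across positions (every digit of one operand interacts with every digit of the other), which is one way to read the paper's observed addition/multiplication gap.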