Principled Understanding of Generalization for Generative Transformer Models in Arithmetic Reasoning Tasks

📅 2024-07-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper investigates the root cause of inconsistent length generalization in generative Transformers on arithmetic reasoning tasks, particularly the performance gap between addition and multiplication, and the abrupt generalization breakdown observed under modular arithmetic (e.g., mod 100 vs. mod 101). Method: We propose the first structure-aware generalization framework that formalizes the coupling among task algebraic structure, relative positional encoding mechanisms, and training distribution. We theoretically prove that length extrapolation critically depends on the synergy between addition's translation invariance and relative positional encoding, and that modulus mismatch disrupts this synergy, causing generalization to collapse. Contribution/Results: Through cross-architecture experiments on GPT-family models and systematic ablation studies, our framework accurately predicts sharp generalization transitions across different moduli. It offers both explanatory power, unifying disparate empirical phenomena, and predictive capability, guiding structure-informed, sample-efficient training strategies for arithmetic reasoning.
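The mod 100 vs. mod 101 contrast in the summary can be made concrete with a toy check (illustrative only, not code from the paper): in base 10, reduction mod 100 depends only on the last two digits, a position-local operation, whereas reduction mod 101 depends on the whole number.

```python
# Illustrative toy check (not from the paper): in base 10, n % 100 is a
# "digit-local" operation, while n % 101 is not.

# n % 100 is fully determined by the last two decimal digits ...
assert 1234 % 100 == 34 % 100     # both 34
assert 99934 % 100 == 34 % 100    # longer operand, same suffix, same result

# ... but n % 101 is not: numbers sharing the same two-digit suffix
# reduce differently, so no fixed local rule over trailing digits works.
assert 1234 % 101 != 34 % 101     # 22 vs. 34
assert 99934 % 101 != 34 % 101    # 45 vs. 34
```

This mirrors the paper's claim that a modulus aligned with the token base preserves the locality that relative positional encoding can exploit, while a mismatched modulus breaks it.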

📝 Abstract
Transformer-based models excel at many tasks, but their generalization capabilities, especially in arithmetic reasoning, remain incompletely understood. Arithmetic tasks provide a controlled framework for exploring these capabilities, yet performance anomalies persist, such as inconsistent effectiveness in multiplication and erratic generalization in modular addition (e.g., modulo 100 vs. 101). This paper develops a unified theoretical framework for understanding the generalization behaviors of transformers on arithmetic tasks, focusing on length generalization. Through detailed analysis of addition, multiplication, and modular operations, we show that the translation invariance of addition aligns with relative positional encoding to yield robust generalization, while base mismatch in modular operations disrupts this alignment. Experiments across GPT-family models validate the framework, confirming its ability to predict generalization behavior. Our work highlights the importance of task structure and training-data distribution for data-efficient, structure-aware training, providing a systematic approach to understanding length generalization in transformers.
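The translation-invariance point can be sketched with grade-school addition (a toy illustration, not the paper's model): the per-column rule, add two digits plus the incoming carry, is identical at every position, which is exactly the kind of position-independent structure relative positional encoding can carry over to unseen lengths.

```python
from itertools import zip_longest

def column_add(a_digits, b_digits):
    """Grade-school addition over little-endian digit lists.

    The rule applied in each column depends only on the two local digits
    and the incoming carry, never on the absolute position: the same loop
    body handles 5-digit and 50-digit operands alike (translation invariance).
    """
    carry, out = 0, []
    for da, db in zip_longest(a_digits, b_digits, fillvalue=0):
        s = da + db + carry
        out.append(s % 10)   # digit written in this column
        carry = s // 10      # carry forwarded to the next column
    if carry:
        out.append(carry)
    return out

# 1234 + 56 = 1290, as little-endian digits
assert column_add([4, 3, 2, 1], [6, 5]) == [0, 9, 2, 1]
```

Because the loop body is position-free, lengthening the operands adds more iterations of the same rule rather than new rules, which is why length extrapolation is plausible for addition but not for operations whose per-position rule changes with the modulus.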
Problem

Research questions and friction points this paper is trying to address.

Understanding generalization in transformers for arithmetic reasoning tasks
Explaining performance anomalies in multiplication and modular addition
Developing a framework for length generalization in arithmetic operations
Innovation

Methods, ideas, or system contributions that make the work stand out.

Develops theoretical framework for transformer generalization
Analyzes translation invariance in addition tasks
Validates framework with GPT-family models
Xingcheng Xu
Shanghai Artificial Intelligence Laboratory
Zibo Zhao
Hunyuan, Tencent; ShanghaiTech University
Haipeng Zhang
ShanghaiTech University
Yanqing Yang
Fudan University