🤖 AI Summary
This study investigates the limitations of transformers on in-distribution state-tracking tasks, with a focus on data efficiency and generalization across sequence lengths. Through large-scale comparative experiments spanning multiple supervision regimes, complemented by weight-sharing analyses, the work systematically evaluates performance differences between transformers and recurrent neural networks (RNNs). The findings reveal that the training data transformers require grows sharply with both state-space size and sequence length, and that transformers struggle to transfer what they learn across sequence lengths. In contrast, RNNs, benefiting from their recurrent inductive bias and weight sharing across lengths, demonstrate substantially higher data efficiency and robust cross-length generalization. This work underscores the critical role of inductive bias in sequence modeling and offers new insights for designing efficient architectures for state tracking.
📝 Abstract
Despite the remarkable practical success of transformer-based language models, recent work has raised concerns about their ability to perform state tracking. In particular, a growing body of literature has shown this limitation primarily through failures in out-of-distribution (OOD) generalization, such as length extrapolation. In this work, we shift attention to the in-distribution implications of these limitations. We conduct a large-scale experimental study of the data efficiency of transformers and recurrent neural networks (RNNs) across multiple supervision regimes. We find that the amount of training data required by transformers grows much more rapidly with state-space size and sequence length than for RNNs. Furthermore, we analyze the extent to which learned state-tracking mechanisms are shared across different sequence lengths. We show that transformers exhibit negligible or even detrimental weight sharing across lengths, indicating that they learn length-specific solutions in isolation. In contrast, recurrent models exhibit effective amortized learning by sharing weights across lengths, allowing data from one sequence length to improve performance on others. Together, these results demonstrate that state tracking remains a fundamental challenge for transformers, even when training and evaluation distributions match.
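To make "state tracking" concrete, here is a minimal sketch of the kind of task such studies typically use. The specific choice — composing permutations of three elements (the group S₃) with per-step supervision — is an assumption for illustration, not necessarily the paper's benchmark:

```python
import itertools
import random

# Hypothetical toy task (illustrative, not taken from the paper):
# state tracking as composing permutations of 3 elements (S_3).
# The model must map a sequence of permutations to the running state,
# i.e., the composition of all updates seen so far.

def compose(p, q):
    """Apply permutation q after p: (q o p)(i) = q[p[i]]."""
    return tuple(q[i] for i in p)

IDENTITY = (0, 1, 2)
S3 = list(itertools.permutations(range(3)))  # all 6 permutations

def make_example(length, rng):
    """One training example: a sequence of random permutations plus
    the state after each step (per-step supervision)."""
    seq = [rng.choice(S3) for _ in range(length)]
    states, s = [], IDENTITY
    for p in seq:
        s = compose(s, p)  # constant-size recurrent update
        states.append(s)
    return seq, states

seq, states = make_example(length=5, rng=random.Random(0))
```

An RNN realizes this task with a single constant-size recurrent update applied at every position, so data from one sequence length directly informs others; a fixed-depth transformer has no such built-in recurrence, which is the structural gap the study's data-efficiency and weight-sharing analyses probe.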