🤖 AI Summary
This study investigates how Transformers organize their internal representations during next-token prediction pretraining to reflect the underlying structure of the world. By constructing synthetic sequential data with known latent factors and employing geometric activation analysis, subspace dimensionality estimation, and modeling of contextual embedding distributions, the authors find that Transformers exhibit an inductive bias toward decomposing their representations into orthogonal low-dimensional subspaces, one per latent factor. When conditional independence holds among latent factors, the model learns lossless factorized representations. Remarkably, even in the presence of noise or hidden dependencies, such structured representations are prioritized during early training. These findings reveal principles governing how representations form in Transformers and demonstrate an inherent preference for factorized structures that mirror the compositional aspects of the environment.
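The analyses named above (subspace dimensionality estimation and geometric activation analysis) can be sketched on toy data. In this hypothetical example, assumed for illustration rather than taken from the paper, activations are built by writing two latent factors into disjoint 2-D subspaces of a 16-D "residual stream"; PCA then recovers the total subspace dimension, and principal angles confirm the factor subspaces are orthogonal:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "residual stream": two latent factors, each written into its own
# 2-D subspace of a 16-D space (illustrative stand-in for the paper's setup).
d, n = 16, 500
basis = np.linalg.qr(rng.normal(size=(d, d)))[0]   # random orthonormal basis
sub_a, sub_b = basis[:, :2], basis[:, 2:4]         # orthogonal by construction
acts = rng.normal(size=(n, 2)) @ sub_a.T + rng.normal(size=(n, 2)) @ sub_b.T

# Subspace dimensionality estimate: count PCA components carrying variance.
_, s, _ = np.linalg.svd(acts - acts.mean(axis=0), full_matrices=False)
est_dim = int((s**2 / (s**2).sum() > 1e-6).sum())  # expect 2 + 2 = 4

# Orthogonality check: cosines of principal angles between the factor
# subspaces are the singular values of the cross-product of their bases.
overlap = np.linalg.svd(sub_a.T @ sub_b, compute_uv=False)

print(est_dim, overlap.max())  # 4 and ~0.0 (orthogonal subspaces)
```

A factored representation in a trained model would show the same signature: total activation dimension near the sum of per-factor dimensions, with near-zero overlap between the recovered factor subspaces.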
📝 Abstract
Transformers pretrained via next-token prediction learn to factor their world into parts, representing these factors in orthogonal subspaces of the residual stream. We formalize two representational hypotheses: (1) a representation in the product space of all factors, whose dimension grows exponentially with the number of parts, or (2) a factored representation in orthogonal subspaces, whose dimension grows linearly. The factored representation is lossless when factors are conditionally independent but sacrifices predictive fidelity otherwise, creating a tradeoff between dimensional efficiency and accuracy. We derive precise predictions about the geometric structure of activations under each hypothesis, including the number of subspaces, their dimensionality, and the arrangement of context embeddings within them. We test between these hypotheses on Transformers trained on synthetic processes with known latent structure. Models learn factored representations when factors are conditionally independent, and continue to favor them early in training even when noise or hidden dependencies undermine conditional independence, reflecting an inductive bias toward factoring at the cost of fidelity. This provides a principled explanation for why Transformers decompose the world into parts, and suggests that interpretable low-dimensional structure may persist even in models trained on complex data.
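The dimensional tradeoff between the two hypotheses reduces to simple counting. A minimal sketch, with hypothetical factor sizes chosen only for illustration: a product-space representation needs one dimension per joint configuration of the factors, while a factored representation allocates an orthogonal subspace per factor, so their dimensions are the product and the sum of the per-factor sizes, respectively:

```python
import math

# Hypothetical latent structure: 5 independent factors, 4 states each.
factor_dims = [4, 4, 4, 4, 4]

# Hypothesis (1): product-space representation, one axis per joint state.
product_dim = math.prod(factor_dims)   # exponential in the number of factors

# Hypothesis (2): factored representation, one orthogonal subspace per factor.
factored_dim = sum(factor_dims)        # linear in the number of factors

print(product_dim, factored_dim)  # 1024 vs 20
```

With just five small factors the product representation already needs 50x the dimensions, which is why the factored form is attractive despite losing fidelity when conditional independence fails.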