AI Summary
This study investigates the dynamic evolution of conceptual features during language model pretraining and its mechanistic impact on downstream performance. We propose a fine-grained analytical framework based on crosscoder sparse dictionary learning, enabling the first cross-temporal tracking and attribution of linearly interpretable features across Transformer training stages. Methodologically, we model sequences of pretraining snapshots to quantify the evolution of feature activation patterns, emergence timing, and representational complexity. Our key contributions are threefold: (1) empirical validation and refinement of the two-phase learning theory: early stages prioritize statistical pattern acquisition, while later stages shift toward constructing higher-order semantic features; (2) the discovery that roughly 80% of critical features emerge in a concentrated burst during mid-training, with their emergence timeline tightly synchronized with downstream performance gains; (3) the establishment of causal links between feature evolution trajectories and generalization capability, providing theoretical foundations and analytical tools for interpretable AI and efficient pretraining.
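To make the crosscoder setup concrete, below is a minimal sketch of a crosscoder-style sparse dictionary applied to activations from several pretraining snapshots. This is an illustrative reconstruction under stated assumptions, not the authors' implementation: the class name `CrossCoder`, the dimensions `d_model` and `n_features`, and the loss coefficient are all hypothetical, and it assumes PyTorch with residual-stream activations collected at matching positions from each snapshot.

```python
import torch
import torch.nn as nn


class CrossCoder(nn.Module):
    """Shared sparse feature dictionary over activations from multiple snapshots."""

    def __init__(self, n_snapshots: int, d_model: int, n_features: int):
        super().__init__()
        # One encoder per snapshot maps that snapshot's activations into a
        # shared feature space; their pre-activations are summed.
        self.encoders = nn.ModuleList(
            nn.Linear(d_model, n_features, bias=False) for _ in range(n_snapshots)
        )
        # One decoder per snapshot reconstructs that snapshot's activations
        # from the single shared sparse feature vector.
        self.decoders = nn.ModuleList(
            nn.Linear(n_features, d_model, bias=False) for _ in range(n_snapshots)
        )

    def forward(self, acts: list[torch.Tensor]):
        pre = sum(enc(a) for enc, a in zip(self.encoders, acts))
        feats = torch.relu(pre)  # sparse, non-negative feature activations
        recons = [dec(feats) for dec in self.decoders]
        return feats, recons


def crosscoder_loss(acts, recons, feats, l1_coef: float = 1e-3):
    # Per-snapshot reconstruction error plus an L1 sparsity penalty on features.
    recon = sum(((a - r) ** 2).mean() for a, r in zip(acts, recons))
    return recon + l1_coef * feats.abs().sum(dim=-1).mean()
```

Because the feature dictionary is shared while the decoders are snapshot-specific, per-snapshot decoder norms or activation statistics for the same feature can then be compared across training time, which is what makes cross-temporal tracking possible.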
Abstract
Language models obtain extensive capabilities through pre-training. However, the pre-training process remains a black box. In this work, we track the evolution of linearly interpretable features across pre-training snapshots using crosscoders, a sparse dictionary learning method. We find that most features begin to form around a specific point in training, while more complex patterns emerge in later training stages. Feature attribution analyses reveal causal connections between feature evolution and downstream performance. Our feature-level observations are highly consistent with previous findings on the Transformer's two-stage learning process, which we term a statistical learning phase and a feature learning phase. Our work opens up the possibility of tracking fine-grained representation progress during language model learning dynamics.
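As a rough illustration of how feature emergence timing could be read off per-snapshot statistics, the sketch below finds, for each feature, the first snapshot at which its activation frequency crosses a threshold. The array layout, the threshold value, and the function name are assumptions made for the example, not quantities taken from the paper.

```python
import numpy as np


def emergence_step(act_freq: np.ndarray, threshold: float = 0.01) -> np.ndarray:
    """For each feature (row), return the first snapshot index (column) where its
    activation frequency reaches `threshold`, or -1 if it never does."""
    above = act_freq >= threshold          # boolean, shape [n_features, n_snapshots]
    first = above.argmax(axis=1)           # index of the first True in each row
    first[~above.any(axis=1)] = -1         # rows with no True never emerge
    return first


# Example: 4 hypothetical features tracked over 5 snapshots.
freq = np.array([
    [0.00, 0.00, 0.02, 0.05, 0.06],   # emerges at snapshot 2
    [0.03, 0.04, 0.04, 0.05, 0.05],   # present from the start (snapshot 0)
    [0.00, 0.00, 0.00, 0.00, 0.00],   # never emerges
    [0.00, 0.00, 0.01, 0.03, 0.04],   # emerges at snapshot 2
])
print(emergence_step(freq))  # [ 2  0 -1  2]
```

Comparing the distribution of such emergence steps with the downstream performance curve over the same snapshots is one simple way to check whether feature formation and capability gains are synchronized.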