🤖 AI Summary
Existing norm-based generalization bounds for Transformers can depend exponentially on depth and typically require norm constraints fixed a priori, which limits their ability to characterize trained models. This work derives spectrum-adaptive post hoc generalization bounds for multi-layer Transformers under layerwise spectral norm control, expressing model complexity through layerwise Schatten quantities of the query-key, value, and feedforward weight matrices. The Schatten index need not be fixed in advance: it can be selected after training, separately for each matrix type and layer, so the bounds trade off spectral complexity against dimension- and depth-dependent factors according to the learned singular-value profiles. Empirical comparisons of BERT-adapted proxies for the leading complexity factors suggest that the proxies induced by these bounds grow more slowly with depth and hidden dimension than the corresponding norm-based proxies.
📝 Abstract
Understanding why trained Transformers generalize well is a fundamental problem in modern machine learning theory, and complexity-based generalization bounds provide a principled way to study this question. While existing norm-based bounds for Transformers remove the explicit polynomial dependence on the hidden dimension, they typically impose fixed norm constraints specified a priori and can exhibit unfavorable exponential dependence on depth. In this paper, we derive spectrum-adaptive post hoc generalization bounds for multi-layer Transformers. Under layerwise spectral norm control, the bounds are expressed in terms of layerwise Schatten quantities of the query-key, value, and feedforward weight matrices. Since the Schatten indices need not be fixed a priori and can instead be selected after training, separately for each matrix type and layer, the bounds adaptively trade off spectral complexity against the dimension- and depth-dependent factors according to the learned singular-value profiles. Empirical comparisons of BERT-adapted proxies for the leading complexity factors suggest that the proxies induced by our bounds grow more slowly with depth and hidden dimension than the corresponding norm-based proxies. Overall, our results provide a complexity-based perspective on how the spectral structure of trained Transformers is reflected in generalization analyses.
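For readers unfamiliar with the Schatten quantities mentioned above, the following is a minimal sketch of how a Schatten-p norm can be computed from a matrix's singular values and how it varies with the index p. It illustrates only the standard definition, not the paper's bound: the helper `schatten_norm`, the synthetic matrix `W` (a low-rank-plus-noise stand-in for a trained weight matrix), and the grid of indices are all assumptions made for illustration.

```python
import numpy as np


def schatten_norm(W, p):
    """Schatten-p norm of W: the l_p norm of its singular values.

    p = 1 gives the nuclear norm, p = 2 the Frobenius norm,
    and p = inf the spectral norm.
    """
    s = np.linalg.svd(W, compute_uv=False)  # singular values only
    if np.isinf(p):
        return float(s.max())
    return float(np.sum(s ** p) ** (1.0 / p))


# Hypothetical stand-in for a trained weight matrix:
# a rank-4 component plus small dense noise.
rng = np.random.default_rng(0)
d, r = 64, 4
W = rng.standard_normal((d, r)) @ rng.standard_normal((r, d)) / d
W = W + 0.01 * rng.standard_normal((d, d))

# Evaluate the norm on an illustrative grid of Schatten indices.
ps = [1.0, 2.0, 4.0, np.inf]
norms = {p: schatten_norm(W, p) for p in ps}
for p, value in norms.items():
    print(f"p={p}: {value:.4f}")
```

For a fixed matrix the Schatten-p norm is non-increasing in p, so a faster-decaying singular-value profile narrows the gap between small and large p. That sensitivity to the spectrum is what makes selecting the index per matrix and per layer after training meaningful; how the paper weighs each norm against its dimension- and depth-dependent factors is not reproduced here.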