🤖 AI Summary
This work addresses the puzzle of why over-parameterized deep neural networks generalize well despite defying classical parameter-count-based generalization theories. From the perspective of representation learning, the paper proposes a generalization analysis framework that does not depend on model size. It characterizes the geometric convergence of learned embeddings via the Wasserstein distance and quantifies the sensitivity of the prediction mapping through its Lipschitz constant, thereby deriving an embedding-dependent generalization bound. Theoretically, the bound shows that generalization error is primarily governed by the intrinsic dimension of the embedding distribution. Empirical validation across diverse architectures and datasets demonstrates that this quantity correlates strongly with actual generalization performance and remains predictive under varying training conditions.
📝 Abstract
Deep neural networks often generalize well despite heavy over-parameterization, challenging classical parameter-based analyses. We study generalization from a representation-centric perspective and analyze how the geometry of learned embeddings controls predictive performance for a fixed trained model. We show that population risk can be bounded by two factors: (i) the intrinsic dimension of the embedding distribution, which determines the convergence rate of the empirical embedding distribution to the population distribution in Wasserstein distance, and (ii) the sensitivity of the downstream mapping from embeddings to predictions, characterized by Lipschitz constants. Together, these yield an embedding-dependent error bound that does not rely on parameter counts or hypothesis-class complexity. At the final embedding layer, architectural sensitivity vanishes and the bound is dominated by embedding dimension, explaining its strong empirical correlation with generalization performance. Experiments across architectures and datasets validate the theory and demonstrate the utility of embedding-based diagnostics.
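The two quantities in the bound are both measurable from data: a Wasserstein distance between empirical embedding distributions and a Lipschitz constant of the head mapping embeddings to predictions. The following is a minimal NumPy sketch of how such a bound-style diagnostic could be estimated in practice, not the paper's actual estimator. It uses a sliced (random-projection) approximation of the 1-Wasserstein distance, assumes equal sample sizes, and assumes a linear prediction head whose exact Lipschitz constant is its spectral norm; the helper names `sliced_wasserstein` and `linear_lipschitz` are illustrative.

```python
import numpy as np

def sliced_wasserstein(X, Y, n_proj=100, seed=0):
    """Monte-Carlo sliced 1-Wasserstein estimate between two empirical
    embedding distributions X, Y of shape (n, d) with equal n."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    dirs = rng.normal(size=(n_proj, d))
    dirs /= np.linalg.norm(dirs, axis=1, keepdims=True)  # unit directions
    total = 0.0
    for v in dirs:
        # 1-D W1 between sorted projections (quantile coupling)
        px, py = np.sort(X @ v), np.sort(Y @ v)
        total += np.mean(np.abs(px - py))
    return total / n_proj

def linear_lipschitz(W):
    """Lipschitz constant of a linear head x -> W x: the largest
    singular value (spectral norm) of W."""
    return np.linalg.svd(W, compute_uv=False)[0]

# Toy demo: two draws from the same embedding distribution stand in for
# "population" vs. "empirical" embeddings (synthetic data, for illustration).
rng = np.random.default_rng(1)
pop = rng.normal(size=(2000, 16))
emp = rng.normal(size=(2000, 16))
W = rng.normal(size=(10, 16)) / 4.0
# Bound-style diagnostic: head sensitivity times embedding-distribution gap.
gap_proxy = linear_lipschitz(W) * sliced_wasserstein(pop, emp)
```

As the sample size grows, the sliced Wasserstein term shrinks at a rate governed by the embedding distribution's intrinsic dimension, which is the mechanism the bound formalizes; multiplying by the head's Lipschitz constant converts the embedding-space gap into a gap in predictions.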