🤖 AI Summary
Why do residual architectures (e.g., ResNet, Transformer) consistently improve performance with increased depth? This paper addresses this fundamental question from a functional perspective. We propose and rigorously prove the *Residual Expansion Theorem*, establishing that depth growth is equivalent to an exponential expansion of implicit ensemble capacity: each added layer introduces new computational paths, inducing combinatorial path explosion and yielding a hierarchical ensemble mechanism. This mechanism critically relies on normalization layers to suppress signal explosion, while depth itself implicitly imposes regularization that governs model complexity. Based on this insight, we provide the first theoretical foundation for normalization-free residual architectures and derive the *module scaling principle*—a theoretically grounded strategy for stabilizing deep-network training. Our approach integrates analytical modeling, combinatorial mathematics, and function-space analysis to unify the interplay among depth, ensembling, and regularization.
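The "combinatorial path explosion" above can be made concrete with the standard unrolling of a residual stack, in the spirit of Veit et al. (2016). The following is only a sketch for the linear case, where the product distributes exactly; the paper's precise statement of the Residual Expansion Theorem may differ:

```latex
% An L-layer residual stack y = (I + F_L) \cdots (I + F_1) x, with each
% F_l a linear module, expands into a sum over all subsets of layers:
y \;=\; \prod_{l=L}^{1} \bigl(I + F_l\bigr)\, x
  \;=\; \sum_{S \subseteq \{1,\dots,L\}} \Bigl(\prod_{l \in S} F_l\Bigr) x .
```

The sum ranges over $2^L$ computation paths (one per subset of layers), so adding a single layer doubles the number of paths: depth growth is literally an exponential expansion of the implicit ensemble.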
📝 Abstract
Deep residual architectures, such as ResNet and the Transformer, have enabled models of unprecedented depth, yet a formal understanding of why depth is so effective remains an open question. A popular intuition, following Veit et al. (2016), is that residual networks behave like ensembles of many shallower models. Our key finding is an explicit analytical formula that verifies this ensemble perspective, proving that increasing network depth is mathematically equivalent to expanding the size of the implicit ensemble. Furthermore, our expansion reveals a hierarchical ensemble structure in which the combinatorial growth of computation paths leads to an explosion in the output signal. This offers a first-principles explanation for the historical dependence on normalization layers in deep-model training and sheds new light on a family of successful normalization-free techniques such as SkipInit and Fixup. Whereas these prior approaches infer scaling factors through optimizer analysis or a heuristic analogy to Batch Normalization, our work derives the explanation directly from the network's inherent functional structure: the Residual Expansion Theorem shows that scaling each residual module provides a principled solution to taming the combinatorial explosion inherent to these architectures. We further show that this scaling acts as a capacity control that implicitly regularizes the model's complexity.
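The signal explosion and its cure by module scaling can be illustrated numerically. The sketch below uses random linear residual modules and a uniform per-module scale of $1/L$; these choices are illustrative assumptions, not the paper's exact construction or its prescribed scaling factor:

```python
import numpy as np

rng = np.random.default_rng(0)
d, L = 16, 30  # feature width and depth (illustrative values)

def forward(x, mods, scale=1.0):
    """Residual stack x <- x + scale * F_l(x), with each F_l linear."""
    for W in mods:
        x = x + scale * (W @ x)
    return x

# Random modules with O(1) output scale per layer.
mods = [rng.standard_normal((d, d)) / np.sqrt(d) for _ in range(L)]
x = rng.standard_normal(d)

norm_in = np.linalg.norm(x)
norm_unscaled = np.linalg.norm(forward(x, mods))            # paths compound
norm_scaled = np.linalg.norm(forward(x, mods, scale=1 / L))  # paths tamed

# Without scaling, each layer multiplies the expected squared norm by
# roughly 2, so the output blows up geometrically with depth; scaling
# each module by 1/L keeps the output on the order of the input.
print(f"paths: {2**L}, input: {norm_in:.2f}, "
      f"unscaled: {norm_unscaled:.2e}, scaled: {norm_scaled:.2f}")
```

Running this shows the unscaled stack's output norm exceeding the input norm by several orders of magnitude while the scaled stack stays close to it, which is the behavior the module scaling principle is designed to guarantee.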