Hamiltonian Mechanics of Feature Learning: Bottleneck Structure in Leaky ResNets

📅 2024-05-27
🏛️ arXiv.org
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work investigates the representation dynamics of Leaky ResNets in the infinite-depth limit, focusing on “representation geodesics” and their connection to parameter-norm minimization. The authors give a variational formulation grounded in Hamiltonian mechanics, showing how the competition between a kinetic energy (which suppresses discontinuities along the representation path) and a potential energy (which drives low-dimensional compression) induces bottleneck structures and explains multiscale representation transitions and timescale separation. The analysis identifies a three-phase representation evolution at large effective depth: rapid dimensionality reduction → slow evolution on a low-dimensional manifold → rapid dimensionality expansion. Leveraging this insight, the authors train with an adaptive layer-wise step size designed to match the separation of timescales.

📝 Abstract
We study Leaky ResNets, which interpolate between ResNets and Fully-Connected nets depending on an 'effective depth' hyper-parameter $\tilde{L}$. In the infinite-depth limit, we study 'representation geodesics' $A_{p}$: continuous paths in representation space (similar to NeuralODEs) from input $p=0$ to output $p=1$ that minimize the parameter norm of the network. We give a Lagrangian and Hamiltonian reformulation, which highlights the importance of two terms: a kinetic energy that favors small layer derivatives $\partial_{p}A_{p}$ and a potential energy that favors low-dimensional representations, as measured by the 'Cost of Identity'. The balance between these two forces offers an intuitive understanding of feature learning in ResNets. We leverage this intuition to explain the emergence of a bottleneck structure, as observed in previous work: for large $\tilde{L}$ the potential energy dominates and leads to a separation of timescales, where the representation jumps rapidly from the high-dimensional inputs to a low-dimensional representation, moves slowly inside the space of low-dimensional representations, and then jumps back to the potentially high-dimensional outputs. Inspired by this phenomenon, we train with an adaptive layer step-size to adapt to the separation of timescales.
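To make the kinetic/potential decomposition concrete, here is a schematic of the kind of action functional the abstract describes. The quadratic kinetic term, the $\tilde{L}$ weighting of the potential, and the generic Cost-of-Identity functional $C(\cdot)$ are illustrative assumptions, not the paper's exact formulas.

```latex
% Schematic energy over representation paths A_p, p in [0,1]:
% a kinetic term penalizing layer-to-layer jumps plus a potential
% term rewarding low-dimensional representations through a
% Cost-of-Identity-style functional C. The \tilde{L} weighting is
% assumed for illustration.
\mathcal{E}[A] \;=\; \int_0^1
  \underbrace{\tfrac{1}{2}\,\lVert \partial_p A_p \rVert^2}_{\text{kinetic}}
  \;+\;
  \underbrace{\tilde{L}\, C(A_p)}_{\text{potential}}
  \, dp,
\qquad A_0 = \text{input}, \quad A_1 = \text{output}.
```

Under this reading, a large $\tilde{L}$ makes the potential term dominate, which matches the separation of timescales described in the abstract.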
Problem

Research questions and friction points this paper is trying to address.

Study feature learning in Leaky ResNets using Hamiltonian mechanics.
Analyze the emergence of bottleneck structures in deep networks (a diagnostic sketch follows this list).
Develop adaptive training methods that account for the separation of timescales.
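
As a concrete handle on bottleneck emergence, one minimal diagnostic is to track the effective rank of each layer's representation on a probe batch; a bottleneck shows up as a dip at intermediate layers. The entropy-based effective rank below is a standard diagnostic chosen here for illustration, not a quantity taken from the paper.

```python
import numpy as np

def effective_rank(A: np.ndarray, eps: float = 1e-12) -> float:
    """Entropy-based effective rank of a representation matrix A
    (samples x features). Returns exp(H(p)) for the normalized
    singular-value distribution p; roughly k when A has k dominant
    directions."""
    s = np.linalg.svd(A, compute_uv=False)
    p = s / (s.sum() + eps)
    p = p[p > eps]
    return float(np.exp(-np.sum(p * np.log(p))))

# Hypothetical usage: `reps[l]` holds layer l's activations on a probe
# batch; a bottleneck appears as a dip in `ranks` at middle layers.
# ranks = [effective_rank(A) for A in reps]
```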
Innovation

Methods, ideas, or system contributions that make the work stand out.

Lagrangian and Hamiltonian reformulation of ResNets
Adaptive layer step-size for training efficiency (sketched after this list)
Analysis of bottleneck structure in deep networks
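
The adaptive layer step-size idea can be pictured as spending the depth budget unevenly across layers. Below is a minimal sketch assuming a simple equidistribution rule (roughly equal representation movement per layer); this rule, and the function name `adaptive_layer_steps`, are illustrative assumptions rather than the paper's exact scheme.

```python
import numpy as np

def adaptive_layer_steps(speeds: np.ndarray, total: float = 1.0,
                         floor: float = 1e-3) -> np.ndarray:
    """Allocate per-layer step sizes dp_l (summing to `total`) inversely
    to the estimated representation speed ||d A_p / dp||: fine steps
    where the representation jumps rapidly (near input/output), coarse
    steps along the slow low-dimensional stretch."""
    speeds = np.maximum(np.asarray(speeds, dtype=float), floor)
    dp = 1.0 / speeds
    return dp / dp.sum() * total

# Hypothetical usage: per-layer speeds estimated from a probe batch,
# e.g. speeds[l] ~ ||A_{l+1} - A_l|| / dp_l from the previous epoch.
print(adaptive_layer_steps(np.array([5.0, 0.5, 0.4, 0.5, 6.0])))
```

The inverse-speed rule concentrates layers where the representation changes fastest, which is one natural way to adapt a fixed layer budget to the fast-slow-fast dynamics described above.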
Arthur Jacot
Assistant Professor, Courant Institute of Mathematical Sciences, NYU
Deep Learning
Alexandre Kaiser
Courant Institute, New York University, New York, NY 10012