🤖 AI Summary
Residual identity shortcuts can impede a generative model's learning of abstract semantic features, limiting representational capacity and degrading generation quality. This work identifies and characterizes that detrimental mechanism and proposes a depth-adaptive identity decay strategy: monotonically reducing the weight of the identity path as network depth increases, so as to encourage progressive feature abstraction. The method applies broadly to masked autoencoders (MAEs) and diffusion models. With a ViT-B/16 backbone on ImageNet-1K, it raises linear-probe accuracy from 67.8% to 72.7% and k-NN accuracy from 27.4% to 63.9%, and it also improves generation fidelity in diffusion models. Rather than a heuristic architectural tweak, the approach offers an interpretable, easily integrated optimization of the residual structure for generative representation learning, delivering consistent gains across self-supervised and generative settings without sacrificing trainability.
📝 Abstract
We show that introducing a weighting factor to reduce the influence of identity shortcuts in residual networks significantly enhances semantic feature learning in generative representation learning frameworks, such as masked autoencoders (MAEs) and diffusion models. Our modification notably improves feature quality, raising ImageNet-1K k-nearest-neighbor (k-NN) accuracy from 27.4% to 63.9% and linear probing accuracy from 67.8% to 72.7% for MAEs with a ViT-B/16 backbone, while also enhancing generation quality in diffusion models. This large gap suggests that, while the residual connection structure plays an essential role in facilitating gradient propagation, it may have a harmful side effect: it reduces the capacity for abstract learning by injecting an echo of shallower representations into deeper layers. We ameliorate this downside via a fixed formula that monotonically decreases the contribution of identity connections as layer depth increases. Our design promotes the gradual development of feature abstractions without impacting network trainability. Analyzing the representations learned by our modified residual networks, we find a correlation between low effective feature rank and downstream task performance.
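To make the mechanism concrete, the depth-dependent identity weighting described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the function names, the linear decay schedule, and the floor value `w_min` are all assumptions, since the abstract only states that the identity contribution decreases monotonically with depth via a fixed formula.

```python
def identity_weight(layer_idx: int, num_layers: int, w_min: float = 0.5) -> float:
    """Weight on the identity path at a given depth.

    Decays monotonically from 1.0 at the first layer to w_min at the
    last layer. A linear schedule is assumed here for illustration;
    the paper's exact formula is not given in the abstract.
    """
    if num_layers <= 1:
        return 1.0
    frac = layer_idx / (num_layers - 1)  # 0.0 at first layer, 1.0 at last
    return 1.0 - frac * (1.0 - w_min)


def residual_update(x: float, fx: float, layer_idx: int, num_layers: int) -> float:
    """Standard residual update x + f(x), with the identity path scaled
    by the depth-dependent weight: w * x + f(x)."""
    w = identity_weight(layer_idx, num_layers)
    return w * x + fx


# Toy usage over a 12-layer stack (the depth of a ViT-B/16 backbone),
# with a stand-in transform f(x) = 0.1 * x in place of a real block.
x = 1.0
for i in range(12):
    x = residual_update(x, 0.1 * x, layer_idx=i, num_layers=12)
```

Early layers keep a near-unscaled shortcut, preserving gradient flow, while deeper layers attenuate the echo of shallow features, forcing the residual branch to carry more of the representation.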