These are Not All the Features You are Looking For: A Fundamental Bottleneck In Supervised Pretraining

📅 2025-06-22
📈 Citations: 0
Influential: 0
🤖 AI Summary
This paper identifies a pervasive "information saturation bottleneck" in supervised pretraining: when models learn highly correlated, competing features, discriminative features encoded early in training are irreversibly overwritten, which severely degrades transfer performance (especially to unseen sub-distributions) and is not mitigated by model scaling alone. To study this, the authors propose the first systematic analytical framework integrating transfer evaluation, quantitative measurement of feature retention, and tracking of learning-trajectory dynamics. Empirical validation confirms the bottleneck's ubiquity across mainstream architectures (e.g., ResNet, ViT) and identifies data-distribution shift and training order as key modulating factors. The core contribution is the formal definition and empirical verification of this bottleneck mechanism, which directly challenges the assumption that larger models universally transfer better. The work offers theoretical insight and practical guidance for developing robust, evolvable feature representations.

📝 Abstract
Transfer learning is a cornerstone of modern machine learning, promising a way to adapt models pretrained on a broad mix of data to new tasks with minimal new data. However, a significant challenge remains in ensuring that transferred features are sufficient to handle unseen datasets, amplified by the difficulty of quantifying whether two tasks are "related". To address these challenges, we evaluate model transfer from a pretraining mixture to each of its component tasks, assessing whether pretrained features can match the performance of task-specific direct training. We identify a fundamental limitation in deep learning models -- an "information saturation bottleneck" -- where networks fail to learn new features once they encode similar competing features during training. When restricted to learning only a subset of key features during pretraining, models will permanently lose critical features for transfer and perform inconsistently on data distributions, even components of the training mixture. Empirical evidence from published studies suggests that this phenomenon is pervasive in deep learning architectures -- factors such as data distribution or ordering affect the features that current representation learning methods can learn over time. This study suggests that relying solely on large-scale networks may not be as effective as focusing on task-specific training, when available. We propose richer feature representations as a potential solution to better generalize across new datasets and, specifically, present existing methods alongside a novel approach, the initial steps towards addressing this challenge.
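The evaluation protocol the abstract describes, pretraining on a mixture of tasks and then checking whether the pretrained features match task-specific direct training on each component, can be sketched as a toy experiment. This is an illustrative sketch only: the synthetic tasks, the small MLP, and the linear-probe comparison are assumptions for demonstration, not the paper's actual experimental setup.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neural_network import MLPClassifier

rng = np.random.default_rng(0)

def make_task(informative_dim, n=600):
    """Toy component task: the label depends on a single feature."""
    X = rng.normal(size=(n, 5))
    y = (X[:, informative_dim] > 0).astype(int)
    return X, y

# Two component tasks whose discriminative features compete:
# task A hinges on feature 0, task B on feature 1.
XA, yA = make_task(0)
XB, yB = make_task(1)

# "Pretraining mixture": both tasks pooled into one training set.
X_mix = np.vstack([XA, XB])
y_mix = np.concatenate([yA, yB])

pre = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=0)
pre.fit(X_mix, y_mix)

def pretrained_features(X):
    # Frozen hidden-layer (ReLU) representation of the pretrained net.
    return np.maximum(0.0, X @ pre.coefs_[0] + pre.intercepts_[0])

# Per component task: linear probe on frozen pretrained features
# versus task-specific direct training on the raw inputs.
scores = {}
for name, (X, y) in {"A": (XA, yA), "B": (XB, yB)}.items():
    probe = LogisticRegression(max_iter=1000).fit(pretrained_features(X), y)
    direct = LogisticRegression(max_iter=1000).fit(X, y)
    scores[name] = {
        "transfer": probe.score(pretrained_features(X), y),
        "direct": direct.score(X, y),
    }
print(scores)
```

If the mixture representation retained both tasks' discriminative features, the "transfer" scores would track the "direct" baselines; a gap on either component is the kind of inconsistency the paper attributes to the saturation bottleneck.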
Problem

Research questions and friction points this paper is trying to address.

Identify the fundamental bottleneck limiting transfer in supervised pretraining
Assess whether pretrained features match the performance of task-specific direct training
Propose richer feature representations that generalize across new datasets
Innovation

Methods, ideas, or system contributions that make the work stand out.

Identifies an "information saturation bottleneck" in deep learning models
Proposes richer feature representations for better generalization
Suggests task-specific training, when available, over reliance on large-scale networks
Xingyu Alice Yang
Fundamental AI Research team, Meta NY
ml theory · interpretability · optimization
Jianyu Zhang
New York University, New York, United States
Léon Bottou
Fundamental AI Research (FAIR) at Meta, New York, United States