🤖 AI Summary
In prototype-based self-supervised learning, partial prototype collapse -- where multiple prototypes converge to nearly identical representations -- is a prevalent issue that undermines the prototypes' ability to guide the encoder toward diverse features. This paper identifies, for the first time, the joint optimization of the encoder and prototypes as the root cause of the collapse. To address this, we propose a fully decoupled training paradigm in which prototype learning is strictly separated from encoder optimization: prototypes are updated independently as a Gaussian Mixture Model (GMM) fitted with an online Expectation-Maximization (EM) algorithm, requiring no explicit regularization or over-parameterization. This mechanism eliminates collapse at its source, substantially enhancing prototype diversity and representation discriminability. Empirically, our approach yields more stable and consistently superior performance across downstream tasks, outperforming state-of-the-art methods without architectural or loss-function modifications.
📝 Abstract
Prototypical self-supervised learning methods consistently suffer from partial prototype collapse, where multiple prototypes converge to nearly identical representations. This undermines their central purpose -- providing diverse and informative targets to guide encoders toward rich representations -- and has led practitioners to over-parameterize prototype sets or add ad-hoc regularizers, which mitigate symptoms rather than address the root cause. We empirically trace the collapse to the joint optimization of encoders and prototypes, which encourages a type of shortcut learning: early in training prototypes drift toward redundant representations that minimize loss without necessarily enhancing representation diversity. To break the joint optimization, we introduce a fully decoupled training strategy that learns prototypes and encoders under separate objectives. Concretely, we model prototypes as a Gaussian mixture updated with an online EM-style procedure, independent of the encoder's loss. This simple yet principled decoupling eliminates prototype collapse without explicit regularization and yields consistently diverse prototypes and stronger downstream performance.
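The decoupling described above can be illustrated with a minimal sketch: prototypes are treated as the means of a spherical Gaussian mixture and updated by an online EM step on detached encoder embeddings, so no gradient from the encoder's loss ever touches them. Everything here (function name, spherical/fixed-variance assumption, learning-rate-style interpolation, hyperparameter values) is an illustrative assumption, not the paper's exact procedure.

```python
import numpy as np

def online_em_step(z, mu, pi, lr=0.05, sigma2=0.1):
    """One online EM update of Gaussian-mixture prototypes (illustrative sketch).

    z:  (B, D) batch of detached encoder embeddings
    mu: (K, D) prototype means
    pi: (K,)   mixture weights
    """
    # E-step: responsibilities under spherical Gaussians with variance sigma2
    d2 = ((z[:, None, :] - mu[None, :, :]) ** 2).sum(-1)   # (B, K) squared distances
    logits = np.log(pi + 1e-8) - d2 / (2.0 * sigma2)
    logits -= logits.max(axis=1, keepdims=True)            # numerical stability
    r = np.exp(logits)
    r /= r.sum(axis=1, keepdims=True)                      # (B, K) responsibilities

    # M-step: interpolate running parameters toward batch statistics
    nk = r.sum(axis=0) + 1e-8                              # (K,) soft counts
    mu_batch = (r.T @ z) / nk[:, None]                     # batch means per component
    mu = (1.0 - lr) * mu + lr * mu_batch
    pi = (1.0 - lr) * pi + lr * (nk / len(z))
    return mu, pi
```

Because the update depends only on batch statistics, the encoder can still be trained against these prototypes (e.g. via an assignment-prediction loss) while the prototypes themselves follow this separate EM objective, which is the sense in which the two optimizations are decoupled.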