Conjuring Positive Pairs for Efficient Unification of Representation Learning and Image Synthesis

📅 2025-03-19
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses two key limitations in self-supervised learning: the decoupling of representation learning from image synthesis, and the high computational overhead of online tokenization. To this end, the authors propose Sorcen, a unified self-supervised framework that eliminates reliance on external tokenizers. Its core contributions are threefold: (1) the novel "Echo Contrast" mechanism, which constructs high-quality positive pairs by generating "echo samples" within a precomputed semantic token space; (2) a training paradigm that operates purely on precomputed tokens, completely removing real-time tokenization overhead; and (3) joint optimization of contrastive learning and masked image modeling, tightly integrating discriminative representation learning with generative capability. On ImageNet-1K, Sorcen outperforms prior unified SSL methods by 0.4% in linear-probing accuracy, 1.48 in FID for unconditional generation, 1.76% in few-shot classification, and 1.53% in transfer learning, while training 60.8% more efficiently.

📝 Abstract
While representation learning and generative modeling seek to understand visual data, unifying both domains remains unexplored. Recent Unified Self-Supervised Learning (SSL) methods have started to bridge the gap between both paradigms. However, they rely solely on semantic token reconstruction, which requires an external tokenizer during training -- introducing a significant overhead. In this work, we introduce Sorcen, a novel unified SSL framework, incorporating a synergic Contrastive-Reconstruction objective. Our Contrastive objective, "Echo Contrast", leverages the generative capabilities of Sorcen, eliminating the need for additional image crops or augmentations during training. Sorcen "generates" an echo sample in the semantic token space, forming the contrastive positive pair. Sorcen operates exclusively on precomputed tokens, eliminating the need for an online token transformation during training, thereby significantly reducing computational overhead. Extensive experiments on ImageNet-1k demonstrate that Sorcen outperforms the previous Unified SSL SoTA by 0.4%, 1.48 FID, 1.76%, and 1.53% on linear probing, unconditional image generation, few-shot learning, and transfer learning, respectively, while being 60.8% more efficient. Additionally, Sorcen surpasses previous single-crop MIM SoTA in linear probing and achieves SoTA performance in unconditional image generation, highlighting significant improvements and breakthroughs in Unified SSL models.
Problem

Research questions and friction points this paper is trying to address.

Representation learning and image synthesis remain decoupled in most SSL frameworks
External tokenizers and extra image crops/augmentations introduce significant training overhead
Unified methods lag on downstream tasks while remaining computationally expensive
Innovation

Methods, ideas, or system contributions that make the work stand out.

Sorcen integrates Contrastive-Reconstruction objective
Echo Contrast eliminates need for image augmentations
Sorcen operates on precomputed tokens, reducing overhead
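To make the Contrastive-Reconstruction idea concrete, here is a minimal NumPy sketch of how such a joint objective could be computed: an InfoNCE-style contrastive term where each sample's echo acts as its positive pair, plus a cross-entropy term over masked token positions. The function names, tensor shapes, and the echo stand-in (small additive noise) are illustrative assumptions, not Sorcen's implementation — in the paper, the echo sample is generated by the model itself in the precomputed semantic token space.

```python
import numpy as np

def info_nce(z, z_echo, temperature=0.1):
    """InfoNCE: each row's echo is its positive; other rows act as negatives."""
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    z_echo = z_echo / np.linalg.norm(z_echo, axis=1, keepdims=True)
    logits = z @ z_echo.T / temperature           # (B, B) similarity matrix
    logits -= logits.max(axis=1, keepdims=True)   # numerical stability
    log_probs = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return -np.mean(np.diag(log_probs))           # diagonal = positive pairs

def masked_token_ce(pred_logits, target_ids, mask):
    """Cross-entropy over masked positions only (token reconstruction)."""
    logits = pred_logits - pred_logits.max(axis=-1, keepdims=True)
    log_probs = logits - np.log(np.exp(logits).sum(axis=-1, keepdims=True))
    nll = -log_probs[np.arange(len(target_ids)), target_ids]
    return (nll * mask).sum() / mask.sum()

rng = np.random.default_rng(0)
B, D, V, T = 4, 16, 32, 8                 # batch, embed dim, vocab, tokens (hypothetical)
z = rng.normal(size=(B, D))               # encoder features for the original tokens
z_echo = z + 0.05 * rng.normal(size=(B, D))  # echo stand-in: slightly perturbed copy
pred_logits = rng.normal(size=(B * T, V))    # decoder logits over the token vocabulary
target_ids = rng.integers(0, V, B * T)       # precomputed semantic token ids
mask = (np.arange(B * T) % 2 == 0).astype(float)  # which positions were masked
loss = info_nce(z, z_echo) + masked_token_ce(pred_logits, target_ids, mask)
```

Because both terms operate on precomputed token ids and their embeddings, no tokenizer forward pass is needed inside the training loop — which is the source of the efficiency gain the paper reports.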
Imanol G. Estepa
Universitat de Barcelona
Self-supervised learning · Generative AI
Jesús M. Rodríguez-de-Vera
Universitat de Barcelona, Spain
Ignacio Sarasúa
NVIDIA Computing Spain
Bhalaji Nagarajan
Life Sciences Department, Barcelona Supercomputing Center
Deep Learning · Machine Learning · Computer Vision
Petia Radeva
Universitat de Barcelona, Spain