Test-Time Conditioning with Representation-Aligned Visual Features

📅 2026-02-03
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing diffusion models struggle to provide fine-grained conditional control at inference because text prompts are often ambiguous and class labels too coarse. To address this, the paper proposes REPA-G, a framework that integrates multi-scale visual features from self-supervised pre-trained models into the test-time guidance of diffusion models. By optimizing the feature similarity between the generated image and a target representation during denoising, REPA-G enables flexible, precise control ranging from local textures to global structure, and further supports the composition of multiple concepts. Experiments on ImageNet and COCO show that the method improves both the semantic fidelity and the diversity of generated images, demonstrating its effectiveness for fine-grained conditional generation.

📝 Abstract
While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, with rich semantic properties, to enable test-time conditioning from features in generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioned representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at https://github.com/valeoai/REPA-G.
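The guidance mechanism described in the abstract — steering the denoising trajectory along the gradient of a feature-similarity potential toward a target representation — can be illustrated with a deliberately minimal sketch. Everything here is an illustrative assumption, not the paper's actual implementation: the "feature extractor" is a fixed random linear map standing in for a self-supervised encoder, the "denoiser" is a toy contraction toward the prior mean, and all hyperparameters are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins (assumptions, not the paper's models): a fixed linear
# map as the "feature extractor" and a contraction toward the prior
# mean as the "denoiser".
D, F = 16, 8
W = rng.standard_normal((F, D)) / np.sqrt(D)  # hypothetical encoder weights

def features(x):
    return W @ x

def cosine_sim(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def sim_grad(x, target):
    # Gradient of the cosine-similarity potential w.r.t. x for the
    # linear encoder: d/df [f.t / (|f||t|)] pulled back through W.
    f = features(x)
    nf, nt = np.linalg.norm(f), np.linalg.norm(target)
    g_f = target / (nf * nt + 1e-8) - (f @ target) * f / (nf**3 * nt + 1e-8)
    return W.T @ g_f

def guided_sampling(target, steps=200, denoise_rate=0.05, guidance=0.2):
    x = rng.standard_normal(D)
    for _ in range(steps):
        x = x + denoise_rate * (-x)            # toy denoising step
        x = x + guidance * sim_grad(x, target)  # test-time guidance step
    return x

# Target representation extracted from a reference sample.
target = features(rng.standard_normal(D))
x0 = rng.standard_normal(D)          # unguided baseline draw
x_guided = guided_sampling(target)
print(cosine_sim(features(x_guided), target))
```

The only point of the sketch is structural: guidance is purely additive at sampling time (no retraining), so swapping the target feature changes what the sampler converges to. After the loop, the guided sample's features align far more closely with the target than a random draw's do.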
Problem

Research questions and friction points this paper is trying to address.

test-time conditioning
representation alignment
diffusion models
visual features
inference guidance
Innovation

Methods, ideas, or system contributions that make the work stand out.

Representation Alignment
Test-Time Conditioning
Diffusion Models
Feature-Guided Generation
Multi-Concept Composition
Nicolas Sereyjol-Garros
Valeo.ai, Paris, France
Ellington Kirby
Valeo.ai, Paris, France
Victor Letzelter
Valeo.ai, Paris, France; LTCI, Télécom Paris, Institut Polytechnique de Paris, France
Victor Besnier
Valeo.ai
Deep Learning, Computer Vision
Nermin Samet
Valeo.ai, Paris, France