🤖 AI Summary
Existing diffusion models struggle to achieve fine-grained conditional control at inference because text prompts are ambiguous and category labels are coarse. To address this limitation, this work proposes the REPA-G framework, which, for the first time, integrates multi-scale visual features from self-supervised pre-trained models into test-time guidance of diffusion models. By optimizing the feature similarity between generated images and target semantics in latent space, REPA-G enables flexible and precise control ranging from local textures to global structure, while also supporting the composition of multiple concepts. Experiments on ImageNet and COCO show that the proposed method significantly improves both the semantic fidelity and the diversity of generated images, demonstrating its effectiveness for fine-grained conditional generation.
📝 Abstract
While representation alignment with self-supervised models has been shown to improve diffusion model training, its potential for enhancing inference-time conditioning remains largely unexplored. We introduce Representation-Aligned Guidance (REPA-G), a framework that leverages these aligned representations, with their rich semantic properties, to enable test-time conditioning on features during generation. By optimizing a similarity objective (the potential) at inference, we steer the denoising process toward a conditioning representation extracted from a pre-trained feature extractor. Our method provides versatile control at multiple scales, ranging from fine-grained texture matching via single patches to broad semantic guidance using global image feature tokens. We further extend this to multi-concept composition, allowing for the faithful combination of distinct concepts. REPA-G operates entirely at inference time, offering a flexible and precise alternative to often ambiguous text prompts or coarse class labels. We theoretically justify how this guidance enables sampling from the potential-induced tilted distribution. Quantitative results on ImageNet and COCO demonstrate that our approach achieves high-quality, diverse generations. Code is available at https://github.com/valeoai/REPA-G.
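The core mechanism described above, ascending the gradient of a similarity potential to steer a latent toward a target feature, can be sketched in a toy form. The linear "feature extractor" `W`, the target feature `g`, and the step size are all illustrative assumptions, not the paper's actual encoder or sampler; a real implementation would add this guidance term to each denoising update.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins (assumptions, not the paper's components):
# W: a frozen linear "feature extractor"; g: a target feature vector
# extracted from a reference image or patch.
W = rng.standard_normal((16, 32))
g = rng.standard_normal(16)

def features(x):
    """Map a latent x to feature space."""
    return W @ x

def cosine(f, g):
    """Cosine similarity, the similarity potential to maximize."""
    return float(f @ g / (np.linalg.norm(f) * np.linalg.norm(g)))

def grad_potential(x):
    """Analytic gradient of cos(W x, g) with respect to the latent x."""
    f = W @ x
    nf, ng = np.linalg.norm(f), np.linalg.norm(g)
    grad_f = g / (nf * ng) - (f @ g) * f / (nf**3 * ng)
    return W.T @ grad_f

def guided_step(x, lr=0.1):
    # In the full method this gradient would be combined with the
    # denoising update; here we show only the guidance term.
    return x + lr * grad_potential(x)

x = rng.standard_normal(32)       # initial noisy latent
before = cosine(features(x), g)
for _ in range(50):
    x = guided_step(x)
after = cosine(features(x), g)    # similarity to the target has increased
```

Repeated guidance steps drive the latent's features toward the target, which is the sense in which the sampler draws from the potential-tilted distribution.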