U-REPA: Aligning Diffusion U-Nets to ViTs

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the previously unverified and challenging adaptation of Representation Alignment (REPA) from Diffusion Transformers to Diffusion U-Nets paired with ViT visual encoders. To overcome three key obstacles (functional heterogeneity across U-Net blocks, spatial dimension mismatch induced by downsampling, and token-level alignment failure between ViT and U-Net features), the authors propose U-REPA, a novel REPA paradigm. U-REPA aligns features at the mid-layer skip connection, employs MLP-based dimensional lifting coupled with bilinear upsampling for adaptive spatial matching, and designs a manifold loss grounded in pairwise relative similarity to enforce latent-space structural consistency. This is the first work to successfully extend REPA to standard Diffusion U-Nets. On ImageNet 256×256, U-REPA achieves FID < 1.5 within only 200 training epochs, doubling convergence speed, and significantly outperforms baseline methods in generation quality.
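The spatial matching step described above upsamples the (coarser) U-Net feature grid to the ViT token grid. As a minimal illustration, here is a pure-Python sketch of 2x bilinear upsampling on a 2D grid; the function name `bilinear_upsample` and the align-corners coordinate mapping are assumptions for illustration, not the paper's actual implementation (which also includes the MLP lifting step).

```python
def bilinear_upsample(grid, factor=2):
    """Bilinearly upsample a 2D grid (list of lists of floats) by `factor`.

    Uses an align-corners style mapping: output corners coincide with
    input corners. This is only a sketch of the interpolation idea.
    """
    h, w = len(grid), len(grid[0])
    H, W = h * factor, w * factor
    out = []
    for y in range(H):
        # Map the output row back into input coordinates.
        sy = y * (h - 1) / (H - 1) if H > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for x in range(W):
            sx = x * (w - 1) / (W - 1) if W > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            # Interpolate horizontally on the two bracketing rows, then vertically.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bot = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out
```

In practice this would be applied per channel to the MLP-lifted U-Net features (e.g. with `torch.nn.functional.interpolate(..., mode="bilinear")`) so their token count matches the ViT encoder's.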

📝 Abstract
Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture, which converges faster than DiTs. Adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) the space gap between U-Net and ViT features hinders the effectiveness of tokenwise alignment. To address these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: First, we observe that, due to skip connections, the middle stage of the U-Net is the best alignment option. Second, we upsample U-Net features after passing them through MLPs. Third, we find tokenwise similarity alignment difficult to achieve, and therefore introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence. With a CFG guidance interval, U-REPA reaches FID < 1.5 in 200 epochs or 1M iterations on ImageNet 256×256, and needs only half the total epochs to perform better than REPA. Code is available at https://github.com/YuchuanTian/U-REPA.
Problem

Research questions and friction points this paper is trying to address.

Aligning Diffusion U-Nets with ViT visual encoders
Addressing spatial-dimension inconsistencies in U-Net architectures
Improving tokenwise alignment between U-Net and ViT features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns U-Net middle stage with ViT features
Lifts U-Net features through MLPs, then upsamples them
Uses manifold loss for sample similarity regularization
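The manifold loss listed above regularizes relative similarity between samples rather than matching tokens one-to-one. Below is a pure-Python sketch of that idea, assuming a simple formulation: build the pairwise cosine-similarity matrix over a batch for both feature sets and penalize their squared difference. The function names and the exact loss form are illustrative assumptions, not the paper's code.

```python
def cosine(u, v):
    """Cosine similarity between two feature vectors (plain lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def manifold_loss(unet_feats, vit_feats):
    """Mean squared difference between the pairwise-similarity structures
    of U-Net features and ViT features over a batch (hypothetical sketch)."""
    n = len(unet_feats)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip self-similarity (always 1)
            d = cosine(unet_feats[i], unet_feats[j]) - cosine(vit_feats[i], vit_feats[j])
            total += d * d
            count += 1
    return total / count
```

The loss is zero exactly when both feature sets induce the same relative geometry across the batch, which is the structural-consistency property the summary describes.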
Yuchuan Tian
State Key Lab of General AI, School of Intelligence Science and Technology, Peking University
Hanting Chen
Noah's Ark Lab, Huawei
Mengyu Zheng
The University of Sydney
Yuchen Liang
The Ohio State University
Chao Xu
State Key Lab of General AI, School of Intelligence Science and Technology, Peking University
Yunhe Wang
Noah's Ark Lab, Huawei Technologies