U-REPA: Aligning Diffusion U-Nets to ViTs

📅 2025-03-24
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
This work addresses the previously unverified and challenging adaptation of Representation Alignment (REPA) from Diffusion Transformers to Diffusion U-Nets paired with ViT visual encoders. To overcome three key obstacles (functional heterogeneity across U-Net blocks, spatial dimension mismatch induced by downsampling, and token-level alignment failure between ViT and U-Net features), the authors propose U-REPA, a novel REPA paradigm. U-REPA aligns features at the mid-layer skip connection, employs MLP-based dimensional lifting coupled with bilinear upsampling for adaptive spatial matching, and designs a manifold loss grounded in pairwise relative similarity to enforce latent-space structural consistency. This is the first work to successfully extend REPA to standard Diffusion U-Nets. On ImageNet 256×256, U-REPA achieves FID < 1.5 within only 200 training epochs, doubling convergence speed, and significantly outperforms baseline methods in generation quality.
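The spatial matching step described above upsamples the (coarser) U-Net feature grid to the ViT token grid. As a minimal illustration, here is a pure-Python sketch of 2x bilinear upsampling on a 2D grid; the function name `bilinear_upsample` and the align-corners coordinate mapping are assumptions for illustration, not the paper's actual implementation (which also includes the MLP lifting step).

```python
def bilinear_upsample(grid, factor=2):
    """Bilinearly upsample a 2D grid (list of lists of floats) by `factor`.

    Uses an align-corners style mapping: output corners coincide with
    input corners. This is only a sketch of the interpolation idea.
    """
    h, w = len(grid), len(grid[0])
    H, W = h * factor, w * factor
    out = []
    for y in range(H):
        # Map the output row back into input coordinates.
        sy = y * (h - 1) / (H - 1) if H > 1 else 0.0
        y0 = int(sy)
        y1 = min(y0 + 1, h - 1)
        wy = sy - y0
        row = []
        for x in range(W):
            sx = x * (w - 1) / (W - 1) if W > 1 else 0.0
            x0 = int(sx)
            x1 = min(x0 + 1, w - 1)
            wx = sx - x0
            # Interpolate horizontally on the two bracketing rows, then vertically.
            top = grid[y0][x0] * (1 - wx) + grid[y0][x1] * wx
            bot = grid[y1][x0] * (1 - wx) + grid[y1][x1] * wx
            row.append(top * (1 - wy) + bot * wy)
        out.append(row)
    return out
```

In practice this would be applied per channel to the MLP-lifted U-Net features (e.g. with `torch.nn.functional.interpolate(..., mode="bilinear")`) so their token count matches the ViT encoder's.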

📝 Abstract
Representation Alignment (REPA), which aligns Diffusion Transformer (DiT) hidden states with ViT visual encoders, has proven highly effective in DiT training, demonstrating superior convergence properties, but it has not been validated on the canonical diffusion U-Net architecture, which converges faster than DiTs. Adapting REPA to U-Net architectures presents unique challenges: (1) different block functionalities necessitate revised alignment strategies; (2) spatial-dimension inconsistencies emerge from U-Net's spatial downsampling operations; (3) the space gap between U-Net and ViT features hinders the effectiveness of tokenwise alignment. To address these challenges, we propose U-REPA, a representation alignment paradigm that bridges U-Net hidden states and ViT features as follows: First, we observe that, due to skip connections, the middle stage of the U-Net is the best alignment option. Second, we upsample U-Net features after passing them through MLPs. Third, we find tokenwise similarity alignment difficult to achieve, and therefore introduce a manifold loss that regularizes the relative similarity between samples. Experiments indicate that the resulting U-REPA achieves excellent generation quality and greatly accelerates convergence. With a CFG guidance interval, U-REPA reaches FID < 1.5 in 200 epochs or 1M iterations on ImageNet 256×256, and needs only half the total epochs to perform better than REPA. Code is available at https://github.com/YuchuanTian/U-REPA.
Problem

Research questions and friction points this paper is trying to address.

Aligning Diffusion U-Nets with ViT visual encoders
Addressing spatial-dimension inconsistencies in U-Net architectures
Improving tokenwise alignment between U-Net and ViT features
Innovation

Methods, ideas, or system contributions that make the work stand out.

Aligns U-Net middle stage with ViT features
Lifts U-Net features through MLPs, then upsamples them
Uses manifold loss for sample similarity regularization
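The manifold loss listed above regularizes relative similarity between samples rather than matching tokens one-to-one. Below is a pure-Python sketch of that idea, assuming a simple formulation: build the pairwise cosine-similarity matrix over a batch for both feature sets and penalize their squared difference. The function names and the exact loss form are illustrative assumptions, not the paper's code.

```python
def cosine(u, v):
    """Cosine similarity between two feature vectors (plain lists of floats)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = sum(a * a for a in u) ** 0.5
    nv = sum(b * b for b in v) ** 0.5
    return dot / (nu * nv)

def manifold_loss(unet_feats, vit_feats):
    """Mean squared difference between the pairwise-similarity structures
    of U-Net features and ViT features over a batch (hypothetical sketch)."""
    n = len(unet_feats)
    total, count = 0.0, 0
    for i in range(n):
        for j in range(n):
            if i == j:
                continue  # skip self-similarity (always 1)
            d = cosine(unet_feats[i], unet_feats[j]) - cosine(vit_feats[i], vit_feats[j])
            total += d * d
            count += 1
    return total / count
```

The loss is zero exactly when both feature sets induce the same relative geometry across the batch, which is the structural-consistency property the summary describes.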
Yuchuan Tian
State Key Lab of General AI, School of Intelligence Science and Technology, Peking University
Hanting Chen
Noah's Ark Lab, Huawei
Mengyu Zheng
The University of Sydney
Yuchen Liang
The Ohio State University
Chao Xu
State Key Lab of General AI, School of Intelligence Science and Technology, Peking University
Yunhe Wang
Noah's Ark Lab, Huawei Technologies