Unleashing Diffusion Transformers for Visual Correspondence by Modulating Massive Activations

📅 2025-05-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Diffusion Transformers (DiTs) suffer from massive activations concentrated at a few fixed dimensions in visual dense matching: a small subset of channels exhibits persistently high responses, carries little local discriminative information, and degrades the feature representation. This work is the first to systematically diagnose the root cause of this phenomenon. The authors propose DiTF, a training-free framework that localizes the anomalous channels via zero-initialized Adaptive Layer Norm (AdaLN-zero), then applies channel-wise normalization and a channel discard strategy to restore semantic discriminability in intermediate features. DiTF requires no fine-tuning yet significantly enhances feature discriminability: on SPair-71k and AP-10K-C.S. it achieves absolute improvements of +9.4% and +4.4%, respectively, substantially outperforming DINO- and Stable Diffusion-based baselines. The method establishes a new state of the art for DiTs in visual correspondence, demonstrating that architectural diagnosis and lightweight, inference-time adaptation can unlock superior performance without additional training overhead.

📝 Abstract
Pre-trained stable diffusion models (SD) have shown great advances in visual correspondence. In this paper, we investigate the capabilities of Diffusion Transformers (DiTs) for accurate dense correspondence. Distinct from SD, DiTs exhibit a critical phenomenon in which very few feature activations exhibit significantly larger values than others, known as *massive activations*, leading to uninformative representations and significant performance degradation for DiTs. The massive activations consistently concentrate at very few fixed dimensions across all image patch tokens, holding little local information. We trace these dimension-concentrated massive activations and find that such concentration can be effectively localized by the zero-initialized Adaptive Layer Norm (AdaLN-zero). Building on these findings, we propose Diffusion Transformer Feature (DiTF), a training-free framework designed to extract semantic-discriminative features from DiTs. Specifically, DiTF employs AdaLN to adaptively localize and normalize massive activations with channel-wise modulation. In addition, we develop a channel discard strategy to further eliminate the negative impacts from massive activations. Experimental results demonstrate that our DiTF outperforms both DINO and SD-based models and establishes a new state-of-the-art performance for DiTs in different visual correspondence tasks (e.g., with +9.4% on SPair-71k and +4.4% on AP-10K-C.S.).
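The localize-and-normalize step described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the outlier criterion (mean absolute magnitude exceeding a multiple of the median channel magnitude) and the `ratio` threshold are assumptions, and plain per-channel standardization stands in for the paper's AdaLN-driven channel-wise modulation.

```python
import numpy as np

def localize_massive_channels(features, ratio=10.0):
    """Flag channels whose mean |activation| dwarfs the typical channel.

    features: (num_tokens, dim) patch-token features from a DiT block.
    ratio: illustrative outlier threshold (an assumption, not the paper's value).
    """
    channel_mag = np.abs(features).mean(axis=0)            # (dim,) per-channel magnitude
    return np.where(channel_mag > ratio * np.median(channel_mag))[0]

def modulate(features, massive_idx):
    """Normalize the flagged channels token-wise (a simple stand-in for the
    paper's AdaLN channel-wise scale-and-shift modulation)."""
    out = features.copy()
    for c in massive_idx:
        col = out[:, c]
        out[:, c] = (col - col.mean()) / (col.std() + 1e-6)  # zero-mean, unit-variance
    return out
```

Because the massive activations sit at the same few dimensions across all patch tokens, a per-channel statistic over tokens is enough to localize them; after modulation their magnitude is comparable to ordinary channels.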
Problem

Research questions and friction points this paper is trying to address.

Address massive activations that degrade Diffusion Transformer performance
Develop training-free framework for semantic-discriminative feature extraction
Enhance visual correspondence accuracy in Diffusion Transformers
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses Diffusion Transformers for dense correspondence
Employs AdaLN-zero to localize massive activations
Develops channel discard strategy for better features