🤖 AI Summary
Generative diffusion models face a fundamental mismatch in discriminative tasks (e.g., referring image segmentation): their inherently tolerant generation mechanism conflicts with the strict accuracy requirements imposed on intermediate inference steps. This work systematically characterizes the non-uniform impact of denoising timesteps on perception quality, a previously unexplored aspect of diffusion-based perception. It proposes three key techniques: (1) a timestep-aware loss that dynamically weights supervision across diffusion steps; (2) diffusion-tailored data augmentation that counters training-denoising distribution shifts; and (3) a prompt-driven interactive backward-sampling mechanism enabling iterative, multi-round refinement. All techniques operate through discriminative fine-tuning and require no architectural modifications. Evaluated on depth estimation, referring image segmentation, and generalist perception benchmarks, the approach achieves state-of-the-art performance, significantly improving multimodal discriminative accuracy and interaction robustness.
📝 Abstract
With the success of image generation, generative diffusion models are increasingly adopted for discriminative tasks, as pixel generation provides a unified perception interface. However, directly repurposing the generative denoising process for discriminative objectives reveals critical gaps rarely addressed previously. Generative models tolerate intermediate sampling errors if the final distribution remains plausible, but discriminative tasks require rigorous accuracy throughout, as evidenced in challenging multi-modal tasks like referring image segmentation. Motivated by this gap, we analyze and enhance alignment between generative diffusion processes and perception tasks, focusing on how perception quality evolves during denoising. We find: (1) earlier denoising steps contribute disproportionately to perception quality, prompting us to propose tailored learning objectives reflecting varying timestep contributions; (2) later denoising steps show unexpected perception degradation, highlighting sensitivity to training-denoising distribution shifts, addressed by our diffusion-tailored data augmentation; and (3) generative processes uniquely enable interactivity, serving as controllable user interfaces adaptable to correctional prompts in multi-round interactions. Our insights significantly improve diffusion-based perception models without architectural changes, achieving state-of-the-art performance on depth estimation, referring image segmentation, and generalist perception tasks. Code available at https://github.com/ziqipang/ADDP.
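The timestep-aware objective described above can be sketched as a reweighted denoising loss, where earlier denoising steps (large `t`) receive larger weights to reflect their disproportionate contribution to perception quality. The schedule `timestep_weight` and its `gamma` parameter below are illustrative assumptions, not the paper's exact formulation:

```python
import numpy as np

def timestep_weight(t, T=1000, gamma=2.0):
    """Hypothetical weighting schedule: earlier denoising steps (large t)
    get larger weights; gamma controls how quickly the weight decays
    toward t=0. A small floor keeps late steps from being ignored."""
    return (np.asarray(t, dtype=float) / T) ** gamma + 0.1

def weighted_diffusion_loss(pred, target, t, T=1000):
    """Per-sample MSE between the model's prediction and the target map
    (e.g., a depth or segmentation map), reweighted by timestep."""
    per_sample = np.mean((pred - target) ** 2,
                         axis=tuple(range(1, pred.ndim)))
    return float(np.mean(timestep_weight(t, T) * per_sample))

# Usage: a noisy early step (t=900) dominates the batch loss relative
# to a nearly clean late step (t=100).
pred = np.zeros((2, 4))
target = np.ones((2, 4))
loss = weighted_diffusion_loss(pred, target, np.array([100, 900]))
```

Under this sketch, supervision is concentrated where denoising errors matter most for the final perception output, while the additive floor preserves gradient signal at late steps.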