Diff-Prompt: Diffusion-Driven Prompt Generator with Mask Supervision

📅 2025-04-30
📈 Citations: 2
Influential: 0
🤖 AI Summary
Existing prompt learning methods for fine-grained pixel-level tasks—such as referring expression comprehension—suffer from impoverished prompt representations and poor generalization, as they directly optimize prompt parameters via loss backpropagation. Method: This paper pioneers the integration of diffusion models into prompt generation, proposing a latent-space modeling framework: (i) Mask-VAE for mask-aware encoding; (ii) an enhanced Diffusion Transformer (DiT) for high-fidelity prompt synthesis; and (iii) a cross-modal semantic alignment fine-tuning mechanism to ensure rich, interpretable, and task-adaptive prompts. Contribution/Results: The framework achieves significant performance gains on referring expression comprehension, improving Recall@1 and Recall@5 by 8.87 and 14.05 points, respectively—outperforming state-of-the-art parameter-efficient fine-tuning approaches. It establishes a novel paradigm for generating semantically grounded, fine-grained visual prompts without requiring extensive architectural modifications or full-model retraining.

📝 Abstract
Prompt learning has demonstrated promising results in fine-tuning pre-trained multimodal models. However, the performance improvement is limited when applied to more complex and fine-grained tasks. The reason is that most existing methods directly optimize the parameters involved in the prompt generation process through loss backpropagation, which constrains the richness and specificity of the prompt representations. In this paper, we propose Diffusion-Driven Prompt Generator (Diff-Prompt), aiming to use the diffusion model to generate rich and fine-grained prompt information for complex downstream tasks. Specifically, our approach consists of three stages. In the first stage, we train a Mask-VAE to compress the masks into latent space. In the second stage, we leverage an improved Diffusion Transformer (DiT) to train a prompt generator in the latent space, using the masks for supervision. In the third stage, we align the denoising process of the prompt generator with the pre-trained model in the semantic space, and use the generated prompts to fine-tune the model. We conduct experiments on a complex pixel-level downstream task, referring expression comprehension, and compare our method with various parameter-efficient fine-tuning approaches. Diff-Prompt achieves a maximum improvement of 8.87 in R@1 and 14.05 in R@5 compared to the foundation model and also outperforms other state-of-the-art methods across multiple metrics. The experimental results validate the effectiveness of our approach and highlight the potential of using generative models for prompt generation. Code is available at https://github.com/Kelvin-ywc/diff-prompt.
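The second stage trains the prompt generator with standard latent-diffusion objectives: a Mask-VAE latent is corrupted by the forward noising process, and the DiT learns to predict the injected noise. A minimal pure-Python sketch of that forward step, with illustrative names and a linear beta schedule (an assumption; the paper does not specify its schedule here):

```python
import math
import random

def make_schedule(num_steps, beta_start=1e-4, beta_end=0.02):
    """Linear beta schedule and cumulative alpha-bar products (DDPM-style)."""
    betas = [beta_start + (beta_end - beta_start) * i / (num_steps - 1)
             for i in range(num_steps)]
    alpha_bars, prod = [], 1.0
    for b in betas:
        prod *= (1.0 - b)          # alpha_bar_t = prod_{s<=t} (1 - beta_s)
        alpha_bars.append(prod)
    return betas, alpha_bars

def noise_latent(z0, t, alpha_bars, rng=random):
    """q(z_t | z_0): corrupt a Mask-VAE latent z0 to timestep t.

    The denoiser (the DiT in stage two) is trained to predict eps."""
    ab = alpha_bars[t]
    eps = [rng.gauss(0.0, 1.0) for _ in z0]
    zt = [math.sqrt(ab) * z + math.sqrt(1.0 - ab) * e
          for z, e in zip(z0, eps)]
    return zt, eps
```

This only sketches the generic noising math the training loop would sit on; the paper's actual DiT architecture, conditioning, and alignment losses are not reproduced here.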
Problem

Research questions and friction points this paper is trying to address.

Enhancing prompt richness for complex fine-grained tasks
Overcoming limitations of direct prompt parameter optimization
Generating fine-grained prompts using diffusion models
Innovation

Methods, ideas, or system contributions that make the work stand out.

Uses diffusion model for prompt generation
Trains Mask-VAE for mask compression
Aligns denoising with pre-trained model
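The Mask-VAE in the bullets above compresses masks into a latent space; the standard ingredients for training such a VAE are the reparameterization trick and a closed-form Gaussian KL term. A toy sketch under those standard assumptions (names are illustrative, not from the paper's code):

```python
import math
import random

def reparameterize(mu, log_var, rng=random):
    """Sample z = mu + sigma * eps, keeping the sample differentiable
    w.r.t. mu and log_var in a real autodiff framework."""
    return [m + math.exp(0.5 * lv) * rng.gauss(0.0, 1.0)
            for m, lv in zip(mu, log_var)]

def kl_to_standard_normal(mu, log_var):
    """KL( N(mu, sigma^2) || N(0, 1) ), summed over latent dimensions."""
    return -0.5 * sum(1.0 + lv - m * m - math.exp(lv)
                      for m, lv in zip(mu, log_var))
```

The reconstruction term (decoding the latent back to a mask) would be added to this KL term to form the usual VAE objective.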
Weicai Yan
Zhejiang University
multimodal

Wang Lin
Zhejiang University
Computer Vision · Multi-Modal Learning · Video Understanding

Zirun Guo
Zhejiang University

Yejin Wang

Fangming Feng
Zhejiang University

Xiaoda Yang
Zhejiang University

Zehan Wang
Zhejiang University

Tao Jin
Zhejiang University