Masked Diffusion Captioning for Visual Feature Learning

📅 2025-10-30
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses two weaknesses of autoregressive image captioning as a visual pretraining objective: the visual learning signal decays across token positions, and strong results often depend on auxiliary objectives. To this end, the authors propose Masked Diffusion Captioning (MDC), a framework that integrates image-conditioned masked diffusion language modeling into visual representation learning. MDC jointly trains an image encoder and a diffusion-based language model to reconstruct randomly masked text tokens via denoising, yielding a vision-language learning signal whose strength does not depend on a token's position in the sequence. Notably, MDC employs masked diffusion captioning to learn general-purpose visual features without contrastive losses or additional auxiliary tasks. Under linear-probe evaluation, MDC produces visual features competitive with autoregressive and contrastive methods across multiple benchmark datasets, including COCO and Flickr30k, and across varying model scales.

📝 Abstract
We learn visual features by captioning images with an image-conditioned masked diffusion language model, a formulation we call masked diffusion captioning (MDC). During training, text tokens in each image-caption pair are masked at a randomly chosen ratio, and a decoder conditioned on visual features is trained to reconstruct the original text. After training, the learned visual features can be applied to downstream vision tasks. Unlike autoregressive captioning, the strength of the visual learning signal in MDC does not depend on each token's position in the sequence, reducing the need for auxiliary objectives. Linear probing experiments across a variety of academic-scale models and datasets show that the learned visual features are competitive with those produced by autoregressive and contrastive approaches.
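The training step described above (mask caption tokens at a randomly chosen ratio, then score the decoder's reconstruction only at the masked positions) can be sketched in NumPy. This is a minimal illustration, not the paper's implementation: the mask-token id, the uniform ratio schedule, and the function names are assumptions, and the real method backpropagates this loss through a visually conditioned decoder and image encoder.

```python
import numpy as np

MASK_ID = 0  # hypothetical id of the [MASK] token


def mask_caption(tokens, rng):
    """Corrupt a caption by masking each token independently at a
    ratio drawn uniformly at random (assumed schedule)."""
    tokens = np.asarray(tokens)
    ratio = rng.uniform(0.0, 1.0)            # fresh masking ratio per example
    mask = rng.random(tokens.shape) < ratio  # True where the token is hidden
    corrupted = np.where(mask, MASK_ID, tokens)
    return corrupted, mask


def masked_reconstruction_loss(logits, targets, mask):
    """Cross-entropy averaged over masked positions only, so the
    learning signal is the same regardless of token position.

    logits:  (T, V) decoder scores, here assumed conditioned on image features
    targets: (T,)   original token ids
    mask:    (T,)   boolean, True at masked positions
    """
    # numerically stable softmax over the vocabulary axis
    probs = np.exp(logits - logits.max(-1, keepdims=True))
    probs /= probs.sum(-1, keepdims=True)
    nll = -np.log(probs[np.arange(len(targets)), targets] + 1e-9)
    if mask.sum() == 0:
        return 0.0
    return float(nll[mask].mean())
```

Because the loss averages over whichever positions happen to be masked, early and late caption tokens contribute equally, which is the contrast with autoregressive captioning drawn in the abstract.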
Problem

Research questions and friction points this paper is trying to address.

Learning transferable visual features through masked diffusion image captioning
Reconstructing masked caption tokens with a visually conditioned decoder
Making the learned features competitive on downstream vision tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Masked diffusion model reconstructs image captions
Visual features trained via token reconstruction
Position-independent learning signal reduces the need for auxiliary objectives