Modest-Align: Data-Efficient Alignment for Vision-Language Models

📅 2025-10-24
📈 Citations: 0
Influential: 0
🤖 AI Summary
Cross-modal alignment models (e.g., CLIP) suffer from overconfidence and performance degradation under low-resource, low-quality data, particularly due to ambiguous or weakly correlated image–text pairs. Method: We propose a robust and efficient contrastive learning framework that explicitly models input uncertainty via stochastic input perturbations and smooths similarity distributions in the embedding space, suppressing overconfident predictions on noisy samples during contrastive training. Contribution/Results: The method enables lightweight training on only a small number of image–text pairs, significantly improving few-shot generalization. Experiments across multiple cross-modal retrieval benchmarks demonstrate comparable or superior performance to fully trained CLIP while using merely 1% of CLIP's training data and approximately 0.17% of its GPU training time, validating its efficiency and robustness under data scarcity and noise.
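The perturb-then-contrast recipe above can be sketched in a few lines. This is a minimal NumPy illustration, not the paper's implementation: the noise scale `sigma`, the temperature, and all function names are hypothetical choices for the sake of the example.

```python
import numpy as np

def l2_normalize(x, eps=1e-8):
    # Project embeddings onto the unit sphere, as in CLIP-style alignment.
    return x / (np.linalg.norm(x, axis=-1, keepdims=True) + eps)

def random_perturbation(emb, sigma=0.05, rng=None):
    # Random Perturbation (sketch): add small Gaussian noise to an embedding
    # batch to simulate input uncertainty, then renormalize.
    # `sigma` is a hypothetical noise scale, not a value from the paper.
    rng = np.random.default_rng(rng)
    return l2_normalize(emb + sigma * rng.standard_normal(emb.shape))

def log_softmax(z):
    # Numerically stable row-wise log-softmax.
    z = z - z.max(axis=1, keepdims=True)
    return z - np.log(np.exp(z).sum(axis=1, keepdims=True))

def clip_contrastive_loss(img, txt, temperature=0.07):
    # Symmetric InfoNCE loss: matched image-text pairs sit on the diagonal.
    logits = img @ txt.T / temperature
    loss_i2t = -np.diag(log_softmax(logits)).mean()
    loss_t2i = -np.diag(log_softmax(logits.T)).mean()
    return 0.5 * (loss_i2t + loss_t2i)

# Toy usage: 4 image-text pairs with 8-dim embeddings; the text embeddings
# are deliberately noisy copies of the images to mimic weak alignment.
rng = np.random.default_rng(0)
img = l2_normalize(rng.standard_normal((4, 8)))
txt = l2_normalize(img + 0.1 * rng.standard_normal((4, 8)))
loss = clip_contrastive_loss(random_perturbation(img, rng=1),
                             random_perturbation(txt, rng=2))
print(float(loss))
```

Because the perturbation is resampled every step, each pair is contrasted from slightly different positions on the sphere, which discourages the model from collapsing onto overconfident point matches.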

📝 Abstract
Cross-modal alignment aims to map heterogeneous modalities into a shared latent space, as exemplified by models like CLIP, which benefit from large-scale image-text pretraining for strong recognition capabilities. However, when operating in resource-constrained settings with limited or low-quality data, these models often suffer from overconfidence and degraded performance due to the prevalence of ambiguous or weakly correlated image-text pairs. Current contrastive learning approaches, which rely on single positive pairs, further exacerbate this issue by reinforcing overconfidence on uncertain samples. To address these challenges, we propose Modest-Align, a lightweight alignment framework designed for robustness and efficiency. Our approach leverages two complementary strategies -- Random Perturbation, which introduces controlled noise to simulate uncertainty, and Embedding Smoothing, which calibrates similarity distributions in the embedding space. These mechanisms collectively reduce overconfidence and improve performance on noisy or weakly aligned samples. Extensive experiments across multiple benchmark datasets demonstrate that Modest-Align outperforms state-of-the-art methods in retrieval tasks, achieving competitive results with over 100x less training data and 600x less GPU time than CLIP. Our method offers a practical and scalable solution for cross-modal alignment in real-world, low-resource scenarios.
Problem

Research questions and friction points this paper is trying to address.

Addresses overconfidence in vision-language models with limited data
Improves cross-modal alignment robustness using noise and smoothing
Enables efficient training with minimal data and computational resources
Innovation

Methods, ideas, or system contributions that make the work stand out.

Random Perturbation introduces noise to simulate uncertainty
Embedding Smoothing calibrates similarity distributions in embedding space
Lightweight framework achieves robust alignment with minimal data
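The Embedding Smoothing idea can be illustrated by replacing the one-hot contrastive target with a softened similarity distribution. This is a hedged sketch under assumptions: the blending weight `alpha`, the smoothing temperature `tau`, and the exact form of the target are illustrative, not taken from the paper.

```python
import numpy as np

def softmax(z, axis=-1):
    # Numerically stable row-wise softmax.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def smoothed_targets(sim, alpha=0.2, tau=0.2):
    # Embedding Smoothing (sketch): instead of forcing a one-hot target on
    # the single positive pair, blend it with a temperature-softened
    # similarity distribution so weakly correlated off-diagonal pairs keep
    # some probability mass. `alpha` and `tau` are hypothetical knobs.
    n = sim.shape[0]
    soft = softmax(sim / tau, axis=1)
    return (1.0 - alpha) * np.eye(n) + alpha * soft

def smoothed_contrastive_loss(img, txt, temperature=0.07, alpha=0.2):
    sim = img @ txt.T
    logits = sim / temperature
    log_probs = logits - logits.max(axis=1, keepdims=True)
    log_probs = log_probs - np.log(
        np.exp(log_probs).sum(axis=1, keepdims=True))
    # Cross-entropy against the smoothed (soft) target rows.
    targets = smoothed_targets(sim, alpha=alpha)
    return -(targets * log_probs).sum(axis=1).mean()

# Toy usage on random unit embeddings.
rng = np.random.default_rng(0)
img = rng.standard_normal((4, 8))
img /= np.linalg.norm(img, axis=-1, keepdims=True)
txt = rng.standard_normal((4, 8))
txt /= np.linalg.norm(txt, axis=-1, keepdims=True)
loss = smoothed_contrastive_loss(img, txt)
print(float(loss))
```

Compared with plain label smoothing, tying the soft target to the measured similarities means ambiguous pairs are down-weighted adaptively rather than uniformly, which is one plausible reading of "calibrating similarity distributions in the embedding space."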