OViP: Online Vision-Language Preference Learning

📅 2025-05-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Large Vision-Language Models (LVLMs) suffer from pervasive vision-text hallucinations, and existing multimodal Direct Preference Optimization (DPO) methods rely on static, unrealistic negative samples, limiting their effectiveness at suppressing hallucination. This paper proposes a failure-driven online preference learning framework: it dynamically captures the model's own hallucinated outputs to construct positive-negative text pairs and employs diffusion models to synthesize semantically aligned negative images, enabling bidirectional image-text preference alignment. Its core innovations are (i) the first online, failure-driven multimodal DPO paradigm, which generates highly relevant, dynamic negative samples to overcome the limitations of static sampling; and (ii) a redefined evaluation protocol that balances hallucination reduction against generation fidelity. Experiments show a significant reduction in hallucination rate on dedicated hallucination benchmarks while maintaining strong performance on general VQA and generation tasks, demonstrating the critical role of authentic, high-fidelity training signals.

📝 Abstract
Large vision-language models (LVLMs) remain vulnerable to hallucination, often generating content misaligned with visual inputs. While recent approaches advance multi-modal Direct Preference Optimization (DPO) to mitigate hallucination, they typically rely on predefined or randomly edited negative samples that fail to reflect actual model errors, limiting training efficacy. In this work, we propose an Online Vision-language Preference Learning (OViP) framework that dynamically constructs contrastive training data based on the model's own hallucinated outputs. By identifying semantic differences between sampled response pairs and synthesizing negative images using a diffusion model, OViP generates more relevant supervision signals in real time. This failure-driven training enables adaptive alignment of both textual and visual preferences. Moreover, we refine existing evaluation protocols to better capture the trade-off between hallucination suppression and expressiveness. Experiments on hallucination and general benchmarks demonstrate that OViP effectively reduces hallucinations while preserving core multi-modal capabilities.
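The abstract's pipeline (sample responses from the current model, contrast them by semantic difference, then synthesize a matching negative image with a diffusion model) can be illustrated with a minimal sketch. All class and function names below are hypothetical stand-ins, not the paper's actual API; the toy stubs only mimic the data flow, with an LVLM sampler, a hallucination-aware ranker, and a diffusion generator swapped in for real components:

```python
class ToyLVLM:
    """Stub for the current policy model: returns canned responses,
    each paired with an illustrative hallucination score."""
    def sample(self, image, prompt, n=4):
        # Lower score = fewer hallucinations (toy convention).
        return [(f"response_{i}", float(i)) for i in range(n)]

class ToyDiffusion:
    """Stub for the diffusion model that renders a hallucinated
    description into a semantically aligned negative image."""
    def generate(self, text):
        return f"img_for::{text}"

def pick_contrastive_pair(scored_responses):
    """Chosen = lowest hallucination score, rejected = highest."""
    ranked = sorted(scored_responses, key=lambda r: r[1])
    return ranked[0][0], ranked[-1][0]

def build_preference_batch(model, diffusion, samples):
    """One online round: sample from the model's own outputs,
    form a text preference pair, and synthesize a negative image."""
    batch = []
    for image, prompt in samples:
        responses = model.sample(image, prompt)
        chosen, rejected = pick_contrastive_pair(responses)
        batch.append({
            "prompt": prompt,
            "image": image,                    # positive image
            "chosen": chosen,                  # preferred response
            "rejected": rejected,              # the model's own failure
            "negative_image": diffusion.generate(rejected),
        })
    return batch
```

Because the negatives come from the model's current failures rather than a fixed dataset, each training round targets the errors the model actually makes at that point.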
Problem

Research questions and friction points this paper is trying to address.

LVLMs generate content misaligned with their visual inputs
Existing methods train on predefined or randomly edited negative samples that do not reflect the model's actual errors
Evaluation protocols need to capture the trade-off between hallucination suppression and expressiveness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic contrastive data construction from model outputs
Real-time negative image synthesis via diffusion model
Adaptive alignment of textual and visual preferences
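The preference pairs produced above are consumed by a DPO-style objective. As a reference point, here is a minimal sketch of the standard (single-pair, text-side) DPO loss; the function name and the toy log-probability values in the usage are illustrative, and the paper's multimodal variant additionally contrasts image preferences:

```python
import math

def dpo_loss(logp_chosen, logp_rejected,
             ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """Standard DPO loss for one preference pair.

    Each argument is the summed log-probability of a response under
    the trainable policy or the frozen reference model.
    """
    margin = ((logp_chosen - ref_logp_chosen)
              - (logp_rejected - ref_logp_rejected))
    # -log(sigmoid(beta * margin)): small when the policy prefers the
    # chosen response more strongly than the reference model does.
    return -math.log(1.0 / (1.0 + math.exp(-beta * margin)))

# Toy values: the policy assigns the chosen response a higher
# log-probability than the rejected one, so the loss is small.
loss = dpo_loss(-1.0, -5.0, -2.0, -2.0)
```

With online, failure-driven pairs, the rejected response in each pair is one the model recently produced, so the gradient pushes probability mass away from its current hallucination modes.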
Shujun Liu
Fudan University
Siyuan Wang
University of Southern California
Zejun Li
Fudan University
vision-language · multi-modality
Jianxiang Wang
ByteDance
Cheng Zeng
ByteDance
Zhongyu Wei
Fudan University