Transferring Textual Preferences to Vision-Language Understanding through Model Merging

📅 2025-02-19
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing large vision-language models (LVLMs) are weak at evaluating generated content, while vision-language reward models (VLRMs) typically require costly, human-annotated preference data for fine-tuning. Method: The paper proposes a training-free, zero-shot model merging approach that directly combines a pre-trained text-based reward model (an LLM-based RM) with an LVLM via parameter-space interpolation techniques (e.g., Task Arithmetic, SLERP), yielding a VLRM without any fine-tuning. Contribution/Results: This is the first work to achieve zero-shot cross-modal transfer from text RMs to vision-language tasks. The merged model significantly outperforms both baseline LVLM scoring and standalone text RMs on vision-language preference ranking, adds no inference overhead beyond standard LVLM inference, and eliminates preference-training costs entirely, breaking the conventional reliance on large-scale multimodal preference datasets.
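
For intuition, here is a minimal sketch of how a Task Arithmetic merge of this kind could look in PyTorch. This is not the authors' released code: the state-dict layout, key names, and the `alpha` scaling factor are illustrative assumptions, and it presumes the text RM and the LVLM's language backbone were fine-tuned from the same base LLM with matching parameter shapes.

```python
import torch

def merge_rm_into_lvlm(lvlm_state, rm_state, base_state, alpha=0.5):
    """Task-arithmetic merge (sketch): add the text RM's 'preference'
    task vector (its weight delta from the shared base LLM) to the
    LVLM's language backbone. Keys that exist only in the LVLM
    (vision encoder, projector) pass through unchanged."""
    merged = {}
    for name, w in lvlm_state.items():
        if name in rm_state and name in base_state:
            task_vector = rm_state[name] - base_state[name]  # preference delta
            merged[name] = w + alpha * task_vector
        else:
            merged[name] = w.clone()  # vision-side weights left untouched
    return merged

# Toy demo with random tensors standing in for real checkpoints.
base = {"lm.weight": torch.randn(4, 4)}
rm = {"lm.weight": base["lm.weight"] + 0.1 * torch.randn(4, 4)}
lvlm = {"lm.weight": base["lm.weight"] + 0.05 * torch.randn(4, 4),
        "vision.weight": torch.randn(4, 4)}
merged = merge_rm_into_lvlm(lvlm, rm, base, alpha=0.7)
```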

📝 Abstract
Large vision-language models (LVLMs) perform outstandingly across various multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) with preference data is computationally expensive. This paper explores a training-free alternative by merging text-based reward models (RMs) with LVLMs to create VLRMs. Our approach shows that integrating these models leads to improved performance over LVLMs' scoring and text-based RMs, offering an efficient method for incorporating textual preferences into LVLMs.
Problem

Research questions and friction points this paper is trying to address.

Enhancing LVLMs with textual preferences
Creating VLRMs without training costs
Improving content evaluation in LVLMs
Innovation

Methods, ideas, or system contributions that make the work stand out.

Merge text-based reward models into LVLMs (see the SLERP sketch after this list)
Create vision-language reward models
Enhance multimodal task performance
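
SLERP, the other interpolation technique the summary mentions, blends two weight tensors along the arc between their directions rather than along a straight line, which can better preserve weight norms when merging. A minimal sketch, assuming matching tensor shapes; the function name, `t` parameter, and near-parallel fallback are illustrative, not the authors' implementation:

```python
import torch

def slerp(w_a, w_b, t=0.5, eps=1e-8):
    """Spherical linear interpolation between two weight tensors.
    t=0 returns w_a, t=1 returns w_b; intermediate t traces the arc
    between the two (flattened) weight directions."""
    a, b = w_a.flatten().float(), w_b.flatten().float()
    cos_omega = torch.clamp(
        (a / (a.norm() + eps)) @ (b / (b.norm() + eps)), -1.0, 1.0)
    omega = torch.acos(cos_omega)  # angle between the two weight vectors
    if omega.abs() < eps:
        out = (1 - t) * a + t * b  # nearly parallel: plain LERP is stable
    else:
        so = torch.sin(omega)
        out = (torch.sin((1 - t) * omega) / so) * a \
            + (torch.sin(t * omega) / so) * b
    return out.view_as(w_a)

# Example: blend one layer of a text RM with the matching LVLM layer.
layer_rm = torch.randn(8, 8)
layer_lvlm = torch.randn(8, 8)
blended = slerp(layer_rm, layer_lvlm, t=0.5)
```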