🤖 AI Summary
Existing large vision-language models (LVLMs) are weak at evaluating generated content, while vision-language reward models (VLRMs) typically require costly, human-annotated preference data for fine-tuning. Method: We propose a training-free, zero-shot model fusion approach that combines a pre-trained text-based reward model (LLM-based RM) with an LVLM via parameter-space interpolation techniques (e.g., Task Arithmetic, SLERP), yielding a VLRM without any fine-tuning. Contribution/Results: To our knowledge, this is the first work to achieve zero-shot cross-modal transfer from text RMs to vision-language tasks. The fused model significantly outperforms both baseline LVLMs and standalone text RMs on vision-language preference ranking, adds no overhead beyond standard LVLM inference, and eliminates training costs entirely, breaking the conventional reliance on large-scale multimodal preference datasets.
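The merging techniques named above can be sketched concretely. Below is a minimal, hedged illustration of Task Arithmetic and SLERP applied to parameter vectors; it assumes the text RM and the LVLM share a common base backbone (so their parameters are alignable), and it models each "model" as a flat list of floats rather than per-layer tensors. The function and variable names (`base`, `rm`, `lvlm`, `alpha`) are illustrative, not from the paper.

```python
import math

def task_arithmetic(base, rm, lvlm, alpha=1.0):
    """theta_merged = theta_lvlm + alpha * (theta_rm - theta_base).

    The "task vector" (rm - base) captures what reward fine-tuning added
    to the shared text backbone; adding it to the LVLM's parameters
    transfers the preference-scoring behavior without any training.
    """
    return [l + alpha * (r - b) for b, r, l in zip(base, rm, lvlm)]

def slerp(a, b, t=0.5):
    """Spherical linear interpolation between two parameter vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    cos_omega = max(-1.0, min(1.0, dot / (na * nb)))
    omega = math.acos(cos_omega)
    if omega < 1e-8:  # nearly parallel: fall back to linear interpolation
        return [(1 - t) * x + t * y for x, y in zip(a, b)]
    s = math.sin(omega)
    wa, wb = math.sin((1 - t) * omega) / s, math.sin(t * omega) / s
    return [wa * x + wb * y for x, y in zip(a, b)]

# Toy example: 3-parameter "models" sharing a common base.
base = [0.0, 1.0, 2.0]
rm   = [0.5, 1.0, 2.5]   # text RM fine-tuned from the base
lvlm = [1.0, 2.0, 2.0]   # LVLM also initialized from the base

merged = task_arithmetic(base, rm, lvlm)
print(merged)  # [1.5, 2.0, 2.5]
```

In practice these operations would run over each entry of a `state_dict` of tensors (with the vision encoder and projector kept from the LVLM, since the text RM has no counterpart for them), and `alpha` or `t` would be tuned on a small validation set.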
📝 Abstract
Large vision-language models (LVLMs) perform strongly across a wide range of multimodal tasks. However, their ability to evaluate generated content remains limited, and training vision-language reward models (VLRMs) on preference data is computationally expensive. This paper explores a training-free alternative: merging text-based reward models (RMs) with LVLMs to create VLRMs. We show that the merged models outperform both the LVLMs' own scoring and the standalone text-based RMs, offering an efficient way to inject textual preferences into LVLMs.