Metropolis-Hastings Captioning Game: Knowledge Fusion of Vision Language Models via Decentralized Bayesian Inference

📅 2025-04-13
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing vision-language model (VLM) fusion approaches suffer from high inference overhead and architectural coupling. To address this, we propose the Metropolis-Hastings Captioning Game (MHCG), a decentralized, language-game-based framework for multi-VLM knowledge fusion. MHCG enables collaborative inference without model modification or joint training, leveraging alternating image caption generation and cross-model mutual learning. Its key innovation lies in establishing a reference-free, vocabulary-level knowledge sharing paradigm among VLMs, integrating Metropolis-Hastings sampling, multi-agent distillation, and dual evaluation via BERTScore and CLIPScore. In two-VLM experiments, MHCG achieves significant gains in reference-free metrics (+2.1 BERTScore, +3.4 CLIPScore) and empirically validates effective cross-dataset transfer and alignment of category-specific vocabulary distributions across heterogeneous VLMs.
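The game loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `generate_caption`, `log_likelihood`, and `update_on` are hypothetical agent methods, and the exact acceptance ratio used in the paper may differ from the standard Metropolis-Hastings form shown here.

```python
import math
import random


def mh_accept(log_p_current, log_p_proposed, rng=random.random):
    """Metropolis-Hastings acceptance test for a proposed caption.

    log_p_* are the listener agent's log-likelihoods of the current and
    proposed captions given the image (hypothetical scores). The proposal
    is accepted with probability min(1, p_proposed / p_current).
    """
    ratio = math.exp(min(0.0, log_p_proposed - log_p_current))
    return rng() < ratio


def captioning_game_round(speaker, listener, image, current_caption):
    """One illustrative MHCG round: the speaker proposes a caption, the
    listener accepts or rejects it via Metropolis-Hastings, and on
    acceptance the listener learns from the caption (distillation step).
    Agents swap roles between rounds in the alternating scheme."""
    proposal = speaker.generate_caption(image)  # hypothetical API
    if mh_accept(listener.log_likelihood(image, current_caption),
                 listener.log_likelihood(image, proposal)):
        current_caption = proposal
        listener.update_on(image, current_caption)  # hypothetical fine-tuning hook
    return current_caption
```

Because acceptance is evaluated under the *listener's* likelihood, each agent only adopts captions its own model finds plausible, which is what lets the fusion proceed without joint training or reference captions.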

📝 Abstract
We propose the Metropolis-Hastings Captioning Game (MHCG), a method to fuse knowledge of multiple vision-language models (VLMs) by learning from each other. Although existing methods that combine multiple models suffer from inference costs and architectural constraints, MHCG avoids these problems by performing decentralized Bayesian inference through a process resembling a language game. The knowledge fusion process establishes communication between two VLM agents alternately captioning images and learning from each other. We conduct two image-captioning experiments with two VLMs, each pre-trained on a different dataset. The first experiment demonstrates that MHCG achieves consistent improvement in reference-free evaluation metrics. The second experiment investigates how MHCG contributes to sharing VLMs' category-level vocabulary by observing the occurrence of the vocabulary in the generated captions.
Problem

Research questions and friction points this paper is trying to address.

Fuse knowledge of multiple vision-language models
Avoid inference costs and architectural constraints
Improve reference-free evaluation metrics
Innovation

Methods, ideas, or system contributions that make the work stand out.

Decentralized Bayesian inference for model fusion
Language game approach for VLM communication
Improves reference-free evaluation metrics consistently
Yuta Matsui
Ritsumeikan University
Ryosuke Yamaki
Ritsumeikan University
Ryo Ueda
The University of Tokyo
computational linguistics, emergent communication, language emergence
Seitaro Shinagawa
Nara Institute of Science and Technology
Tadahiro Taniguchi
Kyoto University
symbol emergence, artificial intelligence, machine learning, cognitive robotics