MoCHA: Advanced Vision-Language Reasoning with MoE Connector and Hierarchical Group Attention

📅 2025-07-30
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing vision large language models (VLLMs) suffer from high computational overhead, insufficient capture of fine-grained visual detail, and weak cross-modal alignment. To address these bottlenecks, this paper proposes MoCHA, a modular, hierarchical, and collaborative architecture. MoCHA introduces sparse Mixture-of-Experts Connectors (MoECs) for dynamic routing and lightweight fusion of visual features, and Hierarchical Group Attention (HGA) to enhance multi-scale visual detail modeling. The framework integrates four vision backbones (CLIP, SigLIP, DINOv2, and ConvNeXt) and is compatible with language models including Phi-2 (2.7B) and Vicuna-7B. Extensive experiments show that MoCHA achieves a 3.25% improvement on POPE and a 153-point gain on MME, significantly mitigates hallucination, and strengthens visual instruction following. It outperforms leading open-weight VLLMs across multiple benchmarks, establishing a stronger efficiency-accuracy trade-off in fine-grained visual understanding.
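The paper does not include reference code here; the sketch below is a minimal, hypothetical PyTorch rendering of a sparse Mixture-of-Experts connector of the kind the summary describes: a router picks the top-k expert MLPs per visual token and their gated outputs project vision features into the LLM embedding space. All dimensions, expert counts, and layer choices are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MoEConnector(nn.Module):
    """Illustrative sparse MoE connector (assumed design, not MoCHA's code).

    Each visual token is routed to its top-k of E expert MLPs, which
    project vision features (d_vis) into the LLM space (d_llm).
    """

    def __init__(self, d_vis: int, d_llm: int, num_experts: int = 4, top_k: int = 2):
        super().__init__()
        self.top_k = top_k
        self.router = nn.Linear(d_vis, num_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_vis, d_llm), nn.GELU(), nn.Linear(d_llm, d_llm))
            for _ in range(num_experts)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, d_vis)
        logits = self.router(x)                               # (B, N, E)
        weights, idx = logits.topk(self.top_k, dim=-1)        # both (B, N, k)
        weights = F.softmax(weights, dim=-1)                  # renormalise over top-k
        # Dense evaluation of all experts for clarity; a real sparse
        # implementation would dispatch tokens only to selected experts.
        all_out = torch.stack([e(x) for e in self.experts], dim=2)  # (B, N, E, d_llm)
        picked = torch.gather(
            all_out, 2,
            idx.unsqueeze(-1).expand(*idx.shape, all_out.size(-1)),
        )                                                     # (B, N, k, d_llm)
        return (weights.unsqueeze(-1) * picked).sum(dim=2)    # (B, N, d_llm)
```

A usage call would map, say, 576 vision tokens of width 1024 into a 2560-wide Phi-2 embedding space while activating only two of the four experts per token.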

📝 Abstract
Vision large language models (VLLMs) focus primarily on handling complex and fine-grained visual information by incorporating advanced vision encoders and scaling up visual models. However, these approaches incur high training and inference costs and struggle to extract visual details and to bridge modalities effectively. In this work, we propose a novel visual framework, MoCHA, to address these issues. Our framework integrates four vision backbones (i.e., CLIP, SigLIP, DINOv2, and ConvNeXt) to extract complementary visual features and is equipped with a sparse Mixture-of-Experts Connectors (MoECs) module to dynamically select experts tailored to different visual dimensions. To mitigate redundant or insufficient use of the visual information encoded by the MoECs module, we further design a Hierarchical Group Attention (HGA) with intra- and inter-group operations and an adaptive gating strategy for encoded visual features. We train MoCHA on two mainstream LLMs (i.e., Phi2-2.7B and Vicuna-7B) and evaluate its performance across various benchmarks. Notably, MoCHA outperforms state-of-the-art open-weight models on various tasks. For example, compared to CuMo (Mistral-7B), our MoCHA (Phi2-2.7B) mitigates hallucination more effectively, improving POPE by 3.25%, and follows visual instructions better, gaining 153 points on MME. Finally, ablation studies confirm the effectiveness and robustness of the proposed MoECs and HGA in improving the overall performance of MoCHA.
Problem

Research questions and friction points this paper is trying to address.

High training and inference costs in VLLMs
Challenges in extracting fine-grained visual details
Ineffective bridging across visual and language modalities
Innovation

Methods, ideas, or system contributions that make the work stand out.

Integrates four vision backbones for complementary features
Routes visual features dynamically through sparse Mixture-of-Experts Connectors
Implements Hierarchical Group Attention with adaptive gating
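The HGA component listed above is described only at a high level; the following is a speculative PyTorch sketch of one plausible reading: tokens are split into groups, attended within each group (intra), group summaries are attended across groups (inter), and the two paths are fused by a learned sigmoid gate. Group count, head count, and the mean-pooled summary are all assumptions for illustration.

```python
import torch
import torch.nn as nn

class HierarchicalGroupAttention(nn.Module):
    """Hypothetical HGA sketch (assumed design, not the paper's code):
    intra-group attention, inter-group attention over group summaries,
    and an adaptive sigmoid gate fusing the two paths."""

    def __init__(self, dim: int, num_groups: int = 4, num_heads: int = 4):
        super().__init__()
        self.num_groups = num_groups
        self.intra = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.inter = nn.MultiheadAttention(dim, num_heads, batch_first=True)
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, x: torch.Tensor) -> torch.Tensor:  # x: (B, N, D), G | N
        B, N, D = x.shape
        G = self.num_groups
        g = x.view(B * G, N // G, D)
        intra, _ = self.intra(g, g, g)                 # attention within each group
        intra = intra.reshape(B, N, D)
        summary = intra.view(B, G, N // G, D).mean(2)  # (B, G, D) group summaries
        inter, _ = self.inter(summary, summary, summary)   # attention across groups
        inter = inter.repeat_interleave(N // G, dim=1)     # broadcast back to (B, N, D)
        gate = self.gate(x)                                # adaptive per-channel gate
        return gate * intra + (1 - gate) * inter
```

The gate lets the model weight fine local detail (intra path) against global cross-group context (inter path) per token and channel, which matches the stated goal of avoiding both redundant and insufficient use of encoded visual features.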
Yuqi Pang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Bowen Yang
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Yun Cao
Researcher, Tencent
Fan Rong
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Xiaoyu Li
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China
Chen He
Institute of Information Engineering, Chinese Academy of Sciences, Beijing, China