Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

📅 2026-05-29

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing Mixture-of-Experts (MoE) approaches in vision-language models typically use symmetric architectures that overlook the inherent asymmetry between modalities. As a result, they struggle to capture the fact that text usually describes only part of the visual content, and their language experts tend to detach from the provided evidence in deeper layers. To address this, the authors propose AsyMoE, the first method to explicitly model hierarchical inclusion relations between vision and language in hyperbolic space. AsyMoE introduces three expert types: intra-modal experts for unimodal feature processing, hyperbolic cross-modal experts that capture asymmetric interactions, and evidence-prioritized language experts that suppress parametric memorization and strengthen contextual grounding. AsyMoE activates 25.45% fewer parameters than dense models, while improving on MoE variants by 1.5% on average and by up to 3.8% on hallucination-sensitive benchmarks.
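For readers unfamiliar with why an MoE model activates fewer parameters than a dense one, the following is a minimal sketch of generic sparse top-k routing. It is not AsyMoE's router, which this summary does not describe; the function name `top_k_route`, the value of `k`, and the example logits are all assumptions made for illustration.

```python
import math

def top_k_route(logits, k=2):
    """Generic sparse MoE gating (illustrative, not the paper's router).

    Takes one gate logit per expert and returns {expert_index: weight}
    for the k highest-scoring experts, with weights renormalized to sum to 1.
    """
    # Numerically stable softmax over all expert logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]

    # Keep only the k most probable experts; the rest are never evaluated,
    # which is where the saving in activated parameters comes from.
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return {i: probs[i] / norm for i in top}

# Hypothetical gate logits for four experts; only two are selected.
weights = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
print(weights)
```

With `k` smaller than the number of experts, each token passes through only a subset of the expert parameters, so the activated-parameter count drops even though the total parameter count does not.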

📝 Abstract

Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.
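To give some intuition for why negative curvature suits containment structures, here is a small sketch of the geodesic distance in the Poincaré ball, one common model of hyperbolic space. The abstract does not say which hyperbolic model or curvature AsyMoE uses, so the choice of the Poincaré ball, the curvature parameter `c`, and the example points are assumptions for illustration only.

```python
import math

def poincare_distance(u, v, c=1.0):
    """Geodesic distance between u and v in the Poincare ball of curvature -c.

    Both points must lie strictly inside the ball, i.e. c * ||x||^2 < 1.
    """
    sq_u = sum(x * x for x in u)
    sq_v = sum(x * x for x in v)
    sq_diff = sum((a - b) ** 2 for a, b in zip(u, v))
    denom = (1.0 - c * sq_u) * (1.0 - c * sq_v)
    return math.acosh(1.0 + 2.0 * c * sq_diff / denom) / math.sqrt(c)

# Two pairs of points with the same Euclidean gap (0.1), one pair near the
# origin and one near the boundary of the unit ball.
d_near_origin = poincare_distance((0.0, 0.0), (0.1, 0.0))
d_near_boundary = poincare_distance((0.85, 0.0), (0.95, 0.0))
print(d_near_origin, d_near_boundary)
```

The same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary than near the origin. This growing "room" toward the boundary is what lets hyperbolic embeddings place general items near the center and many specific items farther out, which is the usual argument for using them to represent hierarchies.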

Problem

Research questions and friction points this paper is trying to address.

asymmetry

hierarchical relationships

evidence grounding

visual-language models

Mixture of Experts

Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Mixture of Experts

Hyperbolic Geometry

Evidence-Prioritized Language Modeling