Hyperbolic and Evidence-Prioritized Experts for Large Vision-Language Models

📅 2026-05-29
📈 Citations: 0
Influential: 0

🤖 AI Summary
This work addresses a limitation of existing Mixture-of-Experts (MoE) approaches in vision-language models: they typically employ symmetric architectures that overlook the inherent asymmetry between modalities. As a result, they struggle to capture the fact that text usually describes only part of the visual content, and language experts in deep layers tend to detach from the provided evidence. To overcome these issues, the authors propose AsyMoE, presented as the first method to explicitly model hierarchical inclusion relations between vision and language in hyperbolic space. AsyMoE introduces three expert groups: intra-modality experts for modality-specific processing, hyperbolic inter-modality experts that capture asymmetric, hierarchical cross-modal interactions, and evidence-priority language experts that suppress parametric memory activation and maintain contextual grounding. AsyMoE activates 25.45% fewer parameters than dense models while achieving average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks.
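To make the "activated parameters" figure more concrete, here is a minimal sketch of generic top-k sparse routing, the mechanism by which an MoE layer runs only a subset of its experts per input. It is not AsyMoE's router: the function names (`top_k_route`, `moe_forward`), the toy scalar experts, and the logits are all hypothetical and chosen only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k):
    """Return {expert_index: weight} for the k highest-scoring experts.

    Weights are renormalized so the selected experts sum to 1.
    """
    probs = softmax(router_logits)
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in chosen)
    return {i: probs[i] / norm for i in chosen}


def moe_forward(x, experts, router_logits, k=2):
    """Evaluate only the selected experts and mix their outputs."""
    weights = top_k_route(router_logits, k)
    return sum(w * experts[i](x) for i, w in weights.items())


# Hypothetical toy experts (scalar functions standing in for expert networks).
experts = [lambda x: x + 1.0, lambda x: 2.0 * x, lambda x: x * x, lambda x: -x]
logits = [0.1, 2.0, 1.5, -1.0]  # assumed router scores for one token

weights = top_k_route(logits, k=2)
y = moe_forward(3.0, experts, logits, k=2)
active_fraction = len(weights) / len(experts)  # share of experts actually run
```

With `k=2` out of four experts, only half of the expert functions are evaluated for this input, which is the sense in which a sparse model activates fewer parameters than a dense one of the same total size.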
📝 Abstract
Large Vision-Language Models (LVLMs) have demonstrated impressive performance on multimodal tasks through scaled architectures and extensive training. Recent studies introduce Mixture of Experts (MoE) into LVLMs for improved computational efficiency. However, existing MoE approaches treat visual and linguistic modalities with symmetric architectures, overlooking the inherent asymmetry in how these two modalities are processed. This asymmetry causes two critical issues. First, text and vision form hierarchical rather than parallel relationships, as text queries typically describe partial aspects of complete visual scenes. Euclidean expert space struggles to encode such containment structures. Second, language experts in deeper layers progressively shift from evidence-based processing to parametric memory dependence, losing grounding in the provided visual and linguistic information. To address these issues, we propose AsyMoE, a novel architecture that explicitly models this asymmetry through three specialized expert groups. Intra-modality experts handle modality-specific processing. Hyperbolic inter-modality experts capture hierarchical cross-modal relationships through negative curvature geometry. Evidence-priority language experts suppress parametric memory activation and maintain contextual grounding throughout network depth. Extensive experiments demonstrate that AsyMoE achieves consistent improvements over baseline methods, with average gains of 1.5% over MoE variants and up to 3.8% on hallucination-sensitive tasks. AsyMoE activates 25.45% fewer parameters compared to dense models.
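The abstract's point that negative curvature suits containment structures can be illustrated with the Poincaré ball, one standard model of hyperbolic space. The sketch below is an assumption made for illustration only: the abstract does not say which hyperbolic model or distance AsyMoE uses, and the function name `poincare_distance` and the sample points are hypothetical. It shows that the same Euclidean gap corresponds to a much larger hyperbolic distance near the boundary of the ball, which leaves room for many fine-grained "child" points around a coarse "parent" placed near the origin.

```python
import math


def poincare_distance(u, v):
    """Geodesic distance in the unit Poincare ball (curvature -1).

    d(u, v) = arcosh(1 + 2 * |u - v|^2 / ((1 - |u|^2) * (1 - |v|^2)))
    Both points must lie strictly inside the unit ball.
    """
    sq_norm = lambda p: sum(c * c for c in p)
    diff = sq_norm([a - b for a, b in zip(u, v)])
    denom = (1.0 - sq_norm(u)) * (1.0 - sq_norm(v))
    return math.acosh(1.0 + 2.0 * diff / denom)


# Two pairs with the same Euclidean separation (0.1 along the x-axis).
near_origin = poincare_distance([0.0, 0.0], [0.1, 0.0])
near_boundary = poincare_distance([0.8, 0.0], [0.9, 0.0])

# The pair closer to the boundary is several times farther apart
# in hyperbolic terms than the pair at the origin.
ratio = near_boundary / near_origin
```

Because distances grow rapidly toward the boundary, tree-like or part-whole structures tend to embed with less distortion here than in a Euclidean space of the same dimension, which is the general motivation usually given for hyperbolic representations.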
Problem

Research questions and friction points this paper is trying to address.

asymmetry
hierarchical relationships
evidence grounding
vision-language models
Mixture of Experts
Innovation

Methods, ideas, or system contributions that make the work stand out.

Asymmetric Mixture of Experts
Hyperbolic Geometry
Evidence-Prioritized Language Modeling
Multimodal Hierarchy
Hallucination Mitigation
🔎 Similar Papers
Zijie Zhou
China University of Petroleum (Beijing), Beijing, China; Hainan Institute of China University of Petroleum (Beijing), Sanya, Hainan, China
Dandan Zhu
China University of Petroleum (Beijing), Beijing, China; Hainan Institute of China University of Petroleum (Beijing), Sanya, Hainan, China
Hangxiangpan Wang
China University of Petroleum (Beijing), Beijing, China; Hainan Institute of China University of Petroleum (Beijing), Sanya, Hainan, China
Heng Zhang
South China Normal University, Foshan, Guangdong, China
Huishen Jiao
China University of Petroleum (Beijing), Beijing, China
Yi Zhao
Harbin Institute of Technology (Shenzhen), China
Applied nonlinear dynamics, Nonlinear time series analysis, Biomathematics, Data science