🤖 AI Summary
Existing Mixture-of-Experts (MoE) frameworks in Large Vision-Language Models (LVLMs) employ uniform load-balancing routing, overlooking a fundamental distributional disparity between visual and linguistic tokens: language token-to-expert routing is approximately uniform, whereas visual routing exhibits a pronounced long-tailed distribution. This work identifies this cross-modal distributional heterogeneity and proposes the Long-Tailed Distribution-aware Router (LTDR), a modality-adaptive routing mechanism. LTDR's core innovations are: (1) distinct routing strategies tailored separately to language and visual tokens; and (2) an oversampling-inspired heuristic that increases the number of activated experts for tail-region visual tokens, thereby enhancing the learning of sparse visual representations. Evaluated across multiple mainstream multimodal benchmarks, LTDR consistently outperforms conventional load-balanced routing, achieving superior vision-language understanding while preserving computational efficiency.
📝 Abstract
The mixture-of-experts (MoE) paradigm, which replaces dense models with sparse architectures, has gained attention in large vision-language models (LVLMs) for achieving comparable performance with fewer activated parameters. Existing MoE frameworks for LVLMs focus on token-to-expert routing (TER), encouraging different experts to specialize in processing distinct tokens. However, these frameworks often rely on a load-balancing mechanism, overlooking the inherent distributional differences between vision and language. To this end, we propose a Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, tackling two challenges: (1) Distribution-aware routing for modality-specific strategies. We observe that language TER follows a uniform distribution, whereas vision TER exhibits a long-tailed distribution. This discrepancy necessitates distinct routing strategies tailored to each modality. (2) Enhancing expert activation for vision tail tokens. Recognizing the importance of vision tail tokens, we introduce an oversampling-like strategy that increases the number of activated experts for these tokens. Extensive experiments on standard benchmarks validate the effectiveness of our approach.
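The routing idea described above can be sketched as follows. This is a minimal illustrative NumPy toy, not the paper's implementation: the function name, the parameters `base_k`/`tail_k`, and the use of a low peak routing probability as a proxy for "tail" visual tokens are all assumptions made here for clarity. Language tokens receive the standard top-k experts, while visual tokens flagged as tail-region receive more activated experts, mimicking the oversampling-like strategy.

```python
import numpy as np

def long_tail_aware_route(logits, is_vision, base_k=2, tail_k=4, tail_quantile=0.2):
    """Toy sketch of modality-adaptive, tail-aware top-k routing.

    logits    : (num_tokens, num_experts) raw router scores
    is_vision : (num_tokens,) boolean mask marking visual tokens

    All parameter names/values here are illustrative assumptions,
    not the paper's actual hyperparameters.
    """
    # Softmax over experts to get per-token routing probabilities.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Tail proxy (assumption): visual tokens whose peak routing
    # probability falls in the bottom `tail_quantile` of visual tokens.
    max_p = probs.max(axis=1)
    vision_peaks = max_p[is_vision]
    thresh = np.quantile(vision_peaks, tail_quantile) if vision_peaks.size else 0.0
    is_tail = is_vision & (max_p <= thresh)

    # Language tokens (and head vision tokens) get base_k experts;
    # tail vision tokens get tail_k experts (oversampling-like boost).
    assignments = []
    for i in range(logits.shape[0]):
        k = tail_k if is_tail[i] else base_k
        assignments.append(np.argsort(-probs[i])[:k].tolist())
    return assignments
```

In a real sparse MoE layer the extra activated experts for tail tokens would raise FLOPs only on that small token subset, which is why the overall computational cost stays close to uniform top-k routing.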