🤖 AI Summary
To address parameter-update interference caused by conflicting token gradient directions within an expert in Mixture-of-Experts (MoE)-based Large Vision-Language Models (LVLMs), this work introduces a token-level gradient conflict modeling framework. It proposes a plug-and-play, gradient-aware regularization loss that dynamically reroutes conflicting tokens. The method requires no modification to the backbone architecture and is compatible with mainstream MoE-LVLM frameworks. By detecting gradient conflicts at the token level and reallocating conflicting tokens to other experts, it improves performance across multiple LVLM benchmarks, including VQAv2, OK-VQA, and TextVQA, without increasing model size, FLOPs, or inference latency, and it generalizes across diverse MoE configurations. The code is publicly released.
📝 Abstract
The Mixture-of-Experts (MoE) architecture has gained increasing attention in the study of Large Vision-Language Models (LVLMs). It replaces a dense model with a sparse one, achieving comparable performance while activating fewer parameters during inference, thus significantly reducing inference cost. Existing MoE methods in LVLMs encourage different experts to specialize in different tokens, and they usually employ a router to predict the routing of each token. However, the router is not optimized with respect to the distinct parameter-optimization directions produced by tokens within an expert, which may lead to severe interference between those tokens. To address this problem, we propose Solving Token Gradient Conflict (STGC), which applies token-level gradient analysis. Specifically, we first use token-level gradients to identify conflicting tokens within each expert. We then add a tailored regularization loss that encourages conflicting tokens to be routed from their current experts to other experts, reducing interference between tokens within an expert. Our method can serve as a plug-in for diverse LVLM methods, and extensive experimental results demonstrate its effectiveness. The code will be publicly available at https://github.com/longrongyang/STGC.
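The two steps described above can be sketched as follows. This is a minimal illustrative sketch, not the authors' implementation: it assumes per-token gradient vectors are already available, defines a token as "conflicting" when its gradient has negative cosine similarity with the mean gradient of the tokens assigned to the same expert, and penalizes the router probability that conflicting tokens place on their current expert. All function names, the threshold, and these design choices are assumptions for illustration.

```python
import numpy as np

def find_conflicting_tokens(token_grads, expert_ids, num_experts, thresh=0.0):
    """Flag tokens whose gradient opposes the mean gradient direction
    of the other tokens routed to the same expert (illustrative criterion)."""
    conflicts = np.zeros(len(expert_ids), dtype=bool)
    for e in range(num_experts):
        mask = expert_ids == e
        if mask.sum() < 2:          # need at least two tokens to define a conflict
            continue
        mean_g = token_grads[mask].mean(axis=0)
        for i in np.where(mask)[0]:
            g = token_grads[i]
            cos = g @ mean_g / (np.linalg.norm(g) * np.linalg.norm(mean_g) + 1e-8)
            if cos < thresh:        # gradient points against the expert's consensus
                conflicts[i] = True
    return conflicts

def routing_regularization(router_probs, expert_ids, conflicts):
    """Illustrative regularizer: the mean router probability that conflicting
    tokens assign to their current expert. Minimizing it pushes the router
    to send those tokens elsewhere."""
    if not conflicts.any():
        return 0.0
    idx = np.where(conflicts)[0]
    return float(router_probs[idx, expert_ids[idx]].mean())
```

In a real MoE-LVLM, this regularizer would be scaled and added to the task loss so that only the router (not the experts) is steered by it.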