Eve: Efficient Multimodal Vision Language Models with Elastic Visual Experts

📅 2025-01-08
📈 Citations: 0
Influential: 0
🤖 AI Summary
To balance performance and efficiency when deploying multimodal vision-language models (VLMs) on edge devices, this paper proposes Eve, an architecture built around Elastic Visual Experts. Lightweight visual expert modules are injected at multiple stages of training, enabling progressive adaptation of visual capabilities while preserving linguistic ability. Combined with parameter-efficient fine-tuning and vision-language co-distillation, Eve substantially improves cross-modal alignment in compact models. According to the paper, this is the first work to achieve state-of-the-art performance in both language understanding and multimodal perception simultaneously within the ≤3B-parameter regime: the 1.8B-parameter Eve model outperforms same-scale models on mainstream language benchmarks and attains 68.87% accuracy on VLM benchmarks, surpassing the 7B-parameter LLaVA-1.5. This makes Eve a practical option for efficient, lightweight VLM deployment on resource-constrained edge platforms.
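The summary above describes elastic visual experts as lightweight modules injected alongside a frozen language backbone during training. The paper's exact design is not reproduced here; the following is a minimal PyTorch-style sketch, assuming a gated expert branch attached to a frozen transformer block. All module names, dimensions, and the zero-initialized gate are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of an "elastic" visual expert: a small gated branch that can
# be attached to a frozen block at a given training stage. Illustrative only.
import torch
import torch.nn as nn


class ElasticVisualExpert(nn.Module):
    """Lightweight expert branch added alongside a frozen block (assumed design)."""

    def __init__(self, hidden_dim: int, expert_dim: int = 256):
        super().__init__()
        self.expert = nn.Sequential(
            nn.Linear(hidden_dim, expert_dim),
            nn.GELU(),
            nn.Linear(expert_dim, hidden_dim),
        )
        # Scalar gate initialized to zero so the newly injected expert starts
        # as a no-op and is learned progressively during its training stage.
        self.gate = nn.Parameter(torch.zeros(1))

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return hidden_states + torch.tanh(self.gate) * self.expert(hidden_states)


class BlockWithExpert(nn.Module):
    """Wraps a frozen block and routes its output through the visual expert."""

    def __init__(self, frozen_block: nn.Module, hidden_dim: int):
        super().__init__()
        self.block = frozen_block
        for p in self.block.parameters():
            p.requires_grad = False  # keep the language side frozen
        self.visual_expert = ElasticVisualExpert(hidden_dim)

    def forward(self, hidden_states: torch.Tensor) -> torch.Tensor:
        return self.visual_expert(self.block(hidden_states))


if __name__ == "__main__":
    # Toy stand-in for a transformer block, just to show the wrapping.
    base_block = nn.Sequential(nn.Linear(512, 512), nn.GELU(), nn.Linear(512, 512))
    wrapped = BlockWithExpert(base_block, hidden_dim=512)
    tokens = torch.randn(2, 16, 512)  # (batch, sequence, hidden)
    print(wrapped(tokens).shape)      # torch.Size([2, 16, 512])
```

Because the gate starts at zero, injecting an expert at a later stage leaves the frozen backbone's behavior unchanged until the expert is trained, which is one way the "preserve linguistic ability" goal could be realized.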

📝 Abstract
Multimodal vision language models (VLMs) have made significant progress with the support of continuously increasing model sizes and data volumes. Running VLMs on edge devices, however, remains a challenge for their widespread application. Several efficient VLM efforts exist, but they often sacrifice linguistic capabilities to enhance multimodal abilities, or require extensive training. To address this quandary, we introduce the innovative framework of Efficient Vision Language Models with Elastic Visual Experts (Eve). By strategically incorporating adaptable visual expertise at multiple stages of training, Eve strikes a balance between preserving linguistic abilities and augmenting multimodal capabilities. This balanced approach results in a versatile model with only 1.8B parameters that delivers significant improvements in both multimodal and linguistic tasks. Notably, in configurations below 3B parameters, Eve distinctly outperforms other models on language benchmarks and achieves a state-of-the-art result of 68.87% on VLM benchmarks. Additionally, its multimodal accuracy outstrips that of the larger 7B LLaVA-1.5 model.
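The abstract emphasizes incorporating visual expertise at multiple stages of training while keeping linguistic ability intact. As one possible reading, the sketch below shows a staged schedule in which only selected modules are unfrozen per stage; the stage ordering, module names, and hyperparameters are hypothetical and do not reproduce the paper's training recipe.

```python
# Hypothetical staged-training sketch: each stage unfreezes only selected
# modules while the language backbone stays frozen. Illustrative assumptions
# throughout, not the paper's recipe.
import torch
import torch.nn as nn


def run_stage(trainable, all_modules):
    """Freeze everything, unfreeze this stage's modules, build an optimizer."""
    for module in all_modules:
        for p in module.parameters():
            p.requires_grad = False
    params = []
    for module in trainable:
        for p in module.parameters():
            p.requires_grad = True
            params.append(p)
    optimizer = torch.optim.AdamW(params, lr=1e-4)
    # ... the stage's training loop over image-text batches would run here ...
    return optimizer


# Placeholder components standing in for the real backbone, projector, experts.
language_backbone = nn.Linear(512, 512)
vision_projector = nn.Linear(768, 512)
visual_experts = nn.ModuleList([nn.Linear(512, 512) for _ in range(4)])
all_modules = [language_backbone, vision_projector, visual_experts]

# Stage 1: align modalities by training only the projector.
run_stage([vision_projector], all_modules)
# Stage 2: inject the elastic visual experts and train them with the projector.
run_stage([vision_projector, visual_experts], all_modules)
```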
Problem

Research questions and friction points this paper is trying to address.

Efficient Inference
Multimodal Vision-Language Models
Resource-Constrained Devices
Innovation

Methods, ideas, or system contributions that make the work stand out.

Eve model
multimodal vision-language
efficient performance