Language-Specific Layer Matters: Efficient Multilingual Enhancement for Large Vision-Language Models

📅 2025-08-25
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the imbalanced multilingual capabilities of Large Vision-Language Models (LVLMs), this paper proposes PLAST: a parameter-efficient fine-tuning method that identifies language-specific neurons via activation analysis and selectively fine-tunes only the shallow-layer modules most critical for linguistic understanding—accounting for just 14% of total parameters—while training on question-translation pairs to enforce cross-lingual alignment. PLAST requires neither full-model fine-tuning nor external data augmentation, and it significantly improves low-resource language comprehension and complex visual reasoning. Evaluated on multilingual multimodal benchmarks—including MM-Bench and MMMB—PLAST achieves consistent performance gains with minimal parameter overhead, demonstrating high efficiency, strong generalization, and scalable cross-lingual transfer. This work establishes a lightweight paradigm for multilingual LVLM optimization.
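The layer-identification step described above can be sketched roughly as follows. This is an illustrative assumption, not the paper's actual implementation: it scores each layer by how much its per-neuron activations vary across languages and keeps the top-scoring layers. All function names and the scoring rule are hypothetical.

```python
# Hypothetical sketch of PLAST-style layer selection: a layer whose neurons
# fire very differently depending on the input language is treated as
# language-specific and becomes a fine-tuning candidate.
from statistics import pvariance


def layer_specificity(acts_by_lang):
    """Mean per-neuron variance of activations across languages.

    acts_by_lang: {language: [mean activation per neuron]}.
    A high score means the layer's neurons are language-specific.
    """
    langs = list(acts_by_lang)
    n_neurons = len(acts_by_lang[langs[0]])
    per_neuron_var = [
        pvariance([acts_by_lang[lang][i] for lang in langs])
        for i in range(n_neurons)
    ]
    return sum(per_neuron_var) / n_neurons


def select_layers(acts, k):
    """Return the indices of the k most language-specific layers.

    acts: {layer_index: {language: [mean activation per neuron]}}.
    """
    scores = {layer: layer_specificity(by_lang) for layer, by_lang in acts.items()}
    top = sorted(scores, key=scores.get, reverse=True)[:k]
    return sorted(top)


# Toy example: layer 0's activations flip per language, layer 1's do not.
acts = {
    0: {"en": [1.0, 0.0], "zh": [0.0, 1.0]},
    1: {"en": [0.5, 0.5], "zh": [0.5, 0.5]},
}
print(select_layers(acts, k=1))  # → [0]
```

In the paper these shallow, language-specific layers are the ones subsequently fine-tuned; the variance-based score here merely stands in for whatever activation statistic the authors actually monitor.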

📝 Abstract
Large vision-language models (LVLMs) have demonstrated exceptional capabilities in understanding visual information with human languages but also exhibit an imbalance in multilingual capabilities. In this work, we delve into the multilingual working pattern of LVLMs and identify a salient correlation between the multilingual understanding ability of LVLMs and language-specific neuron activations in shallow layers. Building on this insight, we introduce PLAST, a training recipe that achieves efficient multilingual enhancement for LVLMs by Precise LAnguage-Specific layers fine-Tuning. PLAST first identifies layers involved in multilingual understanding by monitoring language-specific neuron activations. These layers are then precisely fine-tuned with question-translation pairs to achieve multilingual alignment. Our empirical results on MM-Bench and MMMB demonstrate that PLAST effectively improves the multilingual capabilities of LVLMs and achieves significant efficiency with only 14% of the parameters tuned. Further analysis reveals that PLAST can be generalized to low-resource and complex visual reasoning tasks, facilitating the language-specific visual information engagement in shallow layers.
Problem

Research questions and friction points this paper is trying to address.

Addresses multilingual capability imbalance in vision-language models
Identifies language-specific neuron activations in shallow layers
Proposes efficient fine-tuning method for multilingual enhancement
Innovation

Methods, ideas, or system contributions that make the work stand out.

Precise language-specific layer fine-tuning
Monitors neuron activations for multilingual understanding
Uses question-translation pairs for alignment
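The parameter-efficiency claim above (only ~14% of parameters tuned) amounts to freezing everything except the identified shallow layers. A minimal sketch, with made-up layer names and parameter counts chosen purely to illustrate the bookkeeping:

```python
# Illustrative sketch of PLAST's selective fine-tuning budget: freeze every
# module except the identified language-specific layers and report the
# fraction of parameters that remain trainable. All numbers are invented.

def trainable_fraction(param_counts, tuned_layers):
    """param_counts: {module_name: parameter count};
    tuned_layers: names of modules left trainable (everything else frozen)."""
    total = sum(param_counts.values())
    tuned = sum(n for name, n in param_counts.items() if name in tuned_layers)
    return tuned / total


# Hypothetical model: 32 equal-sized layers plus embeddings; suppose the
# activation analysis flagged the five shallowest layers.
param_counts = {f"layer_{i}": 100 for i in range(32)}
param_counts["embeddings"] = 300
tuned = {f"layer_{i}" for i in range(5)}
print(f"{trainable_fraction(param_counts, tuned):.0%}")  # → 14%
```

The toy sizes were picked so the tuned fraction lands near the paper's reported 14%; the real figure depends on the actual architecture and which layers the activation analysis selects.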
Authors

Yuchun Fan
Northeastern University

Yilin Wang
NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China

Yongyu Mu
Northeastern University
Multilingualism, Machine Translation, Efficient Models

Lei Huang
Harbin Institute of Technology, Harbin, China

Bei Li
Meituan LLM Team
Machine Translation, Deep Learning, Large Language Models

Xiaocheng Feng
Harbin Institute of Technology
NLP, Deep Learning, Machine Learning

Tong Xiao
NLP Lab, School of Computer Science and Engineering, Northeastern University, Shenyang, China; NiuTrans Research, Shenyang, China

Jingbo Zhu
Northeastern University, China
Machine Translation, Language Parsing, Natural Language Processing