🤖 AI Summary
Existing parameter-efficient fine-tuning (PEFT) methods for multimodal instruction tuning underperform full fine-tuning, suffer from poor interpretability, and lack fine-grained behavioral control. Method: We propose Multimodal Representation Tuning (MRT), the first approach to shift instruction alignment from the parameter space to a semantically rich representation space. MRT enables efficient and controllable modulation of model behavior via explicit editing of critical representation units (e.g., instrumental tokens), lightweight projection modules, instruction-aware token masking and reweighting, and a semantic consistency alignment loss. Contribution/Results: With only 0.03% trainable parameters, MRT achieves 1580.40 on the MME benchmark, significantly surpassing state-of-the-art methods while offering strong interpretability and minimal computational overhead.
📝 Abstract
Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., a 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% of the parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.
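The core idea of tuning in representation space rather than parameter space can be sketched as follows. This is a minimal, hypothetical illustration, not the paper's implementation: the dimensions, the low-rank form of the projection module, and the choice of "instrumental" token positions are all assumptions made for the example. It shows frozen hidden states being edited only at masked token positions by a small trainable projection, with a trainable-parameter count that is tiny relative to a full model.

```python
# Hypothetical sketch of representation editing in the spirit of MRT.
# All names, sizes, and token positions are illustrative assumptions.
import numpy as np

rng = np.random.default_rng(0)

d_model, rank, n_tokens = 64, 4, 10  # assumed sizes for illustration

# Frozen multimodal hidden states for one sequence (tokens x dim).
hidden = rng.standard_normal((n_tokens, d_model))

# Lightweight low-rank projection module: the only trainable parameters.
W_down = rng.standard_normal((d_model, rank)) * 0.01
W_up = rng.standard_normal((rank, d_model)) * 0.01

# Instruction-aware mask selecting "instrumental" tokens to edit.
edit_mask = np.zeros(n_tokens, dtype=bool)
edit_mask[[0, 3, 7]] = True  # assumed positions for this example

# Edit only the selected representations; leave the rest untouched.
delta = hidden @ W_down @ W_up
edited = np.where(edit_mask[:, None], hidden + delta, hidden)

# The trainable footprint is tiny relative to a full model.
trainable = W_down.size + W_up.size  # 2 * d_model * rank = 512
```

Because only `W_down` and `W_up` are updated during training, the edit is cheap to learn and easy to inspect or turn off per token, which is the kind of direct, controllable manipulation the abstract describes.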