Re-Imagining Multimodal Instruction Tuning: A Representation View

📅 2025-03-02
📈 Citations: 0
Influential: 0
📄 PDF
🤖 AI Summary
Existing parameter-efficient fine-tuning (PEFT) methods for multimodal instruction tuning underperform full fine-tuning, suffer from poor interpretability, and lack fine-grained behavioral control. Method: We propose Multimodal Representation Tuning (MRT), the first approach to shift instruction alignment from the parameter space to a semantically rich representation space. MRT enables efficient and controllable model behavior modulation via explicit editing of critical representation units (e.g., instrumental tokens), lightweight projection modules, instruction-aware token masking and reweighting, and a semantic consistency alignment loss. Contribution/Results: With only 0.03% trainable parameters, MRT achieves 1580.40 on the MME benchmark—significantly surpassing state-of-the-art methods—while delivering high performance, strong interpretability, and minimal computational overhead.

Technology Category

Application Category

📝 Abstract
Multimodal instruction tuning has proven to be an effective strategy for achieving zero-shot generalization by fine-tuning pre-trained Large Multimodal Models (LMMs) with instruction-following data. However, as the scale of LMMs continues to grow, fully fine-tuning these models has become highly parameter-intensive. Although Parameter-Efficient Fine-Tuning (PEFT) methods have been introduced to reduce the number of tunable parameters, a significant performance gap remains compared to full fine-tuning. Furthermore, existing PEFT approaches are often highly parameterized, making them difficult to interpret and control. In light of this, we introduce Multimodal Representation Tuning (MRT), a novel approach that focuses on directly editing semantically rich multimodal representations to achieve strong performance and provide intuitive control over LMMs. Empirical results show that our method surpasses current state-of-the-art baselines with significant performance gains (e.g., 1580.40 MME score) while requiring substantially fewer tunable parameters (e.g., 0.03% parameters). Additionally, we conduct experiments on editing instrumental tokens within multimodal representations, demonstrating that direct manipulation of these representations enables simple yet effective control over network behavior.
Problem

Research questions and friction points this paper is trying to address.

Addresses parameter-intensive fine-tuning in large multimodal models.
Reduces performance gap in parameter-efficient fine-tuning methods.
Introduces intuitive control over multimodal representations for better interpretability.
Innovation

Methods, ideas, or system contributions that make the work stand out.

Multimodal Representation Tuning for efficient control
Direct editing of multimodal semantic representations
Fewer parameters with significant performance gains