🤖 AI Summary
To address the context-length limitations and high inference costs of large multimodal models (LMMs) in many-shot in-context learning (ICL), this paper proposes a sensitivity-aware task vector insertion framework. The method integrates structural sensitivity analysis of activation differences with reinforcement learning: it first identifies sensitive insertion positions from consistent structural patterns in activation differences across query-context pairs, then constructs a pre-clustered activation vector bank for each position, and finally applies reinforcement learning to select the most suitable vector to insert—jointly resolving the "where to insert" and "what to insert" challenges. Evaluated on LMMs including Qwen-VL and Idefics-2, the framework achieves significant performance gains over existing task vector methods on multimodal benchmarks such as VizWiz and OK-VQA. It demonstrates strong generalization across diverse tasks and robust cross-architecture applicability, without requiring model fine-tuning or architectural modification.
📝 Abstract
Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored that insert compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to determine where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.
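The pipeline described in the abstract can be sketched in toy form. The following is an illustrative sketch only, not the paper's implementation: it uses random stand-in activations, a plain k-means for the pre-clustered bank, and a nearest-centroid heuristic in place of the learned RL selection policy; all shapes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-layer activations of a model run with and without
# many-shot context (hypothetical shapes: n_pairs query-context pairs,
# n_layers candidate insertion locations, hidden size d).
n_pairs, n_layers, d = 64, 8, 16
acts_with_ctx = rng.normal(size=(n_pairs, n_layers, d))
acts_without_ctx = rng.normal(size=(n_pairs, n_layers, d))
# Make two layers "sensitive": their deltas are large and consistent.
acts_with_ctx[:, 2] += 3.0
acts_with_ctx[:, 5] -= 2.0

# 1) "Where to insert": activation deltas across query-context pairs.
deltas = acts_with_ctx - acts_without_ctx              # (n_pairs, n_layers, d)
# Sensitivity score per layer: norm of the mean delta. Consistent structural
# patterns show up as large, stable average deltas.
sensitivity = np.linalg.norm(deltas.mean(axis=0), axis=-1)   # (n_layers,)
top_layers = np.argsort(sensitivity)[-2:]              # two most sensitive layers

# 2) "What to insert": a pre-clustered activation bank per sensitive layer.
def kmeans(x, k=3, iters=20):
    """Minimal k-means returning k centroids of x (shape (n, d))."""
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers

bank = {int(l): kmeans(deltas[:, l]) for l in top_layers}

# 3) At inference, pick the bank entry closest to the query's delta
# (a heuristic stand-in for the RL policy) and add it to the activation.
def select_and_insert(layer, query_act, query_delta):
    centroids = bank[layer]
    best = np.argmin(((centroids - query_delta) ** 2).sum(-1))
    return query_act + centroids[best]

layer = int(top_layers[-1])
patched = select_and_insert(layer, acts_without_ctx[0, layer], deltas[0, layer])
print(patched.shape)  # (16,)
```

In this toy setup the sensitivity score cleanly recovers the two layers whose activations were shifted, which mirrors the abstract's claim that consistent activation-delta patterns provide a reliable cue for choosing insertion locations.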