🤖 AI Summary
To address the context-length limitations and high inference costs of large multimodal models (LMMs) in many-shot in-context learning (ICL), this paper proposes a sensitivity-aware task vector insertion framework. The method integrates structural sensitivity analysis of activation differences with reinforcement learning: it first identifies sensitive insertion positions from consistent structural patterns in activation differences across query-context pairs, then constructs a pre-clustered activation vector bank for each position, and finally applies reinforcement learning to select the most suitable vector to insert—jointly resolving the "where to insert" and "what to insert" challenges. Evaluated on LMMs including Qwen-VL and Idefics-2, the framework achieves significant performance gains over existing task vector methods on multimodal benchmarks such as VizWiz and OK-VQA. It demonstrates strong generalization across diverse tasks and robust cross-architecture applicability, without requiring model fine-tuning or architectural modification.
📝 Abstract
Large Multimodal Models (LMMs) have shown promising in-context learning (ICL) capabilities, but scaling to many-shot settings remains difficult due to limited context length and high inference cost. To address these challenges, task-vector-based methods have been explored that insert compact representations of many-shot in-context demonstrations into model activations. However, existing task-vector-based methods either overlook the importance of where to insert task vectors or struggle to determine suitable values for each location. To this end, we propose a novel Sensitivity-aware Task Vector insertion framework (STV) to determine where and what to insert. Our key insight is that activation deltas across query-context pairs exhibit consistent structural patterns, providing a reliable cue for insertion. Based on the identified sensitivity-aware locations, we construct a pre-clustered activation bank for each location by clustering the activation values, and then apply reinforcement learning to choose the most suitable one to insert. We evaluate STV across a range of multimodal models (e.g., Qwen-VL, Idefics-2) and tasks (e.g., VizWiz, OK-VQA), demonstrating its effectiveness and showing consistent improvements over previous task-vector-based methods with strong generalization.
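The pipeline described in the abstract can be sketched in toy form. The following is an illustrative sketch only, not the paper's implementation: it uses random stand-in activations, a plain k-means for the pre-clustered bank, and a nearest-centroid heuristic in place of the learned RL selection policy; all shapes and names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for per-layer activations of a model run with and without
# many-shot context (hypothetical shapes: n_pairs query-context pairs,
# n_layers candidate insertion locations, hidden size d).
n_pairs, n_layers, d = 64, 8, 16
acts_with_ctx = rng.normal(size=(n_pairs, n_layers, d))
acts_without_ctx = rng.normal(size=(n_pairs, n_layers, d))
# Make two layers "sensitive": their deltas are large and consistent.
acts_with_ctx[:, 2] += 3.0
acts_with_ctx[:, 5] -= 2.0

# 1) "Where to insert": activation deltas across query-context pairs.
deltas = acts_with_ctx - acts_without_ctx              # (n_pairs, n_layers, d)
# Sensitivity score per layer: norm of the mean delta. Consistent structural
# patterns show up as large, stable average deltas.
sensitivity = np.linalg.norm(deltas.mean(axis=0), axis=-1)   # (n_layers,)
top_layers = np.argsort(sensitivity)[-2:]              # two most sensitive layers

# 2) "What to insert": a pre-clustered activation bank per sensitive layer.
def kmeans(x, k=3, iters=20):
    """Minimal k-means returning k centroids of x (shape (n, d))."""
    centers = x[rng.choice(len(x), k, replace=False)]
    for _ in range(iters):
        labels = np.argmin(((x[:, None] - centers[None]) ** 2).sum(-1), axis=1)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = x[labels == j].mean(axis=0)
    return centers

bank = {int(l): kmeans(deltas[:, l]) for l in top_layers}

# 3) At inference, pick the bank entry closest to the query's delta
# (a heuristic stand-in for the RL policy) and add it to the activation.
def select_and_insert(layer, query_act, query_delta):
    centroids = bank[layer]
    best = np.argmin(((centroids - query_delta) ** 2).sum(-1))
    return query_act + centroids[best]

layer = int(top_layers[-1])
patched = select_and_insert(layer, acts_without_ctx[0, layer], deltas[0, layer])
print(patched.shape)  # (16,)
```

In this toy setup the sensitivity score cleanly recovers the two layers whose activations were shifted, which mirrors the abstract's claim that consistent activation-delta patterns provide a reliable cue for choosing insertion locations.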