OASIS: Online Sample Selection for Continual Visual Instruction Tuning

📅 2025-05-27
📈 Citations: 0
Influential: 0
🤖 AI Summary
In continual visual instruction tuning (CVIT), streaming multimodal data induces significant training latency, and existing reference-free online sample selection methods, constrained by fixed per-batch sampling budgets (e.g., top-k), fail to adapt to inter-batch variation in information content and to distribution shifts. Method: a dynamic inter-batch informativeness-aware selection mechanism combined with an iterative redundancy-aware score-update strategy; the approach uses relative information gain for adaptive sampling and gradient-sensitivity-based reweighting to perform reference-free online importance estimation. Contribution/Results: by removing the fixed-budget constraint, the method substantially improves selection robustness under distribution drift. Evaluated on mainstream multimodal large language models (MLLMs), including LLaVA-1.5 and Qwen-VL-2.5, it matches full-data performance using only 25% of the training samples and outperforms existing state-of-the-art methods across benchmarks.

📝 Abstract
In continual visual instruction tuning (CVIT) scenarios, where multi-modal data continuously arrive in an online streaming manner, training delays from large-scale data significantly hinder real-time adaptation. While existing data selection strategies reduce training overheads, they rely on pre-trained reference models, which are impractical in CVIT setups due to unknown future data. Recent reference model-free online sample selection methods address this issue but typically select a fixed number of samples per batch (e.g., top-k), causing them to suffer from distribution shifts where informativeness varies across batches. To address these limitations, we propose OASIS, an adaptive online sample selection approach for CVIT that: (1) dynamically adjusts selected samples per batch based on relative inter-batch informativeness, and (2) minimizes redundancy of selected samples through iterative selection score updates. Empirical results across various MLLMs, such as LLaVA-1.5 and Qwen-VL-2.5, show that OASIS achieves comparable performance to full-data training using only 25% of the data and outperforms the state-of-the-art.
Problem

Research questions and friction points this paper is trying to address.

Addresses training delays in continual visual instruction tuning
Overcomes reliance on pre-trained reference models
Adapts sample selection dynamically to batch informativeness
Innovation

Methods, ideas, or system contributions that make the work stand out.

Dynamic sample selection per batch
Minimizes redundancy via iterative updates
Adapts to inter-batch informativeness shifts
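
The paper's exact scoring function and update rule are not reproduced here, but the two ideas above (an adaptive per-batch budget driven by relative inter-batch informativeness, and iterative redundancy-aware score updates) can be illustrated with a minimal sketch. All names, the EMA-based threshold, and the cosine-similarity redundancy penalty are assumptions for illustration, not the authors' method:

```python
import numpy as np

def select_batch(scores, feats, running_mean, alpha=0.9, redundancy=0.5):
    """Illustrative sketch (not OASIS itself) of adaptive online selection.

    scores: (B,) per-sample informativeness (e.g., a loss-based proxy).
    feats:  (B, D) sample embeddings used to estimate redundancy.
    running_mean: EMA of past batch informativeness (inter-batch context).
    Returns the selected indices and the updated running mean.
    """
    scores = scores.astype(float).copy()
    batch_mean = float(scores.mean())
    unit = feats / (np.linalg.norm(feats, axis=1, keepdims=True) + 1e-8)
    selected = []
    # Adaptive budget: keep selecting while the best remaining score
    # beats the running inter-batch mean (no fixed top-k per batch).
    while True:
        i = int(np.argmax(scores))
        if not np.isfinite(scores[i]) or scores[i] <= running_mean:
            break
        selected.append(i)
        # Iterative redundancy-aware update: down-weight samples similar
        # to the one just selected (cosine similarity on embeddings).
        sims = np.clip(unit @ unit[i], 0.0, None)
        scores = scores - redundancy * sims * scores[i]
        scores[i] = -np.inf  # never reselect the same sample
    # Drift-aware reference for later batches.
    new_mean = alpha * running_mean + (1 - alpha) * batch_mean
    return selected, new_mean
```

In this sketch a batch of mostly redundant, high-scoring samples yields few selections, while a diverse informative batch yields many, which is the behavior the fixed top-k baselines in the paper lack.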