Metis-SPECS: Decoupling Multimodal Learning via Self-distilled Preference-based Cold Start

📅 2025-10-28
📈 Citations: 0
Influential: 0
🤖 AI Summary
Vision-language models (VLMs) employed in reinforcement learning (RL) cold-start settings typically rely on supervised fine-tuning (SFT), which induces instruction overfitting and degrades out-of-distribution generalization and subsequent RL performance. To address this, we propose a decoupled multimodal learning framework: (1) high-quality preference pairs are generated via self-distillation; (2) preference learning is performed using the Direct Preference Optimization (DPO) algorithm, explicitly disentangling surface-level output formatting from deep reasoning logic; and (3) a verifiable reward function is introduced to enhance RL training stability, alongside a Generalization Factor (GF) metric for quantifying generalization capability. Evaluated on MEGA-Bench and MathVista, our method achieves absolute improvements of 4.1% and 12.2%, respectively—substantially outperforming strong baselines—while demonstrating superior exploratory behavior and training robustness.
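The preference-learning step (2) uses the standard DPO pairwise objective. As a minimal illustrative sketch (pure Python, not the paper's implementation), the per-pair loss compares the policy's and a frozen reference model's log-probabilities for the chosen and rejected responses:

```python
import math

def dpo_loss(logp_pol_w, logp_pol_l, logp_ref_w, logp_ref_l, beta=0.1):
    """DPO loss for one preference pair.

    logp_*_w / logp_*_l: summed log-probabilities of the chosen (w) and
    rejected (l) responses under the policy and the frozen reference model.
    """
    margin = beta * ((logp_pol_w - logp_ref_w) - (logp_pol_l - logp_ref_l))
    # -log(sigmoid(margin)): shrinks as the policy prefers the chosen response
    return -math.log(1.0 / (1.0 + math.exp(-margin)))
```

With identical log-probabilities the margin is zero and the loss is `log 2`; as the policy's relative preference for the chosen response grows, the loss falls toward zero.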

📝 Abstract
Reinforcement learning (RL) with verifiable rewards has recently catalyzed a wave of "MLLM-r1" approaches that bring RL to vision language models. Most representative paradigms begin with a cold start, typically employing supervised fine-tuning (SFT), to initialize the policy before RL. However, SFT-based cold start adopts a reasoning paradigm in which task solution and output format are intertwined, which may induce instruction-style overfitting, weaken out-of-distribution generalization, and ultimately harm downstream RL. We revisit the cold start from two perspectives, training method and data construction, and introduce the Generalization Factor (GF) coefficient to quantify generalization capability under different methods. Our empirical study finds that preference-based training methods (e.g. DPO) generalize better than SFT-based methods during cold start. Motivated by this, we propose SPECS, a Self-distilled, Preference-based Cold Start framework that decouples multimodal learning: (1) it generates introspective preference data pairs via self-distillation, avoiding reliance on larger teacher models or manual annotation; (2) it performs preference-based training that focuses on shallow, transferable surface-form criteria (format, structure, style) rather than memorizing content; and (3) it hands off deep reasoning to RL with verifiable rewards. Experimental results across multiple multimodal benchmarks show that our decoupled learning framework yields consistent performance gains over strong baselines, improving MEGA-Bench by 4.1% and MathVista by 12.2%. Additional experiments indicate that SPECS contributes to reducing in-distribution "stuckness," improving exploration, stabilizing training, and raising the performance ceiling.
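The "verifiable rewards" handed off to in step (3) are typically binary checks of a model's final answer against ground truth. A toy sketch, assuming an `Answer:` marker as a hypothetical answer-extraction convention (not necessarily the paper's format):

```python
import re

def verifiable_reward(response: str, gold_answer: str) -> float:
    """Binary verifiable reward: 1.0 if the response's final answer
    matches the gold answer exactly, else 0.0.

    Assumes the answer follows an 'Answer:' marker -- a hypothetical
    convention for illustration only.
    """
    m = re.search(r"Answer:\s*(.+)", response)
    if not m:
        return 0.0  # no parseable answer: no reward
    pred = m.group(1).strip().rstrip(".")
    return 1.0 if pred == gold_answer.strip() else 0.0
```

Because the reward is programmatic and deterministic, it avoids the noise of learned reward models, which is one reason verifiable rewards are associated with more stable RL training.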
Problem

Research questions and friction points this paper is trying to address.

Addressing instruction-style overfitting in the multimodal cold-start phase
Improving generalization with preference-based training instead of SFT
Decoupling surface-form learning from deep reasoning via self-distillation
Innovation

Methods, ideas, or system contributions that make the work stand out.

Self-distilled preference data generation without external teachers
Preference-based training focusing on surface-form criteria
Decoupling multimodal learning before reinforcement learning phase
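The self-distillation idea above can be sketched as ranking the model's own sampled responses by a surface-form score and pairing best against worst; `toy_format_score` below is a hypothetical stand-in for the paper's surface-form criteria (format, structure, style):

```python
def build_preference_pair(samples, format_score):
    """Build one (chosen, rejected) pair from the model's OWN samples.

    samples: candidate responses sampled from the model itself
    (self-distillation: no larger teacher model involved).
    format_score: scores surface form only, not answer content.
    """
    ranked = sorted(samples, key=format_score, reverse=True)
    return ranked[0], ranked[-1]  # (chosen, rejected)

def toy_format_score(text):
    # Hypothetical criterion: reward explicit think/answer structure.
    return ("<think>" in text) + ("Answer:" in text)
```

Because the score inspects only structure, the resulting DPO training teaches transferable output conventions rather than memorizing task content, which is the decoupling the bullets describe.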