Decomposed On-Policy Distillation for Vision-Language Reasoning: Steering Gradients for Visual Grounding

📅 2026-05-30

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing on-policy distillation methods for vision-language models struggle to balance linguistic priors against visual grounding, which leads to suboptimal training. This work reports that the gradients of the language-prior and visual-grounding components of the distillation loss are approximately orthogonal, and uses that observation to split the loss into those two components. Building on the decomposition, the authors propose Visual Gradient Steering (VGS), which dynamically redirects the update toward the visual subspace. The approach avoids the implicit trade-off that standard monolithic distillation makes between the two objectives. Across multiple distillation settings and multimodal benchmarks, VGS is reported to significantly outperform standard on-policy distillation, with better visual grounding and minimal additional training overhead.

📝 Abstract

While on-policy distillation offers dense supervision for training small reasoning models, its optimization dynamics in the multimodal domain remain under-explored. In this work, we challenge the standard monolithic view of Vision-Language Model (VLM) distillation by mathematically decomposing the loss into two distinct components: the language prior and visual grounding. Our analysis uncovers that gradient vectors for these components are nearly orthogonal, indicating that the objective of aligning with the teacher's language distribution is geometrically independent from the objective of matching its visual perception. Consequently, standard optimization passively follows a suboptimal compromise trajectory that implicitly balances the two objectives. Hypothesizing that visual grounding constitutes the primary bottleneck for vision-language reasoning, we introduce Visual Gradient Steering (VGS), a method that dynamically reorients the update vector to prioritize the visual subspace. Experimental results on multiple distillation settings and complex multimodal benchmarks demonstrate that VGS significantly outperforms the standard monolithic formulation of on-policy distillation, achieving superior grounding with minimal training overhead.
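
The abstract does not spell out the steering rule, so the following is only a toy sketch of the geometric idea, not the authors' method. It assumes two made-up gradient vectors (`g_lang`, `g_vis`), a hypothetical `weight` hyperparameter that upweights the visual component, and a rescaling step that keeps the step size equal to that of the plain summed gradient. All names and values here are illustrative assumptions.

```python
import math


def dot(a, b):
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def norm(a):
    """Euclidean norm."""
    return math.sqrt(dot(a, a))


def cosine(a, b):
    """Cosine similarity; assumes both vectors are nonzero."""
    return dot(a, b) / (norm(a) * norm(b))


def steer(g_lang, g_vis, weight=2.0):
    """Hypothetical steering rule (an assumption, not taken from the paper):
    upweight the visual gradient, then rescale the result so its norm
    matches the norm of the plain sum g_lang + g_vis."""
    plain = [l + v for l, v in zip(g_lang, g_vis)]
    mixed = [l + weight * v for l, v in zip(g_lang, g_vis)]
    mixed_norm = norm(mixed)
    if mixed_norm == 0.0:
        # Degenerate case: nothing to rescale, fall back to the plain sum.
        return plain
    scale = norm(plain) / mixed_norm
    return [scale * m for m in mixed]


# Made-up, nearly orthogonal gradients for the two loss components.
g_lang = [1.0, 0.0, 0.1]
g_vis = [0.0, 1.0, 0.1]

plain_update = [l + v for l, v in zip(g_lang, g_vis)]
steered_update = steer(g_lang, g_vis, weight=2.0)

print("cos(g_lang, g_vis):      ", cosine(g_lang, g_vis))
print("cos(plain update, g_vis):", cosine(plain_update, g_vis))
print("cos(steered, g_vis):     ", cosine(steered_update, g_vis))
```

With nearly orthogonal components, the plain sum points roughly midway between the two objectives, which is the "compromise trajectory" the abstract describes. Upweighting the visual term rotates the update toward the visual gradient while the rescaling leaves the step length unchanged. A real implementation would operate on per-parameter gradient tensors and might choose the weight adaptively.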

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

vision-language reasoning

visual grounding

gradient orthogonality

multimodal optimization

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation

visual grounding

gradient steering