ViCuR: Visual Cues as Recoverable Privilege for Multimodal On-Policy Distillation

📅 2026-06-04

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses a train-test mismatch in multimodal on-policy distillation: a privileged teacher conditions on reference answers that the student never sees at inference, which encourages the student to imitate shortcuts rather than learn visually grounded reasoning. The authors propose ViCuR, a framework that replaces answer-side privilege with visual cues, that is, query-related evidence drawn from the same visual input available at inference, so the student can recover the privileged evidence itself. A lightweight cue recovery module uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance, respectively. In the stronger-teacher setting it surpasses OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale.
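To make the sink-token idea concrete, here is a minimal sketch of cross-attention pooling in which a small set of sink queries attends over visual token features and returns one aggregated vector per sink. It is an illustration under stated assumptions, not the paper's module: the single head, the absence of learned projections, and the names `sink_cross_attention`, `sink_queries`, and `visual_tokens` are all hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def sink_cross_attention(sink_queries, visual_tokens):
    """Each sink query attends over all visual tokens.

    Assumption: single head, no key/value projections, scaled dot-product
    scores. Returns (pooled vectors, attention weights), one entry per sink.
    """
    dim = len(visual_tokens[0])
    scale = 1.0 / math.sqrt(dim)
    pooled, weights = [], []
    for q in sink_queries:
        attn = softmax([dot(q, v) * scale for v in visual_tokens])
        # Weighted sum of visual token features for this sink.
        pooled.append(
            [sum(w * v[i] for w, v in zip(attn, visual_tokens)) for i in range(dim)]
        )
        weights.append(attn)
    return pooled, weights


# Toy input: three 2-d "visual tokens" and one sink query that prefers
# tokens with a large first coordinate.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
sink_queries = [[4.0, 0.0]]
pooled, weights = sink_cross_attention(sink_queries, visual_tokens)
```

In this toy setup the sink places equal weight on the first and third tokens and less on the second, so the pooled vector summarizes the evidence the query is looking for. The number of sinks stays fixed regardless of how many visual tokens the image produces.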

📝 Abstract

On-policy distillation (OPD) improves reasoning by training a student on trajectories sampled from its own policy under supervision from a teacher. In multimodal reasoning, a common extension is to use a privileged teacher that observes training-time-only signals such as reference answers or rationales. However, such answer-side privilege creates a train-test mismatch: the teacher's supervision may depend on signals unavailable to the student, encouraging shortcut imitation rather than visually grounded reasoning. We propose ViCuR, a visually grounded privileged-teacher distillation framework that replaces answer-side privilege with visual cues (query-related evidence in the input). Because these cues are derived from the same visual input available at inference, their evidence is recoverable by the student. To support this, ViCuR introduces a lightweight cue recovery module that uses dedicated sink-token cross-attention during prefill to aggregate task-relevant visual evidence into an internal representation, without changing the inference interface or requiring auxiliary cue-generation losses. Across seven benchmarks with Qwen3-VL-2B and 8B students, ViCuR consistently improves over answer-based on-policy self-distillation by +1.19 and +1.24 on overall average performance. It also extends naturally to stronger-teacher OPD, surpassing OPD baselines by +0.64 and +1.08, with consistent out-of-domain gains at the 8B scale. These results show that, in multimodal on-policy distillation, the design of teacher privilege is as important as teacher strength.
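The on-policy setup described above can be sketched with toy categorical policies: the student samples its own trajectory, and at every step the teacher's next-token distribution supervises the student's. The per-token reverse KL used here is an assumption (a common choice for OPD, not confirmed by this abstract), and `student_step`, `teacher_step`, and `opd_loss` are hypothetical names.

```python
import math
import random


def sample_token(probs, rng):
    """Draw a token index from a categorical distribution."""
    r = rng.random()
    cum = 0.0
    for tok, p in enumerate(probs):
        cum += p
        if r < cum:
            return tok
    return len(probs) - 1


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher); assumes teacher_probs are all nonzero."""
    return sum(
        s * math.log(s / t) for s, t in zip(student_probs, teacher_probs) if s > 0
    )


def opd_loss(student_step, teacher_step, num_steps, rng):
    """Average per-token loss along a trajectory sampled from the student.

    Both policies score the same student-generated prefix; in a privileged
    setup the teacher would additionally condition on extra context.
    """
    prefix, total = [], 0.0
    for _ in range(num_steps):
        s = student_step(prefix)
        t = teacher_step(prefix)
        total += reverse_kl(s, t)
        prefix.append(sample_token(s, rng))  # on-policy: student picks the token
    return total / num_steps, prefix


# Toy policies over a 3-token vocabulary (prefix-independent for simplicity).
student = lambda prefix: [0.5, 0.3, 0.2]
teacher = lambda prefix: [0.7, 0.2, 0.1]

rng = random.Random(0)
loss, trajectory = opd_loss(student, teacher, num_steps=4, rng=rng)
same_loss, _ = opd_loss(student, student, num_steps=4, rng=random.Random(0))
```

Because the teacher only ever scores prefixes the student actually produced, what the teacher conditions on matters: if its distribution depends on a signal the student cannot access, minimizing this loss pushes the student toward outputs it cannot justify from its own input.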

Problem

Research questions and friction points this paper is trying to address.

on-policy distillation

multimodal reasoning

train-test mismatch

privileged teacher

visual grounding

Innovation

Methods, ideas, or system contributions that make the work stand out.

on-policy distillation

visual cues

privileged teacher