AVATAR: Reinforcement Learning to See, Hear, and Reason Over Video

📅 2025-08-05
📈 Citations: 0
✨ Influential: 0
🤖 AI Summary
This work addresses three key challenges in applying Group Relative Policy Optimization (GRPO) to multimodal long-video reasoning: (1) low data efficiency caused by its on-policy design, (2) vanishing advantage estimates due to homogeneous intra-group rewards, and (3) ineffective uniform credit assignment that fails to identify critical reasoning steps. To tackle these, the authors propose AVATAR, a framework with two core components: (i) an off-policy training architecture that improves sample efficiency and mitigates advantage collapse by reusing past experiences with greater reward diversity, and (ii) Temporal Advantage Shaping (TAS), a dynamic credit-allocation mechanism that prioritizes temporally salient reasoning steps. Evaluated on MMVU, OmniBench, and Video-Holmes, AVATAR outperforms the Qwen2.5-Omni baseline by +5.4, +4.9, and +4.5 points, respectively, while improving sample efficiency by over 35%. The framework significantly advances spatiotemporal understanding and decision-making capabilities for long-video multimodal reasoning.

πŸ“ Abstract
Multimodal reasoning over long-horizon video is challenging due to the need for precise spatiotemporal fusion and alignment across modalities. While recent methods such as Group Relative Policy Optimization (GRPO) have shown promise in this domain, they suffer from three key limitations: (1) data inefficiency from their on-policy design, (2) a vanishing advantage problem, where identical or near-identical rewards within a group eliminate the learning signal by producing zero-valued advantages, and (3) uniform credit assignment that fails to emphasize critical reasoning steps. We introduce AVATAR (Audio-Video Agent for Alignment and Reasoning), a framework that addresses these limitations through two core components: (1) an off-policy training architecture that improves sample efficiency and resolves vanishing advantages by reusing past experiences with greater reward diversity, and (2) Temporal Advantage Shaping (TAS), a novel credit assignment strategy that upweights key reasoning phases during learning. AVATAR achieves strong performance across various benchmarks, outperforming the Qwen2.5-Omni baseline by +5.4 on MMVU, +4.9 on OmniBench, and +4.5 on Video-Holmes, while demonstrating over 35% higher sample efficiency.
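The vanishing-advantage failure mode the abstract describes is easy to see in a minimal sketch of GRPO-style group-relative normalization (the function name and epsilon are illustrative, not from the paper):

```python
import statistics

def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each rollout's reward against its group's mean and
    standard deviation, as in GRPO-style group-relative baselines."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]

# Diverse rewards within a group yield a nonzero learning signal.
diverse = group_relative_advantages([1.0, 0.0, 1.0, 0.0])

# Identical rewards collapse every advantage to zero: the
# "vanishing advantage" problem the abstract describes.
uniform = group_relative_advantages([1.0, 1.0, 1.0, 1.0])
print(uniform)  # [0.0, 0.0, 0.0, 0.0]
```

When every rollout in a group earns the same reward, each numerator is zero regardless of the epsilon, so no gradient signal survives; reusing past experiences with more varied rewards is one way to avoid this degenerate case.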
Problem

Research questions and friction points this paper is trying to address.

Improving data efficiency in multimodal video reasoning
Addressing vanishing advantage in reinforcement learning
Enhancing credit assignment for critical reasoning steps
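The first two friction points above can be pictured with a minimal off-policy sketch: mixing replayed rollouts into each training group both reuses data and makes all-identical rewards less likely. The buffer design, names, and group size below are assumptions, not the paper's implementation:

```python
import random
from collections import deque

class RolloutReplayBuffer:
    """Store past (rollout, reward) pairs so training groups can mix
    fresh and replayed samples, increasing intra-group reward diversity."""

    def __init__(self, capacity=1024):
        self.buffer = deque(maxlen=capacity)

    def add(self, rollout, reward):
        self.buffer.append((rollout, reward))

    def build_group(self, fresh, group_size=8):
        """Top up a group of fresh rollouts with replayed ones."""
        group = list(fresh)
        need = group_size - len(group)
        if need > 0 and self.buffer:
            group += random.sample(list(self.buffer),
                                   min(need, len(self.buffer)))
        return group
```

A replayed sample was generated by an earlier policy, so in practice an off-policy correction (e.g. importance weighting) would accompany this reuse; the sketch only shows the group-construction step.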
Innovation

Methods, ideas, or system contributions that make the work stand out.

Off-policy training for better sample efficiency
Temporal Advantage Shaping for credit assignment
Multimodal fusion for video reasoning
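Temporal Advantage Shaping can be pictured as a per-step reweighting of advantages by a saliency profile, so critical reasoning steps receive more credit. This is a hypothetical sketch; the weighting scheme and normalization are assumptions, not the paper's formula:

```python
def temporal_advantage_shaping(step_advantages, saliency, base=1.0):
    """Reweight per-step advantages so temporally salient reasoning
    steps receive more credit. `saliency` is assumed in [0, 1]."""
    assert len(step_advantages) == len(saliency)
    return [a * (base + s) for a, s in zip(step_advantages, saliency)]

# Under this illustrative scheme, a fully salient step (s=1.0)
# earns twice the credit of a non-salient one (s=0.0).
shaped = temporal_advantage_shaping([0.5, 0.5, 0.5], [0.0, 1.0, 0.2])
print(shaped)  # [0.5, 1.0, 0.6]
```

The contrast with uniform credit assignment is the point: instead of every step sharing the sequence-level advantage equally, the shaped version concentrates the learning signal on the steps a saliency signal marks as decisive.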