VideoP2R: Video Understanding from Perception to Reasoning

📅 2025-11-14
📈 Citations: 0
Influential: 0
🤖 AI Summary
To address the limited reasoning capability of large video-language models (LVLMs), this paper proposes VideoP2R, a process-aware reinforcement fine-tuning (RFT) framework that models perception and reasoning as distinct processes. Methodologically: (1) in the supervised fine-tuning (SFT) stage, a three-step pipeline generates VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought dataset that explicitly separates video perception from logical reasoning; (2) in the reinforcement learning (RL) stage, a novel process-aware group relative policy optimization (PA-GRPO) algorithm supplies separate rewards for perception and reasoning. Experiments demonstrate state-of-the-art performance on six of seven mainstream video reasoning and understanding benchmarks. Ablation studies confirm the effectiveness of the process-aware modeling and PA-GRPO, and show that the model's perception output is information-sufficient for downstream reasoning.

📝 Abstract
Reinforcement fine-tuning (RFT), a two-stage framework consisting of supervised fine-tuning (SFT) and reinforcement learning (RL), has shown promising results in improving the reasoning ability of large language models (LLMs). Yet extending RFT to large video language models (LVLMs) remains challenging. We propose VideoP2R, a novel process-aware video RFT framework that enhances video reasoning by modeling perception and reasoning as distinct processes. In the SFT stage, we develop a three-step pipeline to generate VideoP2R-CoT-162K, a high-quality, process-aware chain-of-thought (CoT) dataset for perception and reasoning. In the RL stage, we introduce a novel process-aware group relative policy optimization (PA-GRPO) algorithm that supplies separate rewards for perception and reasoning. Extensive experiments show that VideoP2R achieves state-of-the-art (SotA) performance on six out of seven video reasoning and understanding benchmarks. Ablation studies further confirm the effectiveness of our process-aware modeling and PA-GRPO and demonstrate that the model's perception output is information-sufficient for downstream reasoning.
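The abstract does not spell out PA-GRPO's update rule, but it describes a GRPO-style scheme with separate reward channels for perception and reasoning. A minimal sketch of how group-relative advantages could be computed per channel (function names are hypothetical, not from the paper):

```python
from statistics import mean, pstdev

def group_relative_advantages(rewards, eps=1e-6):
    """GRPO-style normalization: each rollout's advantage is its reward
    relative to the mean of its sampled group, scaled by the group std."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]

def pa_grpo_advantages(perception_rewards, reasoning_rewards):
    """Process-aware variant (sketch): normalize each reward channel
    within its own group, so the perception and reasoning segments of
    a response receive separate advantage signals."""
    adv_p = group_relative_advantages(perception_rewards)
    adv_r = group_relative_advantages(reasoning_rewards)
    return adv_p, adv_r

# Example: a group of 4 rollouts, each scored separately on its
# perception output and its reasoning output.
adv_p, adv_r = pa_grpo_advantages(
    perception_rewards=[1.0, 0.0, 1.0, 0.0],
    reasoning_rewards=[0.5, 0.0, 1.0, 0.5],
)
```

The key property this sketch illustrates is that a rollout with strong perception but weak reasoning gets a positive perception advantage and a non-positive reasoning advantage, rather than one blended scalar.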
Problem

Research questions and friction points this paper is trying to address.

Extends reinforcement fine-tuning to video language models
Models perception and reasoning as distinct processes
Enhances video reasoning through process-aware optimization
Innovation

Methods, ideas, or system contributions that make the work stand out.

Two-stage reinforcement fine-tuning for video models
Process-aware chain-of-thought dataset generation
Separate rewards for perception and reasoning processes
Yifan Jiang (USC)
Yueying Wang (Amazon)
Rui Zhao (Amazon)
T. Parag (Amazon)
Zhimin Chen (Amazon)
Zhenyu Liao (Amazon)
Jayakrishnan Unnikrishnan (Amazon)
Artificial Intelligence, Statistical Inference, Machine Learning, Robotics, Signal Processing