AI Summary
This work addresses a limitation of existing video-based multimodal large language models (MLLMs): they passively observe long-horizon robotic tasks without the ability to supervise task progress toward a goal. To overcome this, the authors propose the PRIMO R1 framework, which for the first time integrates reinforcement learning into video MLLMs. By introducing an outcome-oriented reward mechanism, PRIMO R1 elicits explicit reasoning chains, and by anchoring a structured temporal input on the initial and current states, it effectively transforms the model into an active "Critic" capable of evaluating task progress. Evaluated on the PRIMO benchmark using a 7B-parameter model and a dedicated dataset, the method halves the mean absolute error of specialized reasoning baselines and achieves 67.0% accuracy on the RoboFail failure detection task, substantially outperforming current models, including OpenAI o1.
Abstract
Accurate process supervision remains a critical challenge for long-horizon robotic manipulation. A primary bottleneck is that current video MLLMs, trained primarily under a Supervised Fine-Tuning (SFT) paradigm, function as passive "Observers" that recognize ongoing events rather than evaluating the current state relative to the final task goal. In this paper, we introduce PRIMO R1 (Process Reasoning Induced Monitoring), a 7B framework that transforms video MLLMs into active "Critics". We leverage outcome-based Reinforcement Learning to incentivize explicit Chain-of-Thought generation for progress estimation. Furthermore, our architecture constructs a structured temporal input by explicitly anchoring the video sequence between initial- and current-state images. Supported by the proposed PRIMO Dataset and Benchmark, extensive experiments across diverse in-domain environments and out-of-domain real-world humanoid scenarios demonstrate that PRIMO R1 achieves state-of-the-art performance. Quantitatively, our 7B model achieves a 50% reduction in mean absolute error relative to specialized reasoning baselines, and delivers significant relative accuracy improvements over 72B-scale general MLLMs. Furthermore, PRIMO R1 exhibits strong zero-shot generalization on difficult failure detection tasks: it establishes state-of-the-art performance on the RoboFail benchmark with 67.0% accuracy, surpassing closed-source models such as OpenAI o1 by 6.0%.
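The "structured temporal input" described above can be illustrated with a minimal sketch: the sampled video frames are explicitly bracketed between an initial-state image and a current-state image before being fed to the video MLLM. All names and the sampling scheme here are illustrative assumptions, not the paper's actual API.

```python
# Hypothetical sketch of the structured temporal input: uniformly sample the
# clip, then anchor the samples between the initial and current state images.
# Function and variable names are assumptions for illustration only.

def build_structured_input(initial_frame, video_frames, current_frame, num_samples=8):
    """Sample the clip and anchor it between initial and current states."""
    if len(video_frames) <= num_samples:
        sampled = list(video_frames)
    else:
        step = len(video_frames) / num_samples
        sampled = [video_frames[int(i * step)] for i in range(num_samples)]
    # The two anchors make the start state and the state-to-evaluate explicit,
    # so the model can judge progress toward the goal rather than merely
    # recognizing the ongoing event.
    return [initial_frame] + sampled + [current_frame]

seq = build_structured_input("frame_init", [f"f{i}" for i in range(32)], "frame_now")
```

The design point is simply that the anchors give the critic a fixed reference (where the task started) and a query (where it is now), which the intermediate frames contextualize.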