🤖 AI Summary
Multimodal reasoning over long video and audio faces an inherent trade-off between spatial resolution and temporal coverage: dense temporal modeling demands low-resolution frames for efficiency, whereas pixel-level localization requires high-resolution inputs. To address this, we propose a dual-system collaborative architecture: (i) a global reasoning system that uses reinforcement learning (RL) to dynamically select keyframes and reformulate the task, and (ii) a detail-understanding system that performs precise spatiotemporal grounding on the selected high-resolution segments. We introduce the first end-to-end RL framework for large-scale multimodal reasoning, built on Group Relative Policy Optimization (GRPO) with a hierarchical collaborative RL training paradigm, enabling full end-to-end optimization after only a single epoch of RL on small task subsets. Our method achieves significant improvements over strong supervised baselines and dedicated state-of-the-art methods on the RefAVS and REVOS benchmarks, improves cross-domain generalization, and mitigates multimodal hallucination, establishing a scalable inference paradigm for general-purpose multimodal foundation models.
📝 Abstract
Long-horizon video-audio reasoning and fine-grained pixel understanding impose conflicting requirements on omnimodal models: dense temporal coverage demands many low-resolution frames, whereas precise grounding calls for high-resolution inputs. We tackle this trade-off with a two-system architecture: a Global Reasoning System selects informative keyframes and rewrites the task at low spatial cost, while a Detail Understanding System performs pixel-level grounding on the selected high-resolution snippets. Because "optimal" keyframe selection and reformulation are ambiguous and hard to supervise, we formulate them as a reinforcement learning (RL) problem and present Omni-R1, an end-to-end RL framework built on Group Relative Policy Optimization. Omni-R1 trains the Global Reasoning System through hierarchical rewards obtained via online collaboration with the Detail Understanding System, requiring only one epoch of RL on small task splits. Experiments on two challenging benchmarks, Referring Audio-Visual Segmentation (RefAVS) and Reasoning Video Object Segmentation (REVOS), show that Omni-R1 not only surpasses strong supervised baselines but also outperforms specialized state-of-the-art models, while substantially improving out-of-domain generalization and mitigating multimodal hallucination. Our results demonstrate the first successful application of RL to large-scale omnimodal reasoning and highlight a scalable path toward universal foundation models.
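The abstract names Group Relative Policy Optimization (GRPO) as the RL backbone. The defining step of GRPO is critic-free advantage estimation: each query is answered by a group of sampled rollouts, and each rollout's reward is normalized against its own group. A minimal sketch of that step (illustrative only, not the paper's implementation; the reward values below are hypothetical stand-ins for the hierarchical rewards produced by the Detail Understanding System):

```python
import statistics

def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: A_i = (r_i - mean(r)) / (std(r) + eps).

    GRPO replaces a learned value critic with this per-group
    normalization, so rollouts are scored only relative to their
    sibling rollouts for the same query.
    """
    mu = statistics.fmean(rewards)
    sigma = statistics.pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]

# Four rollouts (e.g., four candidate keyframe selections) for one query,
# each scored by a downstream reward such as segmentation quality.
advs = grpo_advantages([0.2, 0.5, 0.9, 0.4])
```

Rollouts scoring above the group mean receive positive advantages and are reinforced; those below are suppressed, which is what lets a single reward scalar per rollout drive end-to-end training without a critic network.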