ROCKET-2: Steering Visuomotor Policy via Cross-View Goal Alignment

📅 2025-03-04
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the challenge of aligning goal instructions between embodied agents and human users, a challenge rooted in the inherent discrepancy between egocentric and allocentric viewpoints. We propose a goal specification method that is semantically clear, spatially aware, and intuitive for human users. Our core innovation is letting users specify goals directly as segmentation masks drawn in their own first-person view, eliminating the need for explicit viewpoint registration and achieving the first literal cross-viewpoint goal translation. To strengthen spatial reasoning and intent alignment, we introduce two novel losses: a cross-view consistency loss and a target visibility loss. Built on a behavior cloning framework, our approach integrates multi-view geometric constraints with auxiliary supervision and is trained in a Minecraft simulation environment. Experiments demonstrate state-of-the-art performance, 3–6× higher inference efficiency, and, critically, the first end-to-end closed-loop system mapping human first-person inputs to precise agent execution.

📝 Abstract
We aim to develop a goal specification method that is semantically clear, spatially sensitive, and intuitive for human users to guide agent interactions in embodied environments. Specifically, we propose a novel cross-view goal alignment framework that allows users to specify target objects using segmentation masks from their own camera views rather than from the agent's observations. We highlight that behavior cloning alone fails to align the agent's behavior with human intent when the human and agent camera views differ significantly. To address this, we introduce two auxiliary objectives, a cross-view consistency loss and a target visibility loss, which explicitly enhance the agent's spatial reasoning ability. Building on this, we develop ROCKET-2, a state-of-the-art agent trained in Minecraft that improves inference efficiency by 3x to 6x. We show that ROCKET-2 can, for the first time, directly interpret goals from human camera views, paving the way for better human-agent interaction.
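The abstract names the two auxiliary objectives but not their exact mathematical form. The sketch below shows one plausible shape for the combined training objective, assuming a cross-entropy behavior-cloning term over discrete actions, a cosine-similarity consistency term between goal embeddings computed from the human's and agent's views, and a binary cross-entropy term for predicting whether the target is visible in the agent's frame. All function names and the loss weights (`w_consistency`, `w_visibility`) are illustrative assumptions, not the paper's actual formulation.

```python
import math

def bc_loss(action_logits, expert_action):
    """Behavior-cloning term: cross-entropy between the policy's
    action distribution (given as logits) and the expert's action."""
    m = max(action_logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in action_logits))
    return log_z - action_logits[expert_action]

def cross_view_consistency_loss(agent_goal_emb, human_goal_emb):
    """Hypothetical consistency term: 1 - cosine similarity, pulling
    the goal embeddings from the two viewpoints toward each other
    (0 when they point in the same direction)."""
    dot = sum(a * b for a, b in zip(agent_goal_emb, human_goal_emb))
    na = math.sqrt(sum(a * a for a in agent_goal_emb))
    nb = math.sqrt(sum(b * b for b in human_goal_emb))
    return 1.0 - dot / (na * nb)

def target_visibility_loss(visible_prob, target_is_visible):
    """Hypothetical visibility term: binary cross-entropy on predicting
    whether the goal object is visible in the agent's current frame."""
    p = min(max(visible_prob, 1e-7), 1.0 - 1e-7)
    y = 1.0 if target_is_visible else 0.0
    return -(y * math.log(p) + (1.0 - y) * math.log(1.0 - p))

def total_loss(action_logits, expert_action,
               agent_goal_emb, human_goal_emb,
               visible_prob, target_is_visible,
               w_consistency=1.0, w_visibility=1.0):
    """Combined objective: behavior cloning plus the two weighted
    auxiliary terms (weights are illustrative)."""
    return (bc_loss(action_logits, expert_action)
            + w_consistency * cross_view_consistency_loss(agent_goal_emb,
                                                          human_goal_emb)
            + w_visibility * target_visibility_loss(visible_prob,
                                                    target_is_visible))
```

In a real training loop these terms would be computed on tensors with automatic differentiation; the pure-Python version above only illustrates how auxiliary supervision could be folded into a single scalar loss alongside the behavior-cloning term.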
Problem

Research questions and friction points this paper is trying to address.

Develop goal specification method for human-agent interaction
Align human and agent camera views for accurate goal interpretation
Enhance agent spatial reasoning with auxiliary training objectives
Innovation

Methods, ideas, or system contributions that make the work stand out.

Cross-view goal alignment framework
Auxiliary objectives enhance spatial reasoning
ROCKET-2 interprets goals from human views