Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models

📅 2026-06-09

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the challenge of building a unified embodied foundation model with general physical intelligence, one that can execute long-horizon real-world tasks and correct its own errors along the way. The authors propose Embodied-R1.5, which combines embodied cognition, task planning, correction, and pointing in a single model, and introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that lets that model execute and self-correct autonomously. The 8B-parameter model is trained with a multi-task balanced reinforcement learning recipe designed to alleviate conflicts between heterogeneous tasks, on a data system of over 15B tokens built with three automated data construction pipelines. It achieves state-of-the-art results on 16 of 24 embodied VLM benchmarks, can be fine-tuned into a VLA with only a small amount of data, and is evaluated zero-shot on real robots for instruction following, affordance grounding, articulated object manipulation, and long-horizon tasks. Model weights, datasets, training code, and the EmbodiedEvalKit evaluation framework are open-sourced.

📝 Abstract

We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $\pi_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.
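
The Planner-Grounder-Corrector loop described in the abstract can be pictured roughly as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the abstract says a single model fills all three roles, whereas here they are passed in as separate callables for readability, and every name (`run_pgc_loop`, `plan`, `ground`, `execute`, `check`, `max_retries`, `StepResult`) is hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # hypothetical normalized image coordinates


@dataclass
class StepResult:
    subtask: str
    point: Point
    attempts: int
    succeeded: bool


def run_pgc_loop(
    instruction: str,
    plan: Callable[[str], List[str]],            # planner: instruction -> subtasks
    ground: Callable[[str, str], Point],         # grounder: (subtask, hint) -> point
    execute: Callable[[str, Point], None],       # low-level controller (assumed)
    check: Callable[[str], Tuple[bool, str]],    # corrector: subtask -> (ok, hint)
    max_retries: int = 2,
) -> List[StepResult]:
    """Run each planned subtask, re-grounding with the corrector's hint on failure."""
    results: List[StepResult] = []
    for subtask in plan(instruction):
        attempts, succeeded, hint = 0, False, ""
        point: Point = (0.0, 0.0)
        # One initial attempt plus up to max_retries corrected attempts.
        while attempts <= max_retries and not succeeded:
            attempts += 1
            point = ground(subtask, hint)
            execute(subtask, point)
            succeeded, hint = check(subtask)
        results.append(StepResult(subtask, point, attempts, succeeded))
        if not succeeded:
            break  # stop the episode if a subtask cannot be recovered
    return results


def demo() -> List[StepResult]:
    """Toy stand-ins for the three roles; 'close drawer' fails once, then succeeds."""
    failures_left = {"close drawer": 1}

    def plan(instruction: str) -> List[str]:
        return [s.strip() for s in instruction.split(" then ")]

    def ground(subtask: str, hint: str) -> Point:
        # Shift the target slightly when the corrector has reported a problem.
        return (0.6, 0.5) if hint else (0.5, 0.5)

    def execute(subtask: str, point: Point) -> None:
        pass  # a real system would command the robot here

    def check(subtask: str) -> Tuple[bool, str]:
        if failures_left.get(subtask, 0) > 0:
            failures_left[subtask] -= 1
            return False, "gripper missed the handle"
        return True, ""

    return run_pgc_loop("open drawer then close drawer", plan, ground, execute, check)
```

Calling `demo()` runs two subtasks; the second one is retried once with a corrected target before the loop moves on. The point of the closed loop is that the corrector's feedback flows back into grounding rather than ending the episode at the first failure.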

Problem

Research questions and friction points this paper is trying to address.

Embodied Intelligence

Foundation Models

Physical Reasoning

Task Planning

Generalization

Innovation

Methods, ideas, or system contributions that make the work stand out.

Embodied Foundation Model

Planner-Grounder-Corrector

Multi-Task Reinforcement Learning