🤖 AI Summary
This work addresses the challenge of building a unified embodied foundation model that can carry out long-horizon real-world tasks through multimodal perception and self-correction. The authors propose Embodied-R1.5 together with a Planner-Grounder-Corrector (PGC) closed-loop framework, in which a single model handles task planning, embodied grounding, and online error correction. The 8B-parameter model is trained with a multi-task balanced reinforcement learning recipe on over 15 billion tokens of multimodal data built by three automated data construction pipelines. It achieves state-of-the-art performance on 16 out of 24 embodied vision-language benchmarks, and the authors report strong results in instruction following, affordance grounding, articulated object manipulation, and long-horizon tasks. The model can be fine-tuned into a VLA with a small amount of data and is evaluated zero-shot on real robots; model weights, datasets, training code, and an evaluation framework are publicly released.
📝 Abstract
We introduce Embodied-R1.5, a unified Embodied Foundation Model (EFM) that integrates comprehensive embodied reasoning capabilities, spanning embodied cognition, task planning, correction, and pointing, within a single architecture toward general physical intelligence. Leveraging three automated data construction pipelines to significantly expand the data coverage of critical capabilities, we build a large-scale data system of over 15B tokens, and design a multi-task balanced RL recipe to alleviate heterogeneous task conflicts. We further introduce a Planner-Grounder-Corrector (PGC) closed-loop framework that enables a single model to autonomously execute and self-correct over long-horizon tasks. With only 8B parameters, Embodied-R1.5 achieves SOTA on 16 out of 24 embodied VLM benchmarks, surpassing leading models like Gemini-Robotics-ER-1.5 and GPT-5.4. Benefiting from the internalized embodied capabilities, Embodied-R1.5 can be fine-tuned into a VLA with only a small amount of data, outperforming leading VLA models like $π_{0.5}$ across 4 popular manipulation benchmark suites. We further conduct extensive zero-shot real-robot experiments, validating performance in instruction following, affordance grounding, articulated object manipulation, and long-horizon complex tasks, demonstrating strong generalization to the physical world. We open-source model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks, to facilitate future research in EFMs.
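To make the Planner-Grounder-Corrector idea concrete, the following is a minimal sketch of what such a closed loop could look like. It is not the authors' implementation: the function `run_pgc_loop`, the `StepResult` record, the `max_retries` parameter, and the four callables are hypothetical names introduced for illustration. The sketch assumes the planner returns a list of subtask strings, the grounder returns a 2D point for a subtask, and success is reported as a boolean by the executor, which is a simplification of judging success from observations. In a single-model setting, `plan`, `ground`, and `correct` would be different prompts to the same model.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

Point = Tuple[float, float]  # assumed: normalized image coordinates


@dataclass
class StepResult:
    subtask: str      # final (possibly corrected) subtask text
    target: Point     # point the grounder produced for it
    attempts: int     # number of execution attempts
    succeeded: bool


def run_pgc_loop(
    instruction: str,
    plan: Callable[[str], List[str]],        # Planner: instruction -> subtasks
    ground: Callable[[str], Point],          # Grounder: subtask -> target point
    execute: Callable[[str, Point], bool],   # low-level controller (hypothetical)
    correct: Callable[[str, Point], str],    # Corrector: failed step -> revised subtask
    max_retries: int = 2,
) -> List[StepResult]:
    """Plan once, then ground/execute each subtask, retrying after correction."""
    results: List[StepResult] = []
    for subtask in plan(instruction):
        current = subtask
        target = ground(current)
        attempts = 0
        while True:
            attempts += 1
            succeeded = execute(current, target)
            if succeeded or attempts > max_retries:
                break
            # Close the loop: revise the failed subtask and re-ground it.
            current = correct(current, target)
            target = ground(current)
        results.append(StepResult(current, target, attempts, succeeded))
        if not succeeded:
            break  # stop the episode if a subtask cannot be recovered
    return results


def demo() -> List[StepResult]:
    """Toy stubs: the first pick attempt fails and is fixed by one correction."""
    def plan(instruction: str) -> List[str]:
        return ["pick up the cup", "place the cup on the shelf"]

    def ground(subtask: str) -> Point:
        return (0.4, 0.6) if subtask.startswith("retry:") else (0.5, 0.5)

    def execute(subtask: str, target: Point) -> bool:
        return subtask != "pick up the cup"

    def correct(subtask: str, target: Point) -> str:
        return "retry: " + subtask

    return run_pgc_loop("put the cup on the shelf", plan, ground, execute, correct)
```

Calling `demo()` runs the loop with the toy stubs. Bounding retries with `max_retries` is a design choice of this sketch that keeps a long-horizon episode from stalling on one unrecoverable step.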