FOUNDER: Grounding Foundation Models in World Models for Open-Ended Embodied Decision Making

📅 2025-07-15
📈 Citations: 0
Influential: 0
🤖 AI Summary
This work addresses the insufficient synergy between foundation models (FMs) and world models (WMs) in embodied intelligence, which hinders open-ended task solving under sparse explicit rewards and complex, high-dimensional observations. We propose the first FM-WM co-grounding framework. Methodologically, a cross-modal mapping network implicitly aligns FM semantic representations to the WM's latent state space, enabling imagination-based state prediction; an intrinsic reward is then derived from predicted temporal distances to guide goal-conditioned policy learning. Our contributions are: (i) the first end-to-end co-grounding framework that jointly integrates FM semantic understanding with WM dynamics modeling; and (ii) an implicit goal-mapping mechanism coupled with a temporal-distance-based intrinsic reward, eliminating reliance on external reward signals. Evaluated on multi-task offline visual control benchmarks, the approach significantly improves cross-domain generalization and robustness to complex observations, achieving higher semantic task completion rates and stronger alignment between intrinsic and ground-truth rewards than prior methods.
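The core idea of the mapping network can be illustrated with a toy sketch. The paper's mapping is a learned network trained end-to-end; here a random linear projection stands in for it, and all names (`init_linear`, `map_embedding`, the dimensions) are illustrative assumptions, not the paper's actual architecture.

```python
import random

def init_linear(in_dim, out_dim, seed=0):
    """Random weights for a linear map (stand-in for the learned mapping network)."""
    rng = random.Random(seed)
    W = [[rng.uniform(-0.1, 0.1) for _ in range(in_dim)] for _ in range(out_dim)]
    b = [0.0] * out_dim
    return W, b

def map_embedding(e, W, b):
    """Project an FM embedding e into the WM's latent state space: g = W e + b."""
    return [sum(w_ij * e_j for w_ij, e_j in zip(row, e)) + b_i
            for row, b_i in zip(W, b)]

fm_embedding = [0.2, -0.5, 0.1, 0.7]        # e.g., a text/video task embedding
W, b = init_linear(in_dim=4, out_dim=3)
goal_state = map_embedding(fm_embedding, W, b)  # mapped goal in WM latent space
```

The mapped `goal_state` then serves as the goal for goal-conditioned policy learning inside the world model's imagination.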

📝 Abstract
Foundation Models (FMs) and World Models (WMs) offer complementary strengths in task generalization at different levels. In this work, we propose FOUNDER, a framework that integrates the generalizable knowledge embedded in FMs with the dynamic modeling capabilities of WMs to enable open-ended task solving in embodied environments in a reward-free manner. We learn a mapping function that grounds FM representations in the WM state space, effectively inferring the agent's physical states in the world simulator from external observations. This mapping enables the learning of a goal-conditioned policy through imagination during behavior learning, with the mapped task serving as the goal state. Our method leverages the predicted temporal distance to the goal state as an informative reward signal. FOUNDER demonstrates superior performance on various multi-task offline visual control benchmarks, excelling in capturing the deep-level semantics of tasks specified by text or videos, particularly in scenarios involving complex observations or domain gaps where prior methods struggle. The consistency of our learned reward function with the ground-truth reward is also empirically validated. Our project website is https://sites.google.com/view/founder-rl.
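The temporal-distance reward described above can be sketched in a few lines. In FOUNDER the distance is a learned predictor of steps-to-goal in latent space; the Euclidean distance below is only a stand-in for that predictor, and the function names are illustrative assumptions.

```python
import math

def temporal_distance(state, goal):
    """Stand-in distance: Euclidean distance in latent space.
    The paper learns a predictor of temporal distance (steps-to-goal) instead."""
    return math.sqrt(sum((s - g) ** 2 for s, g in zip(state, goal)))

def intrinsic_reward(state, goal):
    """Reward grows as the imagined state approaches the mapped goal state."""
    return -temporal_distance(state, goal)

goal = [1.0, 0.0]                                   # mapped task goal in latent space
trajectory = [[0.0, 0.0], [0.5, 0.0], [0.9, 0.0]]   # imagined rollout states
rewards = [intrinsic_reward(s, goal) for s in trajectory]
```

Because the reward is derived entirely from predicted distances to the mapped goal, no external (ground-truth) reward signal is needed during policy learning.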
Problem

Research questions and friction points this paper is trying to address.

How to integrate Foundation Models and World Models for open-ended embodied task solving
How to ground FM semantic representations in the WM's latent state space
How to obtain an informative reward signal for policy learning without external rewards
Innovation

Methods, ideas, or system contributions that make the work stand out.

Co-grounding framework that integrates FM semantic understanding with WM dynamics modeling
Learned mapping function that grounds FM representations in the WM state space
Predicted temporal distance to the goal state used as an intrinsic reward signal