AI Summary
This work addresses the significant performance degradation of foundation model agents when they move from simulation to real-world deployment. The degradation stems from distributional mismatches across the four components of a Markov Decision Process (MDP): observations, actions, transitions, and rewards. The paper formally brings classical sim-to-real theory into this domain and proposes an MDP-based analytical framework that models distribution shift in each of the four spaces and mitigates it with established robustness techniques such as domain randomization. A case study on multilingual tool calling shows how observation gaps can yield behavior that is semantically correct but operationally invalid. The paper further calls for unified terminology and standardized stress-test benchmarks as a theoretical and practical foundation for developing high-assurance autonomous agents.
Abstract
Foundation model agents are increasingly deployed for real-world decision-making, but they suffer from the sim-to-real gap. While robotics and classical control have mature frameworks for addressing this gap, the foundation model community is treating agent robustness as an entirely novel phenomenon. Our paper proposes formalizing the gap between how foundation model agents are trained and evaluated and how they are deployed as a classical sim-to-real problem, structured entirely around the four elements of a Markov Decision Process: Observation, Action, Transition, and Reward. We set out a comprehensive research agenda that translates classical discrepancies into the foundation model domain and advocates adopting established solutions such as domain randomization. We provide concrete examples, such as multilingual tool calling, to demonstrate how severe observation-space gaps lead to operationally invalid actions despite correct semantic intent. Ultimately, this agenda aims to drive a paradigm shift, yielding a unified vocabulary and standardized stress-test benchmarks that foster a new generation of highly trustworthy agents for reliable real-world applications.
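To make the multilingual tool-calling example concrete, the following is a minimal toy sketch, not the paper's experimental setup. It assumes a hypothetical tool registry (`TOOLS`), a stand-in agent (`toy_agent`) that is not a real model, and an assumed failure mode in which a non-English prompt causes the agent to translate the tool identifier. The intent is correct in both cases, but only the English call is executable. The `success_rate` helper hints at how varying the prompt language at evaluation time, in the spirit of domain randomization, would expose the gap.

```python
from dataclasses import dataclass

# Hypothetical tool registry: the runtime only accepts these exact identifiers.
TOOLS = {"get_weather": lambda city: f"weather report for {city}"}


@dataclass
class ToolCall:
    name: str
    argument: str


def execute(call: ToolCall) -> str:
    """Run a tool call; an unknown tool name is operationally invalid."""
    if call.name not in TOOLS:
        return f"error: unknown tool {call.name!r}"
    return TOOLS[call.name](call.argument)


def toy_agent(prompt_language: str, city: str) -> ToolCall:
    """Stand-in for a foundation model agent (assumption, not a real model).

    Assumed failure mode: under a German prompt the agent translates the
    tool identifier along with the rest of its output.
    """
    names = {"en": "get_weather", "de": "wetter_abrufen"}
    return ToolCall(name=names[prompt_language], argument=city)


def success_rate(languages: list[str]) -> float:
    """Fraction of prompt languages for which the tool call executes."""
    results = [execute(toy_agent(lang, "Berlin")) for lang in languages]
    return sum(not r.startswith("error") for r in results) / len(results)


sim_result = execute(toy_agent("en", "Berlin"))   # "simulation": English prompt
real_result = execute(toy_agent("de", "Berlin"))  # "deployment": German prompt
```

In this sketch an English-only evaluation would report a perfect success rate, while evaluating over both languages reveals that half of the calls fail even though the agent chose the right tool in each case.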