JARVIS: A Neuro-Symbolic Commonsense Reasoning Framework for Conversational Embodied Agents

📅 2022-08-28
🏛️ arXiv.org
📈 Citations: 44 (Influential: 3)
🤖 AI Summary
Conversational embodied agents for real-world tasks face intertwined challenges in multimodal perception, long-horizon decision-making, and interpretable reasoning. To address these, we propose a neuro-symbolic fusion framework featuring: (i) the first LLM-driven method for acquiring symbolic representations, jointly modeled with visual-semantic mapping; and (ii) a modular symbolic reasoning mechanism guided by task-level and action-level commonsense knowledge, balancing generalizability, interpretability, and few-shot adaptability. The framework integrates prompt engineering, semantic map construction, a symbolic planning engine, and a neuro-symbolic collaborative reasoning architecture. On the TEACh benchmark, our approach achieves state-of-the-art performance across all three conversational embodied tasks; notably, the success rate on unseen scenes in the EDH setting improves from 6.1% to 15.8%. Furthermore, the framework secured first place in the Alexa Prize SimBot Challenge.
📝 Abstract
Building a conversational embodied agent to execute real-life tasks has been a long-standing yet quite challenging research goal, as it requires effective human-agent communication, multi-modal understanding, long-range sequential decision making, etc. Traditional symbolic methods have scaling and generalization issues, while end-to-end deep learning models suffer from data scarcity and high task complexity, and are often hard to explain. To benefit from both worlds, we propose JARVIS, a neuro-symbolic commonsense reasoning framework for modular, generalizable, and interpretable conversational embodied agents. First, it acquires symbolic representations by prompting large language models (LLMs) for language understanding and sub-goal planning, and by constructing semantic maps from visual observations. Then the symbolic module reasons for sub-goal planning and action generation based on task- and action-level common sense. Extensive experiments on the TEACh dataset validate the efficacy and efficiency of our JARVIS framework, which achieves state-of-the-art (SOTA) results on all three dialog-based embodied tasks, including Execution from Dialog History (EDH), Trajectory from Dialog (TfD), and Two-Agent Task Completion (TATC) (e.g., our method boosts the unseen Success Rate on EDH from 6.1% to 15.8%). Moreover, we systematically analyze the essential factors that affect the task performance and also demonstrate the superiority of our method in few-shot settings. Our JARVIS model ranks first in the Alexa Prize SimBot Public Benchmark Challenge.
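The two-stage pipeline the abstract describes (symbolic representations acquired by prompting an LLM, then symbolic reasoning over a semantic map) can be sketched roughly as follows. This is a minimal illustrative sketch, not the paper's actual implementation: the lookup table stands in for the LLM prompt, and all function names, sub-goal predicates, and map contents are assumptions.

```python
# Illustrative sketch (not JARVIS's code): a dialog utterance becomes symbolic
# (predicate, object) sub-goals -- the paper prompts an LLM for this step, a toy
# lookup table stands in here -- then a symbolic planner expands each sub-goal
# into actions, consulting a semantic map of observed object locations.

def parse_dialog_to_subgoals(utterance: str) -> list[tuple[str, str]]:
    """Stand-in for LLM prompting: map an instruction to (predicate, object) sub-goals."""
    toy_templates = {
        "make coffee": [("pickup", "Mug"),
                        ("place", "CoffeeMachine"),
                        ("toggle", "CoffeeMachine")],
    }
    return toy_templates.get(utterance.lower(), [])

def plan_actions(subgoals, semantic_map):
    """Symbolic planning: expand sub-goals into navigation + manipulation actions."""
    actions = []
    for predicate, obj in subgoals:
        if obj in semantic_map:                      # object already observed on the map
            actions.append(("goto", semantic_map[obj]))
        else:                                        # commonsense fallback: explore first
            actions.append(("explore", obj))
        actions.append((predicate, obj))
    return actions

semantic_map = {"Mug": (3, 1), "CoffeeMachine": (5, 2)}  # object -> grid cell
plan = plan_actions(parse_dialog_to_subgoals("Make coffee"), semantic_map)
```

The separation mirrors the framework's design: the neural side (here, the stand-in parser) produces discrete symbols, so the planning side stays inspectable and can be debugged sub-goal by sub-goal.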
Problem

Research questions and friction points this paper is trying to address.

Develops neuro-symbolic framework for conversational embodied agents
Addresses data scarcity and explainability in task execution
Integrates commonsense reasoning for multimodal language understanding
Innovation

Methods, ideas, or system contributions that make the work stand out.

Neuro-symbolic framework combining LLMs and semantic maps
Symbolic reasoning for sub-goal planning and action generation
Leverages task- and action-level commonsense knowledge
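Action-level commonsense in a symbolic module is commonly realized as precondition/effect rules checked before each action, which is what makes failures interpretable. A minimal sketch under that assumption (the specific rules and state fields are illustrative, not taken from the paper):

```python
# Hedged sketch of action-level commonsense checking: each action's symbolic
# preconditions are verified against the agent's state before execution, so a
# failure is reported with its reason. Rules here are illustrative assumptions.

PRECONDITIONS = {
    "pickup": lambda state, obj: state["holding"] is None,      # hands must be free
    "place":  lambda state, obj: state["holding"] is not None,  # must hold something
    "toggle": lambda state, obj: True,                          # always allowed here
}

EFFECTS = {
    "pickup": lambda state, obj: state.update(holding=obj),
    "place":  lambda state, obj: state.update(holding=None),
    "toggle": lambda state, obj: state["on"].add(obj),
}

def execute(actions, state):
    """Run actions, skipping (and logging) any whose preconditions fail."""
    log = []
    for act, obj in actions:
        if act in PRECONDITIONS and not PRECONDITIONS[act](state, obj):
            log.append((act, obj, "skipped: precondition failed"))
            continue
        EFFECTS.get(act, lambda s, o: None)(state, obj)
        log.append((act, obj, "ok"))
    return log

state = {"holding": None, "on": set()}
log = execute([("pickup", "Mug"), ("pickup", "Knife"), ("place", "Counter")], state)
```

Here the second `pickup` is rejected because the agent is already holding the mug, illustrating how rule-based checks yield explainable behavior rather than silent failures.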
👥 Authors
Kaizhi Zheng, University of California, Santa Cruz (visual and language; robot learning)
Kaiwen Zhou, University of California, Santa Cruz
Jing Gu, University of California, Santa Cruz
Yue Fan, University of California, Santa Cruz
Jialu Wang, University of California, Santa Cruz
Zonglin Di, University of California, Santa Cruz
Xuehai He, Microsoft (machine learning; language and vision; generative AI)
Xin Eric Wang, Assistant Professor, University of California, Santa Barbara; Simular (NLP; CV; ML; language and vision; AI agents)