Text-to-Decision Agent: Learning Generalist Policies from Natural Language Supervision

📅 2025-04-21
📈 Citations: 0
Influential: 0
🤖 AI Summary
Existing approaches to reinforcement learning generalization rely on high-quality samples or environment pre-exploration, entailing prohibitive supervision costs and scaling poorly to unseen tasks. This paper proposes a zero-shot policy learning framework that directly generates decision actions from natural language instructions—without task annotations or environmental pre-exploration. The approach introduces three key innovations: (1) the first language-to-decision contrastive pre-training paradigm; (2) a dynamics-aware universal world model enabling cross-modal alignment between textual semantics and environmental dynamics; and (3) an integrated architecture combining multi-task world model encoding, text-conditioned policy networks, and a CLIP-style alignment mechanism. Evaluated on MuJoCo and Meta-World, the method substantially outperforms supervised fine-tuning, instruction tuning, and imitation learning baselines, achieving, for the first time, genuine text-driven zero-shot generalization across diverse robotic control tasks.

📝 Abstract
RL systems usually tackle generalization by inferring task beliefs from high-quality samples or warmup explorations. This restricted form limits their generality and usability, since such supervision signals are expensive and often infeasible to acquire in advance for unseen tasks. Learning directly from raw text about decision tasks is a promising alternative that leverages a much broader source of supervision. In this paper, we propose Text-to-Decision Agent (T2DA), a simple and scalable framework that supervises generalist policy learning with natural language. We first introduce a generalized world model to encode multi-task decision data into a dynamics-aware embedding space. Then, inspired by CLIP, we predict which textual description goes with which decision embedding, effectively bridging their semantic gap via contrastive language-decision pre-training and aligning the text embeddings to comprehend the environment dynamics. After training the text-conditioned generalist policy, the agent can directly realize zero-shot text-to-decision generation in response to language instructions. Comprehensive experiments on MuJoCo and Meta-World benchmarks show that T2DA facilitates high-capacity zero-shot generalization and outperforms various types of baselines.
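The CLIP-style alignment described in the abstract can be sketched as a symmetric InfoNCE loss over a batch of paired text/decision embeddings. This is a minimal illustration only: the encoder architectures, batch construction, and temperature handling here are assumptions, not the authors' exact implementation.

```python
import numpy as np

def clip_style_loss(text_emb, dec_emb, temperature=0.07):
    """Symmetric contrastive loss: row i of text_emb is paired with row i
    of dec_emb; all other rows in the batch serve as negatives."""
    # L2-normalize both embedding sets so dot products are cosine similarities
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    d = dec_emb / np.linalg.norm(dec_emb, axis=1, keepdims=True)
    # Pairwise similarity logits, scaled by temperature; matches on the diagonal
    logits = t @ d.T / temperature
    n = logits.shape[0]
    idx = np.arange(n)

    def xent(lg):
        # Cross-entropy with the diagonal as the target class
        lg = lg - lg.max(axis=1, keepdims=True)  # numerical stability
        log_probs = lg - np.log(np.exp(lg).sum(axis=1, keepdims=True))
        return -log_probs[idx, idx].mean()

    # Average the text-to-decision and decision-to-text directions
    return 0.5 * (xent(logits) + xent(logits.T))
```

With perfectly aligned pairs (identical embeddings) the loss approaches zero; with unrelated embeddings it approaches the uniform-chance value of log(batch size).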
Problem

Research questions and friction points this paper is trying to address.

Learning generalist policies from natural language supervision
Bridging semantic gap between text and decision embeddings
Enabling zero-shot text-to-decision generation for unseen tasks
Innovation

Methods, ideas, or system contributions that make the work stand out.

Generalized world model encodes multi-task data
Contrastive language-decision pre-training bridges semantics
Text-conditioned policy enables zero-shot decision generation
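At deployment time, the pipeline these bullets describe reduces to: embed the instruction with the contrastively aligned text encoder, then condition the generalist policy on that embedding at every control step. The interface names below are hypothetical; this is a minimal sketch of the zero-shot rollout loop, not the authors' code.

```python
def rollout(env, text_encoder, policy, instruction, max_steps=200):
    """Zero-shot text-to-decision rollout: the frozen, dynamics-aligned text
    encoder maps the instruction to a task embedding, and the generalist
    policy is conditioned on that embedding at every step."""
    z_task = text_encoder(instruction)   # aligned via contrastive pre-training
    obs = env.reset()
    total_reward = 0.0
    for _ in range(max_steps):
        action = policy(obs, z_task)     # text-conditioned action selection
        obs, reward, done = env.step(action)
        total_reward += reward
        if done:
            break
    return total_reward
```

No task annotation or pre-exploration appears anywhere in this loop; the instruction string is the only task-specific input.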
Authors

Shilin Zhang
Department of Control Science and Intelligent Engineering, Nanjing University

Zican Hu
Nanjing University
Reinforcement Learning, Large Language Models

Wenhao Wu
Department of Control Science and Intelligent Engineering, Nanjing University

Xinyi Xie
School of Information Engineering, Nanchang University

Jianxiang Tang
Department of Control Science and Intelligent Engineering, Nanjing University

Chunlin Chen
Nanjing University
Reinforcement Learning, Quantum Control, Mobile Robotics

Daoyi Dong
IEEE Fellow, Professor at University of Technology Sydney / Australian National University, Australia
Quantum Control, Control and Optimisation, Systems Engineering, Machine Learning, Renewable Energy

Yu Cheng
Department of Computer Science and Engineering, The Chinese University of Hong Kong

Zhenhong Sun
ANU/UNSW
AIGC, Computer Vision, Deep Learning

Zhi Wang
Department of Control Science and Intelligent Engineering, Nanjing University