PriorZero: Bridging Language Priors and World Models for Decision Making

📅 2026-05-12

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

Existing approaches struggle to integrate the static prior knowledge of large language models (LLMs) into the dynamic decision-making process of reinforcement learning, which often leads to inefficient exploration and unstable training. This work proposes PriorZero, a framework that injects LLM-derived conceptual priors directly into the root node of Monte Carlo Tree Search (MCTS). By decoupling the rollout and training pipelines and using the world model's value estimates as fine-grained credit assignment signals, PriorZero lets semantic guidance and environmental adaptation be optimized together. Evaluations on the Jericho text-based adventure games and BabyAI instruction-following tasks show consistent improvements in both exploration efficiency and final performance.
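As a rough illustration of root-prior injection, the sketch below blends an LLM-derived action prior with a world-model policy prior at the root and then applies a standard PUCT selection rule. The blending formula, the `mix` weight, the `c_puct` constant, and all function names are assumptions made for illustration; the summary does not specify how PriorZero combines the two priors.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def inject_root_prior(wm_prior, llm_logprobs, mix=0.5):
    """Blend the world-model prior with an LLM prior (root node only).

    Assumption: a convex combination with weight `mix`; mix=0 keeps the
    world-model prior unchanged, mix=1 replaces it with the LLM prior.
    """
    llm_prior = softmax(llm_logprobs)
    blended = [(1 - mix) * p + mix * q for p, q in zip(wm_prior, llm_prior)]
    total = sum(blended)
    return [b / total for b in blended]

def puct_select(priors, q_values, visit_counts, c_puct=1.25):
    """Pick the child index with the highest PUCT score."""
    total_visits = sum(visit_counts)
    scores = [
        q + c_puct * p * math.sqrt(total_visits + 1) / (1 + n)
        for p, q, n in zip(priors, q_values, visit_counts)
    ]
    return max(range(len(scores)), key=scores.__getitem__)

# Hypothetical root with four candidate actions: the world model is
# uninformed, while the LLM strongly prefers action 0.
wm_prior = [0.25, 0.25, 0.25, 0.25]
llm_logprobs = [0.0, -5.0, -5.0, -5.0]
root_prior = inject_root_prior(wm_prior, llm_logprobs, mix=0.5)
first_action = puct_select(root_prior, [0.0] * 4, [0] * 4)
```

In this sketch only the root node receives the blended prior; deeper nodes would keep using the world model's own policy prior, which matches the stated goal of focusing search on semantically promising actions without giving up the world model's lookahead.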

📝 Abstract

Leveraging the rich world knowledge of Large Language Models (LLMs) to enhance Reinforcement Learning (RL) agents offers a promising path toward general intelligence. However, a fundamental prior-dynamics mismatch hinders existing approaches: static LLM knowledge cannot directly adapt to the complex transition dynamics of long-horizon tasks. Using LLM priors as fixed policies limits exploration diversity, because the prior is blind to environment-specific dynamics, while end-to-end fine-tuning suffers from optimization instability and credit assignment issues. To bridge this gap, we propose PriorZero, a unified framework that integrates LLM-derived conceptual priors into world-model-based planning through a decoupled rollout-training design. During rollout, a novel root-prior injection mechanism incorporates LLM priors exclusively at the root node of Monte Carlo Tree Search (MCTS), focusing search on semantically promising actions while preserving the world model's deep lookahead capability. During training, PriorZero decouples world-model learning from LLM adaptation: the world model is continuously refined on interaction data to jointly improve its dynamics, policy, and value predictions; its value estimates are then leveraged to provide fine-grained credit assignment signals for stable LLM fine-tuning via alternating optimization. Experiments across diverse benchmarks, including text-based adventure games in Jericho and instruction-following gridworld tasks in BabyAI, demonstrate that PriorZero consistently improves both exploration efficiency and asymptotic performance, establishing a promising framework for LLM-empowered decision-making. Our code is available at https://github.com/opendilab/LightZero.
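To make the credit assignment idea concrete, here is a minimal sketch of how per-step value estimates from a world model could be turned into advantage signals for LLM fine-tuning. It uses a one-step temporal-difference advantage and an advantage-weighted log-likelihood loss, both of which are assumptions chosen for illustration; the abstract does not state the exact estimator or loss that PriorZero uses, and the function names are hypothetical.

```python
def td_advantages(rewards, values, gamma=0.99):
    """One-step TD advantages: A_t = r_t + gamma * V(s_{t+1}) - V(s_t).

    `values` holds one more entry than `rewards` (the bootstrap value of
    the final state, 0.0 if the episode terminated).
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have len(rewards) + 1 entries")
    return [
        r + gamma * values[t + 1] - values[t]
        for t, r in enumerate(rewards)
    ]

def advantage_weighted_loss(action_logprobs, advantages):
    """Policy-gradient style loss: mean of -A_t * log pi(a_t | s_t)."""
    terms = [-a * lp for lp, a in zip(action_logprobs, advantages)]
    return sum(terms) / len(terms)

# Hypothetical two-step episode with world-model value estimates.
rewards = [0.0, 1.0]
values = [0.5, 0.8, 0.0]        # V(s_0), V(s_1), V(terminal)
advantages = td_advantages(rewards, values, gamma=1.0)
loss = advantage_weighted_loss([-1.0, -2.0], advantages)
```

Under alternating optimization, a step like this would run on the LLM while the world model is held fixed, and the world model would then be refined on new interaction data before the next LLM update.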

Problem

Research questions and friction points this paper is trying to address.

language priors

world models

reinforcement learning

prior-dynamics mismatch

decision making

Innovation

Methods, ideas, or system contributions that make the work stand out.

LLM priors

world models

Monte Carlo Tree Search