Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents

📅 2026-06-03

📈 Citations: 0

✨ Influential: 0

🤖 AI Summary

This work addresses the high cost and risk of evaluating large language model (LLM) agents online in multi-turn interactive environments. To estimate a policy’s value without real-world interaction, the authors propose the Autoregressive Diffusion World Model (ADWM) framework, which learns from pre-collected trajectories and produces rollouts in which a simulated environment and the agent under evaluation alternate. ADWM models each state transition as a denoising process conditioned on the agent’s policy, combining latent diffusion, an autoregressive rollout structure, and a policy-conditioned score function so that discrete textual actions are sampled from the policy itself while causal temporal order is preserved. The design is intended to mitigate error accumulation across steps. The reported experiments indicate that ADWM estimates policy value accurately across diverse multi-turn LLM agent tasks, improving the reliability of offline evaluation.

📝 Abstract

Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation framework that estimates the performance of a new LLM agent policy purely from pre-collected trajectories. The core idea is to learn a latent diffusion world model that simulates how the environment responds to the evaluation policy, without ever executing it in the real environment. Existing diffusion-based OPE methods guide full trajectories in a single pass by jointly diffusing states and actions, an assumption that breaks down for LLM agents whose actions are discrete text that must be sampled from the policy after observing the environment. Unlike autoregressive world models that suffer from compounding errors, ADWM models each transition as an independent denoising process, enabling reliable step-by-step rollouts where the world model and agent alternate in causal order. Crucially, the LLM agent under evaluation directly guides the diffusion generation at each step via a policy-conditioned score function, ensuring that simulated trajectories accurately reflect its decision-making patterns. Empirically, ADWM achieves accurate value estimates and reliable evaluation across diverse multi-turn agent tasks, demonstrating its promise as a practical framework for offline LLM agent evaluation.
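
The alternating rollout described in the abstract can be illustrated with a small toy sketch. This is not the authors' implementation: every name here (`toy_policy`, `toy_denoiser`, `toy_reward`, `sample_next_state`, `estimate_value`), the action set, the update rule, and the reward are hypothetical stand-ins. In particular, conditioning the denoiser on the sampled action is a simplification of the policy-conditioned score function, whose actual form is not given in this summary. The sketch only shows the control flow: the agent samples a discrete text action after observing the state, the world model generates the next latent state with its own denoising chain, and the policy value is estimated as the average return over simulated rollouts.

```python
import math
import random

# Hypothetical discrete text actions and their effect on the toy latent state.
ACTIONS = ["search", "click", "answer"]
SHIFT = {"search": 0.1, "click": -0.1, "answer": 0.0}


def toy_policy(state, rng):
    """Stand-in for the LLM agent: sample a text action after seeing the state."""
    s = sum(state)
    weights = [math.exp(s), math.exp(-s), 1.0]
    return rng.choices(ACTIONS, weights=weights, k=1)[0]


def toy_denoiser(x, state, action, k):
    """Stand-in for a learned denoiser: predict the clean next latent state.

    A real model would be a network trained on pre-collected trajectories;
    this toy version ignores the noisy input x and the step index k.
    """
    return [0.9 * s + SHIFT[action] for s in state]


def sample_next_state(state, action, rng, num_steps=10):
    """One transition = one reverse denoising chain started from pure noise."""
    x = [rng.gauss(0.0, 1.0) for _ in state]
    for k in range(num_steps, 0, -1):
        x0_hat = toy_denoiser(x, state, action, k)
        alpha = 1.0 / k                      # step size, reaches 1 at k = 1
        sigma = 0.1 * (k - 1) / num_steps    # injected noise, reaches 0 at k = 1
        x = [xi + alpha * (x0i - xi) + sigma * rng.gauss(0.0, 1.0)
             for xi, x0i in zip(x, x0_hat)]
    return x


def toy_reward(state, action):
    """Hypothetical reward: 1 whenever the agent chooses to answer."""
    return 1.0 if action == "answer" else 0.0


def rollout(init_state, horizon, rng):
    """Alternate agent and world model in causal order; return the total reward."""
    state, total = init_state, 0.0
    for _ in range(horizon):
        action = toy_policy(state, rng)                 # agent acts first
        state = sample_next_state(state, action, rng)   # world model responds
        total += toy_reward(state, action)
    return total


def estimate_value(num_rollouts=100, horizon=5, seed=0):
    """Monte Carlo estimate of policy value from simulated rollouts only."""
    rng = random.Random(seed)
    returns = [rollout([0.0, 0.0], horizon, rng) for _ in range(num_rollouts)]
    return sum(returns) / len(returns)


if __name__ == "__main__":
    print(estimate_value())
```

Because each call to `sample_next_state` restarts from fresh noise, every transition in this sketch is its own denoising chain, while the outer loop in `rollout` stays autoregressive over steps. No real environment is queried at any point.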

Problem

Research questions and friction points this paper is trying to address.

off-policy evaluation

LLM agents

world models

multi-turn interaction

offline evaluation

Innovation

Methods, ideas, or system contributions that make the work stand out.

Autoregressive Diffusion

World Model

Off-Policy Evaluation