Autoregressive Diffusion World Models for Off-Policy Evaluation of LLM Agents
Pith reviewed 2026-06-28 02:50 UTC · model grok-4.3
The pith
ADWM enables accurate offline evaluation of LLM agents by simulating step-by-step trajectories with a policy-conditioned diffusion world model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that modeling each transition in an LLM agent trajectory as an independent denoising process in a latent diffusion world model, with direct conditioning from the evaluation policy's score function, allows the generation of simulated trajectories that accurately reflect the policy's behavior and yield precise value estimates.
What carries the argument
The autoregressive diffusion world model that denoises one transition at a time while the LLM agent guides the process through policy-conditioned scores, alternating with the environment simulation.
Load-bearing premise
That independent per-transition denoising combined with policy guidance at each step generates rollouts whose statistics match those of real interactions with the evaluation policy.
What would settle it
Run the same policies both in the real environment and through ADWM simulations on identical tasks and compare the resulting value estimates; large differences would indicate the simulations do not accurately reflect policy behavior.
Figures
read the original abstract
Evaluating large language model (LLM) agents in multi-turn interactive environments is expensive and risky, as it requires online environment interaction. We propose ADWM (Autoregressive Diffusion World Model), an evaluation framework that estimates the performance of a new LLM agent policy purely from pre-collected trajectories. The core idea is to learn a latent diffusion world model that simulates how the environment responds to the evaluation policy, without ever executing it in the real environment. Existing diffusion-based OPE methods guide full trajectories in a single pass by jointly diffusing states and actions, an assumption that breaks down for LLM agents whose actions are discrete text that must be sampled from the policy after observing the environment. Unlike autoregressive world models that suffer from compounding errors, ADWM models each transition as an independent denoising process, enabling reliable step-by-step rollouts where the world model and agent alternate in causal order. Crucially, the LLM agent under evaluation directly guides the diffusion generation at each step via a policy-conditioned score function, ensuring that simulated trajectories accurately reflect its decision-making patterns. Empirically, ADWM achieves accurate value estimates and evaluation reliability across diverse multi-turn agent tasks, demonstrating its promise as a practical framework for offline LLM agent evaluation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes ADWM (Autoregressive Diffusion World Model), a framework for off-policy evaluation of LLM agents from pre-collected trajectories. It learns a latent diffusion world model that simulates environment responses via per-transition independent denoising processes, with the evaluation policy providing direct guidance through a policy-conditioned score function at each step. This enables causal step-by-step rollouts alternating between the world model and agent, claimed to avoid compounding errors and yield accurate value estimates for multi-turn tasks without online interaction.
Significance. If the empirical claims hold, the work addresses a practically important problem in safe, low-cost evaluation of LLM agents. The combination of independent transition denoising with explicit policy guidance offers a targeted solution to limitations of prior diffusion-based OPE methods on discrete actions and multi-turn settings, and could serve as a reusable offline evaluation tool.
major comments (2)
- [Abstract and §3] Abstract and §3 (method): the claim that independent per-transition denoising plus policy guidance produces reliable multi-turn rollouts without compounding errors is load-bearing for the central contribution, yet the manuscript provides no quantitative analysis of rollout error accumulation as a function of horizon length or comparison against joint-trajectory diffusion baselines on the same metric.
- [§4] §4 (experiments): the assertion of 'accurate value estimates and evaluation reliability across diverse multi-turn agent tasks' is presented without reported numerical values for value estimation error, confidence intervals, baseline comparisons (e.g., standard OPE or autoregressive world models), dataset sizes, or ablation results on the policy-guidance component.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which identify areas where additional empirical support would strengthen the central claims. We address each point below and will revise the manuscript to incorporate the requested analyses.
read point-by-point responses
-
Referee: [Abstract and §3] Abstract and §3 (method): the claim that independent per-transition denoising plus policy guidance produces reliable multi-turn rollouts without compounding errors is load-bearing for the central contribution, yet the manuscript provides no quantitative analysis of rollout error accumulation as a function of horizon length or comparison against joint-trajectory diffusion baselines on the same metric.
Authors: We agree that explicit quantitative analysis of rollout error accumulation would provide stronger validation of the independent-denoising design. Section 3 motivates the approach via the per-transition factorization and policy-conditioned guidance, but does not include horizon-dependent error curves or direct comparisons to joint-trajectory diffusion models. In the revised manuscript we will add these: (i) plots of state/action reconstruction error versus rollout horizon on held-out trajectories, and (ii) side-by-side evaluation against a joint-trajectory diffusion baseline using the same error metric and datasets. revision: yes
-
Referee: [§4] §4 (experiments): the assertion of 'accurate value estimates and evaluation reliability across diverse multi-turn agent tasks' is presented without reported numerical values for value estimation error, confidence intervals, baseline comparisons (e.g., standard OPE or autoregressive world models), dataset sizes, or ablation results on the policy-guidance component.
Authors: The current experimental section reports qualitative and aggregate performance but omits the detailed numerical reporting requested. We will expand §4 to include: tables with value-estimation error (e.g., MSE or absolute error) and 95% confidence intervals, explicit dataset sizes, comparisons against standard OPE estimators and autoregressive world-model baselines, and an ablation isolating the policy-guidance term. These additions will be placed in the main text or a dedicated appendix table. revision: yes
Circularity Check
No significant circularity
full rationale
The provided abstract and description contain no equations, derivations, or self-citations that reduce any claimed result to a fitted input or prior result by construction. The method is presented as a novel modeling choice (independent per-transition denoising with policy guidance) whose performance is asserted via empirical evaluation on tasks, without any load-bearing mathematical step that equates output to input by definition. This is the common case of a self-contained empirical proposal with no detectable circularity in the derivation chain.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Is Conditional Generative Modeling all you need for Decision-Making?
Anurag Ajay, Yilun Du, Abhi Gupta, Joshua Tenenbaum, Tommi Jaakkola, and Pulkit Agrawal. Is conditional generative modeling all you need for decision-making?arXiv preprint arXiv:2211.15657,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Combating the Compounding-Error Problem with a Multi-step Model
Kavosh Asadi, Dipendra Misra, Seungchan Kim, and Michel L Littman. Combating the compounding-error problem with a multi-step model.arXiv preprint arXiv:1905.13320,
work page internal anchor Pith review Pith/arXiv arXiv 1905
-
[3]
Better than your teacher: Llm agents that learn from privileged ai feedback
Sanjiban Choudhury and Paloma Sodhi. Better than your teacher: Llm agents that learn from privileged ai feedback. arXiv preprint arXiv:2410.05434,
-
[4]
Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Let- man, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
Mastering Diverse Domains through World Models
Danijar Hafner, Jurgis Pasukonis, Jimmy Ba, and Timothy Lillicrap. Mastering diverse domains through world models. arXiv preprint arXiv:2301.04104,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Classifier-Free Diffusion Guidance
Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance.arXiv preprint arXiv:2207.12598,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Siqiao Huang, Jialong Wu, Qixing Zhou, Shangchen Miao, and Mingsheng Long. Vid2world: Crafting video diffusion models to interactive world models.arXiv preprint arXiv:2505.14357,
-
[8]
Policy-guided diffusion.arXiv preprint arXiv:2404.06356,
9 Matthew Thomas Jackson, Michael Tryfan Matthews, Cong Lu, Benjamin Ellis, Shimon Whiteson, and Jakob Foerster. Policy-guided diffusion.arXiv preprint arXiv:2404.06356,
-
[9]
Planning with Diffusion for Flexible Behavior Synthesis
Michael Janner, Yilun Du, Joshua B Tenenbaum, and Sergey Levine. Planning with diffusion for flexible behavior synthesis.arXiv preprint arXiv:2205.09991,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?arXiv preprint arXiv:2310.06770,
work page internal anchor Pith review Pith/arXiv arXiv
-
[11]
Transformers are sample-efficient world models.arXiv preprint arXiv:2209.00588, 2022
Vincent Micheli, Eloi Alonso, and François Fleuret. Transformers are sample-efficient world models.arXiv preprint arXiv:2209.00588,
-
[12]
Representation Learning with Contrastive Predictive Coding
Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748,
work page internal anchor Pith review Pith/arXiv arXiv
-
[13]
Hyperparameter selection for offline reinforcement learning.arXiv preprint arXiv:2007.09055,
Tom Le Paine, Cosmin Paduraru, Andrea Michi, Caglar Gulcehre, Konrad Zolna, Alexander Novikov, Ziyu Wang, and Nando de Freitas. Hyperparameter selection for offline reinforcement learning.arXiv preprint arXiv:2007.09055,
-
[14]
ALFWorld: Aligning Text and Embodied Environments for Interactive Learning
Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[15]
Model-based reinforcement learning with an approximate, learned model
Leonid Kuvayev Rich Sutton. Model-based reinforcement learning with an approximate, learned model. InProceedings of the ninth Yale workshop on adaptive and learning systems, volume 1996, pages 101–105,
1996
-
[16]
Cameron V oloshin, Hoang M Le, Nan Jiang, and Yisong Yue. Empirical study of off-policy policy evaluation for reinforcement learning.arXiv preprint arXiv:1911.06854,
-
[17]
Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298,
Ruoyao Wang, Peter Jansen, Marc-Alexandre Côté, and Prithviraj Ammanabrolu. Scienceworld: Is your agent smarter than a 5th grader? InProceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 11279–11298,
2022
-
[18]
Hotpotqa: A dataset for diverse, explainable multi-hop question answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. InProceedings of the 2018 conference on empirical methods in natural language processing, pages 2369–2380,
2018
-
[19]
WebArena: A Realistic Web Environment for Building Autonomous Agents
Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854,
work page internal anchor Pith review Pith/arXiv arXiv
-
[20]
TY t=1 πb(at |h t)P(o t+1 |h t, at)· TY t=1 πe(at |h t) πb(at |h t) =p πb(τ)· TY t=1 πe(at |h t) πb(at |h t) ,(25) where the environment transitionP(o t+1 |h t, at)cancels between numerator and denominator. Taking logarithms, logp πe(τ) = logp πb(τ) + TX t=1 logπ e(at |h t)− TX t=1 logπ b(at |h t).(26) Equation (26) is the starting point shared by importa...
2024
-
[21]
All non-linearities are SiLU except the IDM / policy / projector heads, which use ReLU and GELU respectively
and zero-initialised output projection; the IDM and BC heads are 2-layer MLPs over (z, h). All non-linearities are SiLU except the IDM / policy / projector heads, which use ReLU and GELU respectively. Table 4: Per-module parameter counts in the ADWMworld model (7.38M parameters total ford z=64). Module Function Params (M) % of total Observation encoder (f...
2022
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.