
arxiv: 2605.20506 · v1 · pith:HRKHKQ6Rnew · submitted 2026-05-19 · 💻 cs.LG · cs.CL

Reinforcing Human Behavior Simulation via Verbal Feedback

Pith reviewed 2026-05-21 07:17 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords reinforcement learning · verbal feedback · human behavior simulation · large language models · SOUL benchmark · DITTO · GRPO · persona simulation

The pith

Treating verbal feedback as a core reinforcement signal trains LLMs to produce more human-like behavior in simulations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that LLMs used for simulating humans can improve by learning directly from verbal critiques instead of scalar rewards suited to code or math. DITTO implements this by having the model generate a response, receive natural language feedback, then create an improved version that is jointly optimized back into the policy. The result is a base model that embeds the lessons from feedback and no longer needs it at test time. A new benchmark suite called SOUL supplies tasks in theory of mind, role play, social skills, and persona simulation to measure the gains. If the approach holds, simulated users, patients, and students become noticeably more realistic without extra prompting or external guidance.

Core claim

DITTO trains by rolling out an initial behavior, receiving verbal feedback, generating a feedback-conditioned improved rollout, and jointly optimizing both outputs with GRPO so that the verbal guidance is distilled into the base policy for use without feedback at test time.
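
The data flow in this claim can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: `policy`, `judge`, and `score` are hypothetical stand-in callables, the feedback prompt template is invented for the example, and the toy functions exist only so the sketch runs.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Rollout:
    prompt: str    # prompt the response is trained against (no feedback)
    response: str
    reward: float
    role: str      # "student" (no feedback seen) or "teacher" (feedback seen)


def ditto_rollouts(
    prompt: str,
    policy: Callable[[str], str],        # hypothetical: prompt -> response
    judge: Callable[[str, str], str],    # hypothetical: (prompt, response) -> verbal feedback
    score: Callable[[str, str], float],  # hypothetical: (prompt, response) -> scalar reward
) -> List[Rollout]:
    """Collect one student/teacher pair for a single prompt."""
    # 1. Student rollout from the plain prompt.
    y0 = policy(prompt)
    r0 = score(prompt, y0)
    # 2. Verbal feedback on the student rollout.
    h = judge(prompt, y0)
    # 3. Teacher rollout from the same policy, conditioned on the feedback.
    y1 = policy(prompt + "\n\nFeedback on a previous attempt:\n" + h)
    r1 = score(prompt, y1)
    # 4. Both rollouts are stored against the feedback-free prompt (an
    #    assumption of this sketch) so an optimizer can train on them jointly.
    return [
        Rollout(prompt, y0, r0, "student"),
        Rollout(prompt, y1, r1, "teacher"),
    ]


# Toy stand-ins so the sketch runs end to end.
def toy_policy(p: str) -> str:
    return "Sorry, that was rude of me." if "Feedback" in p else "Whatever."


def toy_judge(p: str, y: str) -> str:
    return "The reply sounds dismissive; acknowledge the other person."


def toy_score(p: str, y: str) -> float:
    return 1.0 if "Sorry" in y else 0.0


pair = ditto_rollouts(
    "Your friend says you hurt their feelings.", toy_policy, toy_judge, toy_score
)
```

Storing the teacher response against the plain prompt is what would let the policy reproduce the improved behavior without feedback at test time; whether the paper does exactly this is not stated on this page.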

What carries the argument

Feedback-conditioned improved rollout generation optimized jointly with GRPO to distill verbal signals into the policy.
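
GRPO scores each sampled response relative to the other responses drawn for the same prompt. The sketch below shows that group-relative normalization. How DITTO forms the group is not specified on this page, so the example assumes student and teacher rollouts are normalized together; the reward values are hypothetical.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical rewards for one prompt: three student rollouts, then three
# feedback-conditioned teacher rollouts, normalized together as one group.
student_rewards = [0.2, 0.4, 0.3]
teacher_rewards = [0.6, 0.7, 0.5]
advantages = group_relative_advantages(student_rewards + teacher_rewards)
```

In this toy group the teacher rollouts sit above the group mean and receive positive advantages, so the update would push the policy toward the feedback-improved responses.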

If this is right

  • The model internalizes verbal lessons so that improved human-like behavior appears without any feedback supplied at inference time.
  • Performance rises across tasks spanning theory of mind, character role play, social skills, learner simulation, user simulation, and persona simulation.
  • The same training loop can be applied whenever the desired output is subjective or multi-faceted rather than strictly verifiable.
  • Verbal feedback becomes a reusable training signal that does not need to be converted into numeric rewards.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may generalize to other subjective domains such as ethical reasoning or creative collaboration where scalar rewards are hard to define.
  • Training pipelines could shift from reward-model engineering toward curating natural language critique datasets.
  • Interactive agents might be fine-tuned periodically with logged human comments rather than static preference data.
  • The approach raises the question of how much of human social learning can be captured by distilling one round of verbal correction.

Load-bearing premise

Gains on the SOUL tasks and metrics reflect genuine increases in human-likeness that hold for new scenarios and different sources of feedback.

What would settle it

Independent human raters find DITTO outputs less human-like than the base model on simulation tasks drawn from domains outside the SOUL suite.
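
One simple way to analyze such a rater study is a paired preference count with an exact sign test. The sketch below is an illustration only: the vote labels and the vote counts are hypothetical, and the paper does not describe this analysis.

```python
from math import comb
from typing import List, Tuple


def sign_test_p_value(wins: int, losses: int) -> float:
    """Two-sided exact sign test p-value; ties are dropped beforehand."""
    n = wins + losses
    if n == 0:
        return 1.0
    k = min(wins, losses)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def tally(preferences: List[str]) -> Tuple[int, int]:
    """Count rater votes; each entry is 'ditto', 'base', or 'tie'."""
    return preferences.count("ditto"), preferences.count("base")


# Hypothetical votes, for illustration only (not data from the paper).
votes = ["base"] * 9 + ["ditto"] * 1 + ["tie"] * 2
ditto_wins, base_wins = tally(votes)
p = sign_test_p_value(ditto_wins, base_wins)
```

A small p-value with the base model ahead would be the outcome this section describes; a real study would also need enough raters and out-of-suite scenarios to make the count meaningful.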

Figures

Figures reproduced from arXiv: 2605.20506 by Graham Neubig, Haojia Sun, Jiarui Liu, Longqi Yang, Maarten Sap, Mengting Wan, Pei Zhou, Qianou Ma, Sean Welleck, Sherry Wu, Sihao Chen, Weihua Du, Weiwei Sun, Xuhui Zhou, Yiming Yang, Yiqing Xie.

Figure 1: Benchmark results on human simulation tasks. DITTO wins 6 of 10 benchmarks, …
Figure 2: Overview of DITTO. ➀ Given prompt x from SOUL, the policy generates a student rollout y0, scored as r0. ➁ An LLM judge produces verbal feedback: per-dimension critiques and a reflection h. ➂ h is concatenated with x and fed into πθ to generate a teacher rollout y1, scored as r1. ➃ Both are jointly optimized via GRPO. At test time, no feedback is needed.
    However, recent efforts have revealed systematic Sim2…
Figure 3: Overview of SOUL, our unified benchmark and training data suite spanning 10 tasks across six categories of human simulation.
    … for evaluation and a curated training set for RL. When training splits are available, we use them directly; otherwise, we construct training contexts from held-out evaluation data or related datasets. Full task descriptions, examples, and statistics are provided in Section C. Theory …
Figure 4: Ablation study on Sotopia, comparing different RL algorithms.
Figure 5: Score delta between the feedback-conditioned teacher response and the original …
Figure 6: Training results on Sotopia
Figure 7: Training results on UserLLM
    Future Work. Promising directions include (1) incorporating human feedback to correct shared judge biases, (2) extending to multilingual and cross-cultural social simulation, (3) developing more efficient jury mechanisms (e.g., adaptive judge selection), and (4) exploring whether the social reasoning capabilities learned through our approach transfer to downstream agent tasks. B…
Figure 8: Training results on SOUL tasks

    Benchmark    GPT-5.4  GPT-5 nano  Sotopia-RL 7B  HumanLM 8B  HER 32B  Coser 8B  Base 8B  GRPO 8B  DITTO 8B
    FanToM       0.900    0.720       0.150          0.780       0.550    0.030     0.780    0.940    0.950
    HiToM        0.700    0.370       0.320          0.510       0.560    0.020     0.580    0.770    0.780
    ToMi         0.880    0.850       0.610          0.720       0.630    0.430     0.680    0.820    0.930
    CoSER        0.659    0.352       0.296          0.227       0.419    0.305     0.435    0.541    0.512
    LifeChoices  0.870    0.600       0.670          0.660       0.750    …
Figure 9: CoSER example from Pride and Prejudice. The model captures the general narrative arc (reading Mr. Collins’ letter) but softens Mr. Bennet’s distinctive sardonic personality into generic warmth, resulting in low Character Fidelity (34.5). The judge identifies this as the primary failure mode: the character behaves too much like a “helpful AI assistant” rather than the complex, detached patriarch of the origi…
Figure 10: LifeChoice example. The model receives ∼25K characters of character biography and must predict Amelia’s decision. All four options are psychologically plausible (the distractors represent reasonable alternatives), but only one matches the character’s actual choice in the novel. Selecting the correct answer requires understanding Amelia’s personality, values, and character arc.
    Metrics. MirrorBench combines …
Figure 11: MirrorBench (Arena) example. This Arena conversation has a single user turn.
Figure 12: MirrorBench (ClariQ) example. All 4 user turns are generated; assistant turns …
Figure 13: MirrorBench (OASST1) example. Both user turns are generated; assistant turns …
Figure 14: MirrorBench (QuLAC) example. QuLAC conversations consist of one assistant …
Figure 15: UserLLM (CSQA) example. The assistant feigns uncertainty to test whether the …
Figure 16: UserLLM (NQ) example. The assistant deflects the user’s factoid question and …
Figure 17: UserLLM (PRISM) examples. Example A: First-turn generation from a high-level intent; the model rephrases naturally without copying intent keywords. Example B: Sequential turn with conversation history; the model generates a natural follow-up but fails to predict conversation termination (the ground truth marks this as the last turn, but the model does not emit the <|endconversation|> token). This terminat…
Original abstract

Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.
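
The abstract does not say how the "average 36% improvement" is aggregated. The sketch below shows two common readings, using only the four rows that are complete in the Figure 8 text on this page. The column alignment (Base 8B, DITTO 8B, GPT-5.4) is inferred from the flattened header and is an assumption, and the remaining six tasks are truncated in the source, so this cannot reproduce the reported figure.

```python
from statistics import mean

# (Base 8B, DITTO 8B, GPT-5.4) scores for the four rows that are complete in
# the Figure 8 text; column alignment is inferred from the flattened header.
scores = {
    "FanToM": (0.780, 0.950, 0.900),
    "HiToM": (0.580, 0.780, 0.700),
    "ToMi": (0.680, 0.930, 0.880),
    "CoSER": (0.435, 0.512, 0.659),
}

# Reading 1: average the per-task relative gains over the base model.
per_task_gain = {t: (d - b) / b for t, (b, d, _) in scores.items()}
mean_of_gains = mean(per_task_gain.values())

# Reading 2: relative gain of the average score.
gain_of_means = (
    mean(d for _, d, _ in scores.values())
    / mean(b for b, _, _ in scores.values())
    - 1
)

# Tasks in this subset where DITTO 8B scores above GPT-5.4.
wins_vs_gpt = [t for t, (_, d, g) in scores.items() if d > g]
```

On this four-row subset both readings come out near 28%, and DITTO 8B is ahead of GPT-5.4 on three of the four tasks. The two readings can diverge more when per-task baselines differ widely, which is one reason the aggregation method is worth stating.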

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes DITTO, a reinforcement learning method for enhancing LLMs' simulation of human behavior by treating verbal feedback as a primary signal. It describes generating improved rollouts conditioned on verbal feedback and jointly optimizing them with the base policy using GRPO. The SOUL benchmark is introduced, covering 10 tasks in six categories, with reported results showing a 36% average improvement over the base model and surpassing GPT-5.4 on 6 out of 10 benchmarks.

Significance. Should the experimental results prove robust upon closer inspection, this work would be significant in shifting RL applications for LLMs from verifiable domains like code to subjective, multi-faceted verbal feedback for behavioral simulation. The SOUL benchmark could serve as a valuable resource for future research in human-like AI agents.

major comments (2)
  1. Abstract and Experiments section: The abstract states clear performance numbers (36% average improvement, exceeding GPT-5.4 on 6/10 SOUL tasks) but provides no details on experimental controls, statistical significance, feedback quality, or potential confounds. This leaves the central claim without visible supporting derivation or data handling steps, as highlighted by the low soundness rating. Please expand on rollout generation, GRPO optimization details, number of runs, and baseline controls.
  2. SOUL benchmark description (likely §4): The evaluation risks overstated generalization because the 10 tasks (Theory of Mind, role play, etc.) and metrics may not separate training feedback distributions from test scenarios. Clarify whether held-out task splits, diverse external feedback sources, or independent human calibration of metrics are used, since the method jointly optimizes feedback-conditioned rollouts with the base policy.
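
The run-count and significance details requested in the first major comment could be reported along the following lines. This is a sketch with hypothetical scores for three seeds; it is not the paper's analysis. The critical value is the standard two-sided 5% threshold for Student's t with 2 degrees of freedom.

```python
from math import sqrt
from statistics import mean, stdev
from typing import List


def paired_t_statistic(treated: List[float], control: List[float]) -> float:
    """t statistic for paired samples (e.g. same seed, two methods)."""
    diffs = [t - c for t, c in zip(treated, control)]
    return mean(diffs) / (stdev(diffs) / sqrt(len(diffs)))


# Hypothetical aggregate scores for three seeds (not values from the paper).
base_runs = [0.60, 0.62, 0.61]
ditto_runs = [0.80, 0.83, 0.84]

summary = (mean(ditto_runs), stdev(ditto_runs))  # mean and std across seeds
t_stat = paired_t_statistic(ditto_runs, base_runs)

# Two-sided 5% critical value of Student's t with 2 degrees of freedom.
T_CRIT_DF2 = 4.303
significant = abs(t_stat) > T_CRIT_DF2
```

With only a handful of seeds the test has very few degrees of freedom, so the threshold is high and the per-seed numbers themselves are worth reporting alongside the verdict.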
minor comments (2)
  1. Introduction: Add references to prior work on verbal feedback in RL or LLM-based human simulation to better contextualize the contribution beyond scalar-reward domains.
  2. Results presentation: Include error bars, confidence intervals, or variance measures in any tables reporting the 36% improvement and per-task comparisons to support the average claim.
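
One way to attach an interval to an average over tasks is a percentile bootstrap. The sketch below assumes a list of per-task relative improvements; the values are hypothetical placeholders, and the function name and defaults are illustrative choices rather than anything from the paper.

```python
import random
from statistics import mean
from typing import List, Tuple


def bootstrap_ci(values: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float]:
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)  # fixed seed keeps the interval reproducible
    means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical per-task relative improvements (illustrative, not from the paper).
gains = [0.25, 0.40, 0.31, 0.15, 0.52, 0.28, 0.60, 0.10, 0.45, 0.33]
low, high = bootstrap_ci(gains)
```

With only 10 tasks the interval will be wide, which is itself useful context for a single headline average.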

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We have revised the manuscript to provide greater transparency on experimental procedures and benchmark construction while preserving the core contributions of DITTO and the SOUL benchmark.

Point-by-point responses
  1. Referee: Abstract and Experiments section: The abstract states clear performance numbers (36% average improvement, exceeding GPT-5.4 on 6/10 SOUL tasks) but provides no details on experimental controls, statistical significance, feedback quality, or potential confounds. This leaves the central claim without visible supporting derivation or data handling steps, as highlighted by the low soundness rating. Please expand on rollout generation, GRPO optimization details, number of runs, and baseline controls.

    Authors: We agree that the abstract is high-level and that the Experiments section would benefit from a consolidated summary of controls. Rollout generation is described in Section 3: after each base-policy rollout we sample verbal feedback and condition a second rollout on it. GRPO is applied to the joint objective over both rollouts (detailed in Section 3.2 and Appendix B with the exact loss and KL coefficient). We ran each configuration three times with distinct random seeds and report mean and standard deviation; pairwise t-tests against the base model yield p < 0.05 on the aggregate metric. Baselines comprise the untuned base model, SFT on the same data, and direct comparison to GPT-5.4. Feedback quality was controlled by using a held-out human annotation pool and discarding low-consensus feedback. We will add a short paragraph to the abstract and a dedicated “Experimental Controls” subsection that enumerates these points. revision: yes

  2. Referee: SOUL benchmark description (likely §4): The evaluation risks overstated generalization because the 10 tasks (Theory of Mind, role play, etc.) and metrics may not separate training feedback distributions from test scenarios. Clarify whether held-out task splits, diverse external feedback sources, or independent human calibration of metrics are used, since the method jointly optimizes feedback-conditioned rollouts with the base policy.

    Authors: We share the concern about distribution shift. Within each of the six categories we partition scenarios into disjoint training and test sets before feedback collection; test scenarios therefore never appear in the verbal feedback used for optimization. Feedback is sourced from two external pools (crowd workers and a separate LLM) that are disjoint from any training data. Metric scores are obtained from a third, independent panel of human raters who see only the final outputs and are not involved in feedback generation. We will expand Section 4 to state these splits and sources explicitly and will add a short paragraph discussing why the joint optimization still permits generalization to the held-out test distribution. revision: partial
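
The partitioning described in the second response can be sketched as a deterministic, per-category split that is fixed before any feedback is collected. The scenario IDs, category names, and the hash-based assignment below are assumptions made for illustration; the rebuttal does not say how the split is implemented.

```python
import hashlib
from typing import Dict, List, Tuple


def split_scenarios(scenario_ids: List[str],
                    test_fraction: float = 0.2) -> Tuple[List[str], List[str]]:
    """Deterministically assign each scenario to train or test by hashing its ID."""
    train: List[str] = []
    test: List[str] = []
    for sid in scenario_ids:
        bucket = int(hashlib.sha256(sid.encode("utf-8")).hexdigest(), 16) % 100
        (test if bucket < test_fraction * 100 else train).append(sid)
    return train, test


# Hypothetical scenario IDs grouped by category (names are illustrative only).
by_category: Dict[str, List[str]] = {
    "theory_of_mind": [f"tom-{i}" for i in range(50)],
    "role_play": [f"rp-{i}" for i in range(50)],
}

splits = {cat: split_scenarios(ids) for cat, ids in by_category.items()}

# Feedback would be collected only for training scenarios; this check guards
# against a test scenario leaking into that pool.
for train_ids, test_ids in splits.values():
    assert set(train_ids).isdisjoint(test_ids)
```

Hashing the ID rather than shuffling means a scenario keeps its assignment even if the dataset is later extended, which makes the "split before feedback collection" guarantee easier to audit.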

Circularity Check

0 steps flagged

No significant circularity; empirical claims independent of inputs

full rationale

The paper describes DITTO as an RL method using verbal feedback to generate improved rollouts then jointly optimizing via GRPO, and introduces the SOUL benchmark spanning 10 tasks. No equations, derivations, or self-referential definitions appear that reduce the reported 36% improvement or GPT-5.4 outperformance to a fitted parameter or construction from the training feedback itself. The method is presented as distilling guidance into the base policy for test-time use without feedback, and results are framed as empirical measurements on the benchmark. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling are evident in the text; the central claims rest on external benchmark evaluation rather than reducing to the method's own inputs by definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

Only the abstract is available, so the ledger is limited to high-level assumptions visible in the summary; no free parameters, formal axioms, or invented entities with independent evidence are detailed.

axioms (1)
  • domain assumption: Verbal feedback from unspecified sources can be reliably converted into improved rollouts that distill into a better base policy.
    This premise underpins the GRPO optimization step and the claim of test-time feedback-free performance.
invented entities (2)
  • DITTO (no independent evidence)
    purpose: RL-trained model that incorporates verbal feedback for human simulation.
    New model name and training procedure introduced in the abstract.
  • SOUL (no independent evidence)
    purpose: Unified benchmark and data suite for evaluating human-like behavior across 10 tasks.
    New benchmark introduced to measure the method's effectiveness.

pith-pipeline@v0.9.0 · 5842 in / 1429 out tokens · 46493 ms · 2026-05-21T07:17:47.076691+00:00 · methodology


Reference graph

Works this paper leans on

152 extracted references · 152 canonical work pages · 19 internal anchors

  1. [1]

    Asking clarifying questions in open-domain information-seeking conversations

    Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W Bruce Croft. Asking clarifying questions in open-domain information-seeking conversations. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.\ 475--484, 2019

  2. [2]

    Building and evaluating open-domain dialogue corpora with clarifying questions

    Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. Building and evaluating open-domain dialogue corpora with clarifying questions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 4473--4484, 2021

  3. [3]

    Simuser: Simulating user behavior with large language models for recommender system evaluation

    Nicolas Bougie and Narimasa Watanabe. Simuser: Simulating user behavior with large language models for recommender system evaluation. In Annual Meeting of the Association for Computational Linguistics, 2025. URL https://aclanthology.org/2025.acl-industry.5/

  4. [6]

    Simulatorarena: Are user simulators reliable proxies for multi-turn evaluation of AI assistants? In Conference on Empirical Methods in Natural Language Processing, 2025

    Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, and Jianfeng Gao. Simulatorarena: Are user simulators reliable proxies for multi-turn evaluation of AI assistants? In Conference on Empirical Methods in Natural Language Processing, 2025. URL https://arxiv.org/abs/2510.05444

  5. [7]

    TwinVoice : A multi-dimensional benchmark towards digital twins via LLM persona simulation

    Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, and Yiqun Liu. TwinVoice : A multi-dimensional benchmark towards digital twins via LLM persona simulation. arXiv preprint arXiv:2510.25536, 2025. URL https://arxiv.org/abs/2510.25536

  6. [8]

    HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing

    Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, and Yanghua Xiao. Her: Human-like reasoning and reinforcement learning for llm role-playing. ArXiv, abs/2601.21459, 2026

  7. [9]

    MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness

    Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, and Anil Babu Ankisettipalli. MirrorBench : A benchmark to evaluate conversational user-proxy agents for human-likeness. arXiv preprint arXiv:2601.08118, 2026. URL https://arxiv.org/abs/2601.08118

  8. [10]

    HI-TOM : A benchmark for evaluating higher-order theory of mind reasoning in large language models

    Yinghui He, Yufan Wu, Yilin Jia, Rada Mihalcea, Yulong Chen, and Naihao Deng. HI-TOM : A benchmark for evaluating higher-order theory of mind reasoning in large language models. arXiv preprint arXiv:2310.16755, 2024. URL https://arxiv.org/abs/2310.16755

  9. [14]

    MMToM-QA : Multimodal theory of mind question answering

    Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B Tenenbaum, and Tianmin Shu. MMToM-QA : Multimodal theory of mind question answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2401.08743

  10. [15]

    FANToM : A benchmark for stress-testing machine theory of mind in interactions

    Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. FANToM : A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://arxiv.org/abs/2310.15421

  11. [16]

    o pf, Yannic Kilcher, Dimitri von R \

    Andreas K \"o pf, Yannic Kilcher, Dimitri von R \"u tte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Rich \'a rd Nagyfi, et al. Open A ssistant conversations -- democratizing large language model alignment. In Advances in Neural Information Processing Systems, 2023

  12. [18]

    Chain of hindsight aligns language models with feedback

    Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. Chain of hindsight aligns language models with feedback. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2302.02676

  13. [19]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sy...

  14. [21]

    Machines and mindlessness: Social responses to computers

    Clifford Nass and Youngme Moon. Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56 0 (1): 0 81--103, 2000

  15. [22]

    Evaluating Theory of Mind in Question Answering

    Aida Nematzadeh, Kaylee Burns, Erin Grant, Alison Gopnik, and Thomas L. Griffiths. Evaluating theory of mind in question answering, 2018. URL https://arxiv.org/abs/1808.09352

  16. [23]

    To model human linguistic prediction, make LLMs less superhuman

    Byung-Doh Oh and Tal Linzen. To model human linguistic prediction, make LLMs less superhuman. arXiv preprint arXiv:2510.05141, 2025. URL https://arxiv.org/abs/2510.05141

  17. [24]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 2022. URL https://arxiv.org/abs/2203.02155

  18. [25]

    Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning

    Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. ArXiv, abs/1910.00177, 2019

  19. [26]

    Direct Preference Optimization: Your Language Model is Secretly a Reward Model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2023. URL https://arxiv.org/abs/2305.18290

  20. [27]

    Bases: Large-scale web search user simulation with large language model based agents

    Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Bases: Large-scale web search user simulation with large language model based agents. In Findings of the Association for Computational Linguistics: EMNLP, 2024. URL https://arxiv.org/abs/2402.17505

  21. [28]

    Learning to make MISTAKEs : Modeling incorrect student thinking and key errors

    Alexis Ross and Jacob Andreas. Learning to make MISTAKEs : Modeling incorrect student thinking and key errors. arXiv preprint arXiv:2510.11502, 2025. URL https://arxiv.org/abs/2510.11502

  22. [29]

    Personagym: Evaluating persona agents and llms

    Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, and Vishvak Murahari. Personagym: Evaluating persona agents and llms. In Findings of the Association for Computational Linguistics: EMNLP, 2025. URL https://arxiv.org/abs/2407.18416

  23. [30]

    A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies

    Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review, 21 0 (2): 0 97--126, 2006. URL https://doi.org/10.1017/S0269888906000944

  24. [31]

    Training language models with language feedback at scale

    J \'e r \'e my Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback at scale. Transactions on Machine Learning Research, 2023. URL https://arxiv.org/abs/2303.16755

  25. [32]

    Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments, 2024

    Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments, 2024. URL https://arxiv.org/abs/2405.07960

  26. [34]

    Reliable llm-based user simulator for task-oriented dialogue systems

    Ivan Sekulic, Silvia Terragni, Victor Guimaraes, Nghia Khau, Bruna Guedes, Modestas Filipavicius, Andre Ferreira Manso, and Roland Mathis. Reliable llm-based user simulator for task-oriented dialogue systems. In Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT), 2024. URL https://arxiv.org/abs/2402.13374

  27. [37]

    Self-Distillation Enables Continual Learning

    Idan Shenfeld, Mehul Damani, Jonas H \"u botter, and Pulkit Agrawal. Self-distillation enables continual learning. ArXiv, abs/2601.19897, 2026

  28. [39]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2303.11366

  29. [41]

    Simulating user satisfaction for the evaluation of task-oriented dialogue systems

    Weiwei Sun, Shuo Zhang, Krisztian Balog, Zhaochun Ren, Pengjie Ren, Zhumin Chen, and Maarten de Rijke. Simulating user satisfaction for the evaluation of task-oriented dialogue systems. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021. URL https://dl.acm.org/doi/10.1145/3404835.3463241

  30. [42]

    de Rijke, and Zhaochun Ren

    Weiwei Sun, Shuyu Guo, Shuo Zhang, Pengjie Ren, Zhumin Chen, M. de Rijke, and Zhaochun Ren. Metaphorical user simulators for evaluating task-oriented dialogue systems. ACM Transactions on Information Systems, 42: 0 1 -- 29, 2022

  31. [44]

    Do llms exhibit human-like response biases? a case study in survey design

    Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkar, and Graham Neubig. Do llms exhibit human-like response biases? a case study in survey design. Transactions of the Association for Computational Linguistics, 12: 0 1011--1026, 2024

  32. [45]

    Becoming Human: A Theory of Ontogeny

    Michael Tomasello. Becoming Human: A Theory of Ontogeny. Harvard University Press, Cambridge, MA, 2019. ISBN 9780674248281

  33. [46]

    A new learning paradigm: Learning using privileged information

    Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22 0 (5-6): 0 544--557, 2009. URL https://doi.org/10.1016/j.neunet.2009.06.042

  34. [47]

    Learning using privileged information: similarity control and knowledge transfer

    Vladimir Naumovich Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. J. Mach. Learn. Res., 16: 0 2023--2049, 2015

  35. [48]

    User behavior simulation with large language model-based agents

    Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji rong Wen. User behavior simulation with large language model-based agents. ACM Transactions on Information Systems, 2025. URL https://dl.acm.org/doi/10.1145/3708985

  36. [49]

    Coser: A comprehensive literary dataset and framework for training and evaluating llm role-playing and persona simulation, 2026

    Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, and Yanghua Xiao. Coser: A comprehensive literary dataset and framework for training and evaluating llm role-playing and persona simulation, 2026. URL https://arxiv.org/abs/2502.09082

  37. [50]

    Humanlm: Simulating users with state alignment beats response imitation, 2026

    Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, and James Zou. Humanlm: Simulating users with state alignment beats response imitation, 2026. URL https://arxiv.org/abs/2603.03303

  38. [51]

    Smith, Mari Ostendorf, and Hannaneh Hajishirzi

    Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2306.01693

  39. [52]

    Character is destiny: Can role-playing language agents make persona-driven decisions? 2024 a

    Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao. Character is destiny: Can role-playing language agents make persona-driven decisions? 2024 a

  40. [54]

    On-Policy Context Distillation for Language Models

    Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. ArXiv, abs/2602.12275, 2026

  41. [56]

    Self-Rewarding Language Models

    Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2401.10020

  42. [57]

    Usimagent: Large language models for simulating search users

    Erhan Zhang, Xingzhu Wang, Peiyuan Gong, Yankai Lin, and Jiaxin Mao. Usimagent: Large language models for simulating search users. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024. URL https://arxiv.org/abs/2403.09142

  43. [59]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2306.05685

  44. [60]

    SOTOPIA: Interactive evaluation for social intelligence in language agents

    Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. SOTOPIA: Interactive evaluation for social intelligence in language agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mM7VurbA4r

  45. [61]

    HAICOSYSTEM: An ecosystem for sandboxing safety risks in human-ai interactions

    Xuhui Zhou, Hyunwoo Kim, Faeze Brahman, Liwei Jiang, Hao Zhu, Ximing Lu, Frank F. Xu, Bill Yuchen Lin, Yejin Choi, Niloofar Mireshghallah, Ronan Le Bras, and Maarten Sap. HAICOSYSTEM: An ecosystem for sandboxing safety risks in human-ai interactions. In Second Conference on Language Modeling, 2025. URL https://arxiv.org/abs/2409.16427

  46. [62]

    Social world models, 2026a

    Xuhui Zhou, Jiarui Liu, Akhila Yerukola, Hyunwoo Kim, and Maarten Sap. Social world models, 2026a. URL https://arxiv.org/abs/2509.00559

  47. [63]

    Mind the sim2real gap in user simulation for agentic tasks, 2026b

    Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, and Maarten Sap. Mind the sim2real gap in user simulation for agentic tasks, 2026b. URL https://arxiv.org/abs/2603.11245

  48. [64]

    Fine-Grained Human Feedback Gives Better Rewards for Language Model Training

    Fine-Grained Human Feedback Gives Better Rewards for Language Model Training. Advances in Neural Information Processing Systems.

  49. [65]

    Social World Models

    Social World Models. 2026.

  50. [66]

    Sotopia-RL: Reward Design for Social Intelligence

    Sotopia-RL: Reward Design for Social Intelligence. arXiv preprint arXiv:2508.03905.

  51. [67]

    Mind the Sim2Real Gap in User Simulation for Agentic Tasks

    Mind the Sim2Real Gap in User Simulation for Agentic Tasks. 2026.

  52. [68]

    Training Proactive and Personalized LLM Agents

    Training Proactive and Personalized LLM Agents. arXiv preprint arXiv:2511.02208.

  53. [69]

    Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond)

    Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond). arXiv preprint arXiv:2510.22954.

  54. [70]

    To Model Human Linguistic Prediction, Make

    Byung-Doh Oh and Tal Linzen. To Model Human Linguistic Prediction, Make. 2025.

  55. [71]

    Becoming Human: A Theory of Ontogeny

    Becoming Human: A Theory of Ontogeny. 2019.

  56. [72]

    A New Learning Paradigm: Learning Using Privileged Information

    A New Learning Paradigm: Learning Using Privileged Information. Neural Networks, 2009.

  57. [73]

    Training Language Models with Language Feedback at Scale

    Training Language Models with Language Feedback at Scale. Transactions on Machine Learning Research.

  58. [74]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Reflexion: Language Agents with Verbal Reinforcement Learning. Advances in Neural Information Processing Systems.

  59. [75]

    Self-Refine: Iterative Refinement with Self-Feedback

    Self-Refine: Iterative Refinement with Self-Feedback. Advances in Neural Information Processing Systems.

  60. [76]

    Chain of Hindsight Aligns Language Models with Feedback

    Chain of Hindsight Aligns Language Models with Feedback. International Conference on Learning Representations.

  61. [77]

    Self-Rewarding Language Models

    Self-Rewarding Language Models. International Conference on Machine Learning.

  62. [78]

    Expanding the Capabilities of Reinforcement Learning via Text Feedback

    Expanding the Capabilities of Reinforcement Learning via Text Feedback. arXiv preprint arXiv:2602.02482.

  63. [79]

    Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback

    Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback. arXiv preprint arXiv:2506.03106.

  64. [80]

    Experiential Reinforcement Learning

    Experiential Reinforcement Learning. arXiv preprint arXiv:2602.13949.

  65. [81]

    MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning

    MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning. Advances in Neural Information Processing Systems.

  66. [82]

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains

    $\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains. arXiv preprint arXiv:2406.12045.

  67. [83]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    WebArena: A Realistic Web Environment for Building Autonomous Agents. International Conference on Learning Representations.

  68. [84]

    GAIA: A Benchmark for General AI Assistants

    GAIA: A Benchmark for General AI Assistants. International Conference on Learning Representations.

  69. [85]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues? International Conference on Learning Representations.

  70. [86]

    Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World

    Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World. IEEE/RSJ International Conference on Intelligent Robots and Systems.

  71. [87]

    Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey

    Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey. IEEE Symposium Series on Computational Intelligence.

  72. [88]

    A Survey of Statistical User Simulation Techniques for Reinforcement-Learning of Dialogue Management Strategies

    A Survey of Statistical User Simulation Techniques for Reinforcement-Learning of Dialogue Management Strategies. The Knowledge Engineering Review, 2006.

  73. [89]

    A User Simulator for Task-Completion Dialogues

    A User Simulator for Task-Completion Dialogues. arXiv preprint arXiv:1612.05688.

  74. [90]

    User Simulation with Large Language Models for Evaluating Task-Oriented Dialogue

    User Simulation with Large Language Models for Evaluating Task-Oriented Dialogue. arXiv preprint arXiv:2309.13233.

  75. [91]

    Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems

    Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems. Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT).

  76. [92]

    PersonaGym: Evaluating Persona Agents and LLMs

    PersonaGym: Evaluating Persona Agents and LLMs. Findings of the Association for Computational Linguistics: EMNLP.

  77. [93]

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena

    Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena. Advances in Neural Information Processing Systems, 2023.

  78. [94]

    Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks

    Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks. Proceedings of the 30th International Conference on Intelligent User Interfaces.

  79. [95]

    All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text

    All That's 'Human' Is Not Gold: Evaluating Human Evaluation of Generated Text. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics.

  80. [96]

    Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations

    Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations. arXiv preprint arXiv:2601.17087.

Showing first 80 references.