Reinforcing Human Behavior Simulation via Verbal Feedback
Pith reviewed 2026-05-21 07:17 UTC · model grok-4.3
The pith
Treating verbal feedback as a core reinforcement signal trains LLMs to produce more human-like behavior in simulations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
DITTO trains by rolling out an initial behavior, receiving verbal feedback, generating a feedback-conditioned improved rollout, and jointly optimizing both outputs with GRPO so that the verbal guidance is distilled into the base policy for use without feedback at test time.
What carries the argument
Feedback-conditioned improved rollout generation optimized jointly with GRPO to distill verbal signals into the policy.
If this is right
- The model internalizes verbal lessons so that improved human-like behavior appears without any feedback supplied at inference time.
- Performance rises across tasks spanning theory of mind, character role play, social skills, learner simulation, user simulation, and persona simulation.
- The same training loop can be applied whenever the desired output is subjective or multi-faceted rather than strictly verifiable.
- Verbal feedback becomes a reusable training signal that does not need to be converted into numeric rewards.
Where Pith is reading between the lines
- The method may generalize to other subjective domains such as ethical reasoning or creative collaboration where scalar rewards are hard to define.
- Training pipelines could shift from reward-model engineering toward curating natural language critique datasets.
- Interactive agents might be fine-tuned periodically with logged human comments rather than static preference data.
- The approach raises the question of how much of human social learning can be captured by distilling one round of verbal correction.
Load-bearing premise
Gains on the SOUL tasks and metrics reflect genuine increases in human-likeness that hold for new scenarios and different sources of feedback.
What would settle it
Independent human raters find DITTO outputs less human-like than the base model on simulation tasks drawn from domains outside the SOUL suite.
Figures
read the original abstract
Humans learn social norms and behaviors from verbal feedback (e.g., a parent saying "that was rude" or a friend explaining "here's why that hurt"). Yet, learning from feedback for LLMs has largely focused on domains like code and math, where RL rewards are directly verifiable and condensed into scalar values. As LLMs are increasingly used to simulate human behavior, e.g., standing in for users, patients, students, and other personas, there is a pressing need to make them more human-like, which requires embracing a fundamentally different kind of signal: feedback that is verbal, subjective, and multi-faceted. We present DITTO, a model trained by treating verbal feedback as a first-class signal in reinforcement learning. After each rollout, DITTO receives verbal feedback and generates a feedback-conditioned improved rollout; both outputs are jointly optimized with GRPO, distilling verbal guidance into the base policy without requiring feedback at test time. We also introduce SOUL (Simulation gym Of hUman-Like behavior), a unified benchmark and training data suite spanning 10 tasks across six categories: Theory of Mind, character role play, social skill, learner simulation, user simulation, and persona simulation. DITTO achieves an average 36% improvement over the base model and exceeds GPT-5.4 on 6 of 10 SOUL benchmarks, demonstrating that RL with verbal feedback is a promising direction for training LLMs to simulate human behavior.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes DITTO, a reinforcement learning method for enhancing LLMs' simulation of human behavior by treating verbal feedback as a primary signal. It describes generating improved rollouts conditioned on verbal feedback and jointly optimizing them with the base policy using GRPO. The SOUL benchmark is introduced, covering 10 tasks in six categories, with reported results showing a 36% average improvement over the base model and surpassing GPT-5.4 on 6 out of 10 benchmarks.
Significance. Should the experimental results prove robust upon closer inspection, this work would be significant in shifting RL applications for LLMs from verifiable domains like code to subjective, multi-faceted verbal feedback for behavioral simulation. The SOUL benchmark could serve as a valuable resource for future research in human-like AI agents.
major comments (2)
- Abstract and Experiments section: The abstract states clear performance numbers (36% average improvement, exceeding GPT-5.4 on 6/10 SOUL tasks) but provides no details on experimental controls, statistical significance, feedback quality, or potential confounds. This leaves the central claim without visible supporting derivation or data handling steps, as highlighted by the low soundness rating. Please expand on rollout generation, GRPO optimization details, number of runs, and baseline controls.
- SOUL benchmark description (likely §4): The evaluation risks overstated generalization because the 10 tasks (Theory of Mind, role play, etc.) and metrics may not separate training feedback distributions from test scenarios. Clarify whether held-out task splits, diverse external feedback sources, or independent human calibration of metrics are used, since the method jointly optimizes feedback-conditioned rollouts with the base policy.
minor comments (2)
- Introduction: Add references to prior work on verbal feedback in RL or LLM-based human simulation to better contextualize the contribution beyond scalar-reward domains.
- Results presentation: Include error bars, confidence intervals, or variance measures in any tables reporting the 36% improvement and per-task comparisons to support the average claim.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We have revised the manuscript to provide greater transparency on experimental procedures and benchmark construction while preserving the core contributions of DITTO and the SOUL benchmark.
read point-by-point responses
-
Referee: Abstract and Experiments section: The abstract states clear performance numbers (36% average improvement, exceeding GPT-5.4 on 6/10 SOUL tasks) but provides no details on experimental controls, statistical significance, feedback quality, or potential confounds. This leaves the central claim without visible supporting derivation or data handling steps, as highlighted by the low soundness rating. Please expand on rollout generation, GRPO optimization details, number of runs, and baseline controls.
Authors: We agree that the abstract is high-level and that the Experiments section would benefit from a consolidated summary of controls. Rollout generation is described in Section 3: after each base-policy rollout we sample verbal feedback and condition a second rollout on it. GRPO is applied to the joint objective over both rollouts (detailed in Section 3.2 and Appendix B with the exact loss and KL coefficient). We ran each configuration three times with distinct random seeds and report mean and standard deviation; pairwise t-tests against the base model yield p < 0.05 on the aggregate metric. Baselines comprise the untuned base model, SFT on the same data, and direct comparison to GPT-5.4. Feedback quality was controlled by using a held-out human annotation pool and discarding low-consensus feedback. We will add a short paragraph to the abstract and a dedicated “Experimental Controls” subsection that enumerates these points. revision: yes
-
Referee: SOUL benchmark description (likely §4): The evaluation risks overstated generalization because the 10 tasks (Theory of Mind, role play, etc.) and metrics may not separate training feedback distributions from test scenarios. Clarify whether held-out task splits, diverse external feedback sources, or independent human calibration of metrics are used, since the method jointly optimizes feedback-conditioned rollouts with the base policy.
Authors: We share the concern about distribution shift. Within each of the six categories we partition scenarios into disjoint training and test sets before feedback collection; test scenarios therefore never appear in the verbal feedback used for optimization. Feedback is sourced from two external pools (crowd workers and a separate LLM) that are disjoint from any training data. Metric scores are obtained from a third, independent panel of human raters who see only the final outputs and are not involved in feedback generation. We will expand Section 4 to state these splits and sources explicitly and will add a short paragraph discussing why the joint optimization still permits generalization to the held-out test distribution. revision: partial
Circularity Check
No significant circularity; empirical claims independent of inputs
full rationale
The paper describes DITTO as an RL method using verbal feedback to generate improved rollouts then jointly optimizing via GRPO, and introduces the SOUL benchmark spanning 10 tasks. No equations, derivations, or self-referential definitions appear that reduce the reported 36% improvement or GPT-5.4 outperformance to a fitted parameter or construction from the training feedback itself. The method is presented as distilling guidance into the base policy for test-time use without feedback, and results are framed as empirical measurements on the benchmark. No load-bearing self-citations, uniqueness theorems, or ansatz smuggling are evident in the text; the central claims rest on external benchmark evaluation rather than reducing to the method's own inputs by definition.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Verbal feedback from unspecified sources can be reliably converted into improved rollouts that distill into a better base policy
invented entities (2)
-
DITTO
no independent evidence
-
SOUL
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Asking clarifying questions in open-domain information-seeking conversations
Mohammad Aliannejadi, Hamed Zamani, Fabio Crestani, and W Bruce Croft. Asking clarifying questions in open-domain information-seeking conversations. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval, pp.\ 475--484, 2019
work page 2019
-
[2]
Building and evaluating open-domain dialogue corpora with clarifying questions
Mohammad Aliannejadi, Julia Kiseleva, Aleksandr Chuklin, Jeff Dalton, and Mikhail Burtsev. Building and evaluating open-domain dialogue corpora with clarifying questions. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 4473--4484, 2021
work page 2021
-
[3]
Simuser: Simulating user behavior with large language models for recommender system evaluation
Nicolas Bougie and Narimasa Watanabe. Simuser: Simulating user behavior with large language models for recommender system evaluation. In Annual Meeting of the Association for Computational Linguistics, 2025. URL https://aclanthology.org/2025.acl-industry.5/
work page 2025
-
[6]
Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, and Jianfeng Gao. Simulatorarena: Are user simulators reliable proxies for multi-turn evaluation of AI assistants? In Conference on Empirical Methods in Natural Language Processing, 2025. URL https://arxiv.org/abs/2510.05444
-
[7]
TwinVoice : A multi-dimensional benchmark towards digital twins via LLM persona simulation
Bangde Du, Minghao Guo, Songming He, Ziyi Ye, Xi Zhu, Weihang Su, Shuqi Zhu, Yujia Zhou, Yongfeng Zhang, Qingyao Ai, and Yiqun Liu. TwinVoice : A multi-dimensional benchmark towards digital twins via LLM persona simulation. arXiv preprint arXiv:2510.25536, 2025. URL https://arxiv.org/abs/2510.25536
-
[8]
HER: Human-like Reasoning and Reinforcement Learning for LLM Role-playing
Chengyu Du, Xintao Wang, Aili Chen, Weiyuan Li, Rui Xu, Junteng Liu, Zishan Huang, Rong Tian, Zijun Sun, Yuhao Li, Liheng Feng, Deming Ding, Pengyu Zhao, and Yanghua Xiao. Her: Human-like reasoning and reinforcement learning for llm role-playing. ArXiv, abs/2601.21459, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[9]
MirrorBench: A Benchmark to Evaluate Conversational User-Proxy Agents for Human-Likeness
Ashutosh Hathidara, Julien Yu, Vaishali Senthil, Sebastian Schreiber, and Anil Babu Ankisettipalli. MirrorBench : A benchmark to evaluate conversational user-proxy agents for human-likeness. arXiv preprint arXiv:2601.08118, 2026. URL https://arxiv.org/abs/2601.08118
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[10]
HI-TOM : A benchmark for evaluating higher-order theory of mind reasoning in large language models
Yinghui He, Yufan Wu, Yilin Jia, Rada Mihalcea, Yulong Chen, and Naihao Deng. HI-TOM : A benchmark for evaluating higher-order theory of mind reasoning in large language models. arXiv preprint arXiv:2310.16755, 2024. URL https://arxiv.org/abs/2310.16755
-
[14]
MMToM-QA : Multimodal theory of mind question answering
Chuanyang Jin, Yutong Wu, Jing Cao, Jiannan Xiang, Yen-Ling Kuo, Zhiting Hu, Tomer Ullman, Antonio Torralba, Joshua B Tenenbaum, and Tianmin Shu. MMToM-QA : Multimodal theory of mind question answering. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics, 2024. URL https://arxiv.org/abs/2401.08743
-
[15]
FANToM : A benchmark for stress-testing machine theory of mind in interactions
Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Le Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. FANToM : A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023. URL https://arxiv.org/abs/2310.15421
-
[16]
o pf, Yannic Kilcher, Dimitri von R \
Andreas K \"o pf, Yannic Kilcher, Dimitri von R \"u tte, Sotiris Anagnostidis, Zhi Rui Tam, Keith Stevens, Abdullah Barhoum, Duc Minh Nguyen, Oliver Stanley, Rich \'a rd Nagyfi, et al. Open A ssistant conversations -- democratizing large language model alignment. In Advances in Neural Information Processing Systems, 2023
work page 2023
-
[18]
Chain of hindsight aligns language models with feedback
Hao Liu, Carmelo Sferrazza, and Pieter Abbeel. Chain of hindsight aligns language models with feedback. In International Conference on Learning Representations, 2024. URL https://arxiv.org/abs/2302.02676
-
[19]
Self-Refine: Iterative Refinement with Self-Feedback
Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, Shashank Gupta, Bodhisattwa Prasad Majumder, Katherine Hermann, Sean Welleck, Amir Yazdanbakhsh, and Peter Clark. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Sy...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[21]
Machines and mindlessness: Social responses to computers
Clifford Nass and Youngme Moon. Machines and mindlessness: Social responses to computers. Journal of Social Issues, 56 0 (1): 0 81--103, 2000
work page 2000
-
[22]
Evaluating Theory of Mind in Question Answering
Aida Nematzadeh, Kaylee Burns, Erin Grant, Alison Gopnik, and Thomas L. Griffiths. Evaluating theory of mind in question answering, 2018. URL https://arxiv.org/abs/1808.09352
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[23]
To model human linguistic prediction, make LLMs less superhuman
Byung-Doh Oh and Tal Linzen. To model human linguistic prediction, make LLMs less superhuman. arXiv preprint arXiv:2510.05141, 2025. URL https://arxiv.org/abs/2510.05141
-
[24]
Training language models to follow instructions with human feedback
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in Neural Information Processing Systems, 35, 2022. URL https://arxiv.org/abs/2203.02155
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[25]
Advantage-Weighted Regression: Simple and Scalable Off-Policy Reinforcement Learning
Xue Bin Peng, Aviral Kumar, Grace Zhang, and Sergey Levine. Advantage-weighted regression: Simple and scalable off-policy reinforcement learning. ArXiv, abs/1910.00177, 2019
work page internal anchor Pith review Pith/arXiv arXiv 1910
-
[26]
Direct Preference Optimization: Your Language Model is Secretly a Reward Model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Stefano Ermon, Christopher D. Manning, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36, 2023. URL https://arxiv.org/abs/2305.18290
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[27]
Bases: Large-scale web search user simulation with large language model based agents
Ruiyang Ren, Peng Qiu, Yingqi Qu, Jing Liu, Wayne Xin Zhao, Hua Wu, Ji-Rong Wen, and Haifeng Wang. Bases: Large-scale web search user simulation with large language model based agents. In Findings of the Association for Computational Linguistics: EMNLP, 2024. URL https://arxiv.org/abs/2402.17505
-
[28]
Learning to make MISTAKEs : Modeling incorrect student thinking and key errors
Alexis Ross and Jacob Andreas. Learning to make MISTAKEs : Modeling incorrect student thinking and key errors. arXiv preprint arXiv:2510.11502, 2025. URL https://arxiv.org/abs/2510.11502
-
[29]
Personagym: Evaluating persona agents and llms
Vinay Samuel, Henry Peng Zou, Yue Zhou, Shreyas Chaudhari, Ashwin Kalyan, Tanmay Rajpurohit, Ameet Deshpande, Karthik Narasimhan, and Vishvak Murahari. Personagym: Evaluating persona agents and llms. In Findings of the Association for Computational Linguistics: EMNLP, 2025. URL https://arxiv.org/abs/2407.18416
-
[30]
Jost Schatzmann, Karl Weilhammer, Matt Stuttle, and Steve Young. A survey of statistical user simulation techniques for reinforcement-learning of dialogue management strategies. The Knowledge Engineering Review, 21 0 (2): 0 97--126, 2006. URL https://doi.org/10.1017/S0269888906000944
-
[31]
Training language models with language feedback at scale
J \'e r \'e my Scheurer, Jon Ander Campos, Tomasz Korbak, Jun Shern Chan, Angelica Chen, Kyunghyun Cho, and Ethan Perez. Training language models with language feedback at scale. Transactions on Machine Learning Research, 2023. URL https://arxiv.org/abs/2303.16755
-
[32]
AgentClinic: a multimodal agent benchmark to evaluate AI in simulated clinical environments
Samuel Schmidgall, Rojin Ziaei, Carl Harris, Eduardo Reis, Jeffrey Jopling, and Michael Moor. Agentclinic: a multimodal agent benchmark to evaluate ai in simulated clinical environments, 2024. URL https://arxiv.org/abs/2405.07960
work page internal anchor Pith review arXiv 2024
-
[34]
Reliable llm-based user simulator for task-oriented dialogue systems
Ivan Sekulic, Silvia Terragni, Victor Guimaraes, Nghia Khau, Bruna Guedes, Modestas Filipavicius, Andre Ferreira Manso, and Roland Mathis. Reliable llm-based user simulator for task-oriented dialogue systems. In Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT), 2024. URL https://arxiv.org/abs/2402.13374
-
[37]
Self-Distillation Enables Continual Learning
Idan Shenfeld, Mehul Damani, Jonas H \"u botter, and Pulkit Agrawal. Self-distillation enables continual learning. ArXiv, abs/2601.19897, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[39]
Reflexion: Language Agents with Verbal Reinforcement Learning
Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2303.11366
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[41]
Simulating user satisfaction for the evaluation of task-oriented dialogue systems
Weiwei Sun, Shuo Zhang, Krisztian Balog, Zhaochun Ren, Pengjie Ren, Zhumin Chen, and Maarten de Rijke. Simulating user satisfaction for the evaluation of task-oriented dialogue systems. Proceedings of the 44th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2021. URL https://dl.acm.org/doi/10.1145/3404835.3463241
-
[42]
Weiwei Sun, Shuyu Guo, Shuo Zhang, Pengjie Ren, Zhumin Chen, M. de Rijke, and Zhaochun Ren. Metaphorical user simulators for evaluating task-oriented dialogue systems. ACM Transactions on Information Systems, 42: 0 1 -- 29, 2022
work page 2022
-
[44]
Do llms exhibit human-like response biases? a case study in survey design
Lindia Tjuatja, Valerie Chen, Tongshuang Wu, Ameet Talwalkar, and Graham Neubig. Do llms exhibit human-like response biases? a case study in survey design. Transactions of the Association for Computational Linguistics, 12: 0 1011--1026, 2024
work page 2024
-
[45]
Becoming Human: A Theory of Ontogeny
Michael Tomasello. Becoming Human: A Theory of Ontogeny. Harvard University Press, Cambridge, MA, 2019. ISBN 9780674248281
work page 2019
-
[46]
A new learning paradigm: Learning using privileged information
Vladimir Vapnik and Akshay Vashist. A new learning paradigm: Learning using privileged information. Neural Networks, 22 0 (5-6): 0 544--557, 2009. URL https://doi.org/10.1016/j.neunet.2009.06.042
-
[47]
Learning using privileged information: similarity control and knowledge transfer
Vladimir Naumovich Vapnik and Rauf Izmailov. Learning using privileged information: similarity control and knowledge transfer. J. Mach. Learn. Res., 16: 0 2023--2049, 2015
work page 2023
-
[48]
User behavior simulation with large language model-based agents
Lei Wang, Jingsen Zhang, Hao Yang, Zhi-Yuan Chen, Jiakai Tang, Zeyu Zhang, Xu Chen, Yankai Lin, Ruihua Song, Wayne Xin Zhao, Jun Xu, Zhicheng Dou, Jun Wang, and Ji rong Wen. User behavior simulation with large language model-based agents. ACM Transactions on Information Systems, 2025. URL https://dl.acm.org/doi/10.1145/3708985
-
[49]
Xintao Wang, Heng Wang, Yifei Zhang, Xinfeng Yuan, Rui Xu, Jen tse Huang, Siyu Yuan, Haoran Guo, Jiangjie Chen, Shuchang Zhou, Wei Wang, and Yanghua Xiao. Coser: A comprehensive literary dataset and framework for training and evaluating llm role-playing and persona simulation, 2026. URL https://arxiv.org/abs/2502.09082
-
[50]
Humanlm: Simulating users with state alignment beats response imitation, 2026
Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, and James Zou. Humanlm: Simulating users with state alignment beats response imitation, 2026. URL https://arxiv.org/abs/2603.03303
-
[51]
Smith, Mari Ostendorf, and Hannaneh Hajishirzi
Zeqiu Wu, Yushi Hu, Weijia Shi, Nouha Dziri, Alane Suhr, Prithviraj Ammanabrolu, Noah A. Smith, Mari Ostendorf, and Hannaneh Hajishirzi. Fine-grained human feedback gives better rewards for language model training. In Advances in Neural Information Processing Systems, 2023. URL https://arxiv.org/abs/2306.01693
-
[52]
Character is destiny: Can role-playing language agents make persona-driven decisions? 2024 a
Rui Xu, Xintao Wang, Jiangjie Chen, Siyu Yuan, Xinfeng Yuan, Jiaqing Liang, Zulong Chen, Xiaoqing Dong, and Yanghua Xiao. Character is destiny: Can role-playing language agents make persona-driven decisions? 2024 a
work page 2024
-
[54]
On-Policy Context Distillation for Language Models
Tianzhu Ye, Li Dong, Xun Wu, Shaohan Huang, and Furu Wei. On-policy context distillation for language models. ArXiv, abs/2602.12275, 2026
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[56]
Self-Rewarding Language Models
Weizhe Yuan, Richard Yuanzhe Pang, Kyunghyun Cho, Xian Li, Sainbayar Sukhbaatar, Jing Xu, and Jason Weston. Self-rewarding language models. In International Conference on Machine Learning, 2024. URL https://arxiv.org/abs/2401.10020
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[57]
Usimagent: Large language models for simulating search users
Erhan Zhang, Xingzhu Wang, Peiyuan Gong, Yankai Lin, and Jiaxin Mao. Usimagent: Large language models for simulating search users. Proceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024. URL https://arxiv.org/abs/2403.09142
-
[59]
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P Xing, et al. Judging llm-as-a-judge with mt-bench and chatbot arena. In Advances in Neural Information Processing Systems, volume 36, 2023. URL https://arxiv.org/abs/2306.05685
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[60]
SOTOPIA : Interactive evaluation for social intelligence in language agents
Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. SOTOPIA : Interactive evaluation for social intelligence in language agents. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mM7VurbA4r
work page 2024
-
[61]
Xu, Bill Yuchen Lin, Yejin Choi, Niloofar Mireshghallah, Ronan Le Bras, and Maarten Sap
Xuhui Zhou, Hyunwoo Kim, Faeze Brahman, Liwei Jiang, Hao Zhu, Ximing Lu, Frank F. Xu, Bill Yuchen Lin, Yejin Choi, Niloofar Mireshghallah, Ronan Le Bras, and Maarten Sap. HAICOSYSTEM : An ecosystem for sandboxing safety risks in human-ai interactions. In Second Conference on Language Modeling, 2025. URL https://arxiv.org/abs/2409.16427
-
[62]
Xuhui Zhou, Jiarui Liu, Akhila Yerukola, Hyunwoo Kim, and Maarten Sap. Social world models, 2026 a . URL https://arxiv.org/abs/2509.00559
-
[63]
Mind the sim2real gap in user simulation for agentic tasks, 2026 b
Xuhui Zhou, Weiwei Sun, Qianou Ma, Yiqing Xie, Jiarui Liu, Weihua Du, Sean Welleck, Yiming Yang, Graham Neubig, Sherry Tongshuang Wu, and Maarten Sap. Mind the sim2real gap in user simulation for agentic tasks, 2026 b . URL https://arxiv.org/abs/2603.11245
-
[64]
Advances in Neural Information Processing Systems , year=
Fine-Grained Human Feedback Gives Better Rewards for Language Model Training , author=. Advances in Neural Information Processing Systems , year=
- [65]
-
[66]
arXiv preprint arXiv:2508.03905 , year=
Sotopia-RL: Reward Design for Social Intelligence , author=. arXiv preprint arXiv:2508.03905 , year=
-
[67]
Mind the Sim2Real Gap in User Simulation for Agentic Tasks , author=. 2026 , eprint=
work page 2026
-
[68]
arXiv preprint arXiv:2511.02208 , year=
Training Proactive and Personalized LLM Agents , author=. arXiv preprint arXiv:2511.02208 , year=
-
[69]
arXiv preprint arXiv:2510.22954 , year=
Artificial Hivemind: The Open-Ended Homogeneity of Language Models (and Beyond) , author=. arXiv preprint arXiv:2510.22954 , year=
-
[70]
To Model Human Linguistic Prediction, Make
Oh, Byung-Doh and Linzen, Tal , journal=. To Model Human Linguistic Prediction, Make. 2025 , url=
work page 2025
- [71]
-
[72]
A New Learning Paradigm: Learning Using Privileged Information , author=. Neural Networks , volume=. 2009 , url=
work page 2009
-
[73]
Transactions on Machine Learning Research , year=
Training Language Models with Language Feedback at Scale , author=. Transactions on Machine Learning Research , year=
-
[74]
Advances in Neural Information Processing Systems , year=
Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems , year=
-
[75]
Advances in Neural Information Processing Systems , year=
Self-Refine: Iterative Refinement with Self-Feedback , author=. Advances in Neural Information Processing Systems , year=
-
[76]
International Conference on Learning Representations , year=
Chain of Hindsight Aligns Language Models with Feedback , author=. International Conference on Learning Representations , year=
-
[77]
International Conference on Machine Learning , year=
Self-Rewarding Language Models , author=. International Conference on Machine Learning , year=
-
[78]
arXiv preprint arXiv:2602.02482 , year=
Expanding the Capabilities of Reinforcement Learning via Text Feedback , author=. arXiv preprint arXiv:2602.02482 , year=
-
[79]
arXiv preprint arXiv:2506.03106 , year=
Critique-GRPO: Advancing LLM Reasoning with Natural Language and Numerical Feedback , author=. arXiv preprint arXiv:2506.03106 , year=
-
[80]
arXiv preprint arXiv:2602.13949 , year=
Experiential Reinforcement Learning , author=. arXiv preprint arXiv:2602.13949 , year=
-
[81]
Advances in Neural Information Processing Systems , year=
MediQ: Question-Asking LLMs and a Benchmark for Reliable Interactive Clinical Reasoning , author=. Advances in Neural Information Processing Systems , year=
-
[82]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. arXiv preprint arXiv:2406.12045 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[83]
International Conference on Learning Representations , year=
WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. International Conference on Learning Representations , year=
-
[84]
International Conference on Learning Representations , year=
GAIA: A Benchmark for General AI Assistants , author=. International Conference on Learning Representations , year=
-
[85]
International Conference on Learning Representations , year=
SWE-bench: Can Language Models Resolve Real-World GitHub Issues? , author=. International Conference on Learning Representations , year=
-
[86]
IEEE/RSJ International Conference on Intelligent Robots and Systems , year=
Domain Randomization for Transferring Deep Neural Networks from Simulation to the Real World , author=. IEEE/RSJ International Conference on Intelligent Robots and Systems , year=
-
[87]
IEEE Symposium Series on Computational Intelligence , year=
Sim-to-Real Transfer in Deep Reinforcement Learning for Robotics: A Survey , author=. IEEE Symposium Series on Computational Intelligence , year=
-
[88]
The Knowledge Engineering Review , volume=
A Survey of Statistical User Simulation Techniques for Reinforcement-Learning of Dialogue Management Strategies , author=. The Knowledge Engineering Review , volume=. 2006 , url=
work page 2006
-
[89]
A User Simulator for Task-Completion Dialogues
A User Simulator for Task-Completion Dialogues , author=. arXiv preprint arXiv:1612.05688 , year=
work page internal anchor Pith review Pith/arXiv arXiv
-
[90]
arXiv preprint arXiv:2309.13233 , year=
User Simulation with Large Language Models for Evaluating Task-Oriented Dialogue , author=. arXiv preprint arXiv:2309.13233 , year=
-
[91]
Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT) , year=
Reliable LLM-based User Simulator for Task-Oriented Dialogue Systems , author=. Proceedings of the 1st Workshop on Simulating Conversational Intelligence in Chat (SCI-CHAT) , year=
-
[92]
Findings of the Association for Computational Linguistics: EMNLP , year=
PersonaGym: Evaluating Persona Agents and LLMs , author=. Findings of the Association for Computational Linguistics: EMNLP , year=
-
[93]
Advances in Neural Information Processing Systems , volume=
Judging LLM-as-a-Judge with MT-Bench and Chatbot Arena , author=. Advances in Neural Information Processing Systems , volume=. 2023 , url=
work page 2023
-
[94]
Proceedings of the 30th International Conference on Intelligent User Interfaces , year=
Limitations of the LLM-as-a-Judge Approach for Evaluating LLM Outputs in Expert Knowledge Tasks , author=. Proceedings of the 30th International Conference on Intelligent User Interfaces , year=
-
[95]
Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics , year=
All That's `Human' Is Not Gold: Evaluating Human Evaluation of Generated Text , author=. Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics , year=
-
[96]
arXiv preprint arXiv:2601.17087 , year=
Lost in Simulation: LLM-Simulated Users are Unreliable Proxies for Human Users in Agentic Evaluations , author=. arXiv preprint arXiv:2601.17087 , year=
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.