Recognition: no theorem link
Quantifying the Utility of User Simulators for Building Collaborative LLM Assistants
Pith reviewed 2026-05-12 02:15 UTC · model grok-4.3
The pith
Training LLM assistants against fine-tuned user simulators produces 58% higher win rates with real humans than training against role-playing simulators.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Simulator quality is best quantified by its downstream utility: how well an LLM assistant trained with it performs against real humans. In the controlled RL setup the fine-tuned simulator on human data delivers statistically significant gains of 58% over the initial assistant and 57% over the role-play-trained assistant in pairwise win rates from 283 participants and on WildBench. Role-playing simulators remain inferior even after persona conditioning or model scaling, and assistants trained against them do not generalize when paired with other simulators at test time.
What carries the argument
Controlled reinforcement learning training of LLM assistants that varies only the user simulator, evaluated by real-human pairwise win rates and performance on the WildBench benchmark derived from actual conversations.
If this is right
- Persona conditioning and other realism tweaks on role-playing simulators improve trained assistants but do not match fine-tuned performance.
- Scaling simulator model size improves downstream assistant quality only for fine-tuned simulators, not role-playing ones.
- Assistants trained against role-playing simulators fail to generalize when tested with different simulators, unlike those trained on fine-tuned simulators.
- Grounding simulators in real human utterances is required to produce assistants that succeed with actual users.
Where Pith is reading between the lines
- Purely prompted role-play may systematically miss interaction patterns that fine-tuning on real data captures.
- The same downstream-utility test could be used to compare simulators for non-LLM agents or other collaborative tasks.
- Collecting and maintaining high-quality real conversation datasets may be more valuable than engineering better role-play prompts.
Load-bearing premise
The experiment fully isolates the simulator's contribution without confounding differences in RL training details or human-study biases.
What would settle it
A follow-up study in which an improved role-playing simulator produces assistants with win rates statistically indistinguishable from or higher than the fine-tuned simulator in a comparable 283-participant evaluation.
Figures
read the original abstract
User simulators are increasingly leveraged to build interactive AI assistants, yet how to measure the quality of these simulators remains an open question. In this work, we show how simulator quality can be quantified in terms of its downstream utility: how an LLM assistant trained with this user simulator performs in the wild when interacting with real humans. In a controlled experiment where only the user simulator varies, we train LLM assistants via reinforcement learning against a spectrum of simulators, from an LLM prompted to role-play a user to one fine-tuned on human utterances from WildChat. As evaluation, we measure pairwise win rates in a user study with 283 participants and on WildBench, a benchmark derived from real human--AI conversations. Training against the role-playing LLM yields an assistant statistically indistinguishable from the initial assistant in our user study (51% win rate), whereas training against the fine-tuned simulator yields significant gains (58% over the initial and 57% over the one trained against role-playing). Closer inspection reveals three further patterns: methods for making role-playing LLMs more realistic (e.g., persona conditioning) improve trained assistants but do not close the gap to the fine-tuned simulator; scaling the simulator's model size benefits the fine-tuned simulator but yields no gain for role-playing ones; and assistants trained against role-playing simulators fail to generalize when paired with other simulators at test time, while the one trained against fine-tuned simulator does. Together, these results argue for grounding user simulators in real human behavior and measuring their quality by their downstream effect on real users.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that user simulator quality for training collaborative LLM assistants is best quantified via downstream utility: the performance of RL-trained assistants when interacting with real humans. In a controlled experiment varying only the simulator (role-playing LLM vs. fine-tuned on WildChat human utterances), training against the fine-tuned simulator produces statistically significant gains (58% win rate over the initial assistant and 57% over the role-playing variant) in a 283-participant user study and on WildBench. Additional patterns—persona conditioning and scaling improve role-play simulators modestly but do not close the gap, while fine-tuned simulators generalize better across test simulators—are reported to support grounding simulators in real human data rather than pure role-play.
Significance. If the controlled conditions and statistical claims hold, the work provides a practical, outcome-based metric for evaluating user simulators that directly ties to real-user utility in collaborative settings. The 283-participant study and WildBench benchmark derived from actual conversations are notable strengths, as is the demonstration of scaling and generalization differences. This could shift evaluation practices away from proxy metrics toward downstream human interaction results.
major comments (2)
- Abstract: the central claim of statistically significant gains (58% and 57% win rates) from the fine-tuned simulator rests on a controlled RL experiment, yet the abstract provides no details on the RL procedure, reward model, training hyperparameters, or the exact statistical tests and error analysis used. This absence makes it impossible to verify whether the experiment truly isolates the simulator effect or whether post-hoc choices or small effect sizes influence the reported differences.
- Abstract: the weakest assumption—that the 283-participant study plus WildBench fully capture downstream utility without confounding factors in training or evaluation—is load-bearing for the recommendation to ground simulators in real data, but no information is given on participant recruitment, task distribution, or how the user study controls for variables such as conversation length or topic.
minor comments (1)
- Abstract: the phrasing 'statistically significant differences' and specific win-rate percentages would be clearer if accompanied by confidence intervals or p-values even in the abstract.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address the two major comments point by point below and will make the indicated revisions to strengthen the abstract.
read point-by-point responses
-
Referee: Abstract: the central claim of statistically significant gains (58% and 57% win rates) from the fine-tuned simulator rests on a controlled RL experiment, yet the abstract provides no details on the RL procedure, reward model, training hyperparameters, or the exact statistical tests and error analysis used. This absence makes it impossible to verify whether the experiment truly isolates the simulator effect or whether post-hoc choices or small effect sizes influence the reported differences.
Authors: We agree that the abstract omits these specifics due to space constraints. The full manuscript details the RL procedure (PPO with a Bradley-Terry reward model trained on human preference data), key hyperparameters, and statistical analysis (paired t-tests with multiple-comparison correction) in the Methods and Experiments sections. We will revise the abstract to include a brief clause summarizing the RL training protocol and statistical testing approach so that the isolation of the simulator variable is clearer to readers. revision: yes
-
Referee: Abstract: the weakest assumption—that the 283-participant study plus WildBench fully capture downstream utility without confounding factors in training or evaluation—is load-bearing for the recommendation to ground simulators in real data, but no information is given on participant recruitment, task distribution, or how the user study controls for variables such as conversation length or topic.
Authors: We acknowledge that these methodological details are essential for evaluating potential confounds. The manuscript specifies recruitment through a crowdsourcing platform with screening criteria, task distribution drawn from WildChat-derived collaborative scenarios, and controls including fixed turn limits and topic balancing; these appear in the User Study subsection. We will add a short phrase to the abstract noting the study scale and its basis in real human–AI conversations to better substantiate the downstream-utility claim. revision: yes
Circularity Check
No circularity: purely empirical comparison with independent evaluations
full rationale
The paper reports a controlled RL experiment that trains LLM assistants against different user simulators (role-play LLM vs. fine-tuned on WildChat) and measures downstream performance via a 283-participant user study and WildBench benchmark. The abstract contains no equations, derivations, fitted parameters, or self-citations. All claims rest on direct experimental contrasts rather than any reduction to inputs by construction. The reported patterns (persona conditioning, scaling, generalization) are observational results from the setup and do not invoke uniqueness theorems or ansatzes.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption The 283-participant user study and WildBench accurately reflect real-world performance differences attributable to the simulator.
Reference graph
Works this paper leans on
-
[1]
LLMs Get Lost In Multi-Turn Conversation
Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. Llms get lost in multi-turn conversation.arXiv preprint arXiv:2505.06120, 2025
work page internal anchor Pith review arXiv 2025
-
[2]
Assistancezero: Scalably solving assistance games.arXiv preprint arXiv:2504.07091, 2025
Cassidy Laidlaw, Eli Bronstein, Timothy Guo, Dylan Feng, Lukas Berglund, Justin Svegliato, Stuart Russell, and Anca Dragan. Assistancezero: Scalably solving assistance games.arXiv preprint arXiv:2504.07091, 2025
-
[3]
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference.arXiv preprint arXiv:2403.04132, 2024
work page internal anchor Pith review arXiv 2024
-
[4]
A survey on llm-based conversational user simulation
Bo Ni, Yu Wang, Leyao Wang, Branislav Kveton, Franck Dernoncourt, Yu Xia, Hongjie Chen, Reuben Luera, Samyadeep Basu, Subhojyoti Mukherjee, et al. A survey on llm-based conversational user simulation. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 4266–4301, 2026
work page 2026
-
[5]
Sim-to-real transfer of robotic control with dynamics randomization
Xue Bin Peng, Marcin Andrychowicz, Wojciech Zaremba, and Pieter Abbeel. Sim-to-real transfer of robotic control with dynamics randomization. InIEEE International Conference on Robotics and Automation (ICRA), pages 3803–3810, 2018
work page 2018
-
[6]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph O’Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1–22, 2023
work page 2023
-
[7]
$\tau$-bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains
Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. tau-bench: A benchmark for tool-agent-user interaction in real-world domains.arXiv preprint arXiv:2406.12045, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[8]
Ruosen Li, Ruochen Li, Barry Wang, and Xinya Du. Iqa-eval: Automatic evaluation of human-model interactive question answering.Advances in Neural Information Processing Systems, 37:109894–109921, 2024
work page 2024
-
[9]
Duetsim: Building user simulator with dual large language models for task-oriented dialogues
Xiang Luo, Zhiwen Tang, Jin Wang, and Xuejie Zhang. Duetsim: Building user simulator with dual large language models for task-oriented dialogues. InProceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024), pages 5414–5424, 2024
work page 2024
-
[10]
Regressing the relative future: Efficient policy optimization for multi-turn rlhf
Zhaolin Gao, Wenhao Zhan, Jonathan D Chang, Gokul Swamy, Kianté Brantley, Jason D Lee, and Wen Sun. Regressing the relative future: Efficient policy optimization for multi-turn rlhf. arXiv preprint arXiv:2410.04612, 2024
-
[11]
Lior Shani, Aviv Rosenberg, Asaf Cassel, Oran Lang, Daniele Calandriello, Avital Zipori, Hila Noga, Orgad Keller, Bilal Piot, Idan Szpektor, et al. Multi-turn reinforcement learning with preference human feedback.Advances in Neural Information Processing Systems, 37: 118953–118993, 2024
work page 2024
-
[12]
Modeling future conversation turns to teach llms to ask clarifying questions
Michael JQ Zhang, W Bradley Knox, and Eunsol Choi. Modeling future conversation turns to teach llms to ask clarifying questions.arXiv preprint arXiv:2410.13788, 2024
-
[13]
Platolm: Teaching llms in multi-round dialogue via a user simulator
Chuyi Kong, Yaxin Fan, Xiang Wan, Feng Jiang, and Benyou Wang. Platolm: Teaching llms in multi-round dialogue via a user simulator. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7841–7863, 2024
work page 2024
-
[14]
Collabllm: From passive responders to active collaborators
Shirley Wu, Michel Galley, Baolin Peng, Hao Cheng, Gavin Li, Yao Dou, Weixin Cai, James Zou, Jure Leskovec, and Jianfeng Gao. Collabllm: From passive responders to active collaborators.arXiv preprint arXiv:2502.00640, 2025. 11
-
[15]
Training proactive and personalized llm agents.arXiv preprint arXiv:2511.02208, 2025
Weiwei Sun, Xuhui Zhou, Weihua Du, Xingyao Wang, Sean Welleck, Graham Neubig, Maarten Sap, and Yiming Yang. Training proactive and personalized llm agents.arXiv preprint arXiv:2511.02208, 2025
-
[16]
David Dinucu-Jianu, Jakub Macina, Nico Daheim, Ido Hakimi, Iryna Gurevych, and Mrinmaya Sachan. From problem-solving to teaching problem-solving: Aligning llms with pedagogy using reinforcement learning. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 272–292, 2025
work page 2025
-
[17]
Cheng Qian, Zuxin Liu, Akshara Prabhakar, Jielin Qiu, Zhiwei Liu, Haolin Chen, Shirley Kokane, Heng Ji, Weiran Yao, Shelby Heinecke, et al. Userrl: Training interactive user-centric agent via reinforcement learning.arXiv preprint arXiv:2509.19736, 2025
-
[18]
Evaluating large language models as generative user simulators for conversational recommendation
Se-eun Yoon, Zhankui He, Jessica Echterhoff, and Julian McAuley. Evaluating large language models as generative user simulators for conversational recommendation. InProceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 1490–1504, 2024
work page 2024
-
[19]
Zhefan Wang, Ning Geng, Zhiqiang Guo, Weizhi Ma, and Min Zhang. Human vs. agent in task-oriented conversations. InProceedings of the 2025 Annual International ACM SIGIR Conference on Research and Development in Information Retrieval in the Asia Pacific Region, pages 133–142, 2025
work page 2025
-
[20]
Jonathan Ivey, Shivani Kumar, Jiayu Liu, Hua Shen, Sushrita Rakshit amd Rohan Raju, Haotian Zhang, Aparna Ananthasubramaniam, Junghwan Kim, Bowen Yi, Dustin Wright, Abraham Israeli, Anders Giovanni Møller, Lechen Zhang, and David Jurgens. Real or robotic? assessing whether llms accurately simulate qualities of human responses in dialogue.arXiv preprint ar...
-
[21]
Scaling synthetic data creation with 1,000,000,000 personas
Tao Ge, Xin Chan, Xiaoyang Wang, Dian Yu, Haitao Mi, and Dong Yu. Scaling synthetic data creation with 1,000,000,000 personas.arXiv preprint arXiv:2406.20094, 2024
-
[22]
Yao Dou, Michel Galley, Baolin Peng, Chris Kedzie, Weixin Cai, Alan Ritter, Chris Quirk, Wei Xu, and Jianfeng Gao. SimulatorArena: Are user simulators reliable proxies for multi- turn evaluation of AI assistants? In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Metho...
work page 2025
-
[23]
LongEval: Guidelines for human evaluation of faithfulness in long-form summariza- tion
Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/ 2025.emnlp-main.1786. URLhttps://aclanthology.org/2025.emnlp-main.1786/
-
[24]
Hyunwoo Kim, Jihyeon Ryu, Jinho Lee, Hyungon Ryu, Kiran Praveen, Shyamala Prayaga, Kirit Thadaka, Will Jennings, Bardiya Sadeghi, Ashton Sharabiani, Yejin Choi, and Yev Meyer. Nemotron-personas-korea: Synthetic personas aligned to real-world distri- butions for korea, April 2026. URL https://huggingface.co/datasets/nvidia/ Nemotron-Personas-Korea
work page 2026
-
[25]
Know you first and be you better: Modeling human-like user simulators via implicit profiles
Kuang Wang, Xianfei Li, Shenghao Yang, Li Zhou, Feng Jiang, and Haizhou Li. Know you first and be you better: Modeling human-like user simulators via implicit profiles. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 21082–21107, 2025
work page 2025
-
[26]
Tarek Naous, Philippe Laban, Wei Xu, and Jennifer Neville. Flipping the dialogue: Training and evaluating user language models.arXiv preprint arXiv:2510.06552, 2025
-
[27]
Chatbench: From static benchmarks to human-ai evaluation
Serina Chang, Ashton Anderson, and Jake M Hofman. Chatbench: From static benchmarks to human-ai evaluation. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 26009–26038, 2025
work page 2025
-
[28]
Shirley Wu, Evelyn Choi, Arpandeep Khatua, Zhanghan Wang, Joy He-Yueya, Tharindu Cyril Weerasooriya, Wei Wei, Diyi Yang, Jure Leskovec, and James Zou. Humanlm: Simulating users with state alignment beats response imitation.arXiv preprint arXiv:2603.03303, 2026. 12
- [29]
-
[30]
Esther Levin, Roberto Pieraccini, Wieland Eckert, et al. A stochastic model of human-machine interaction for learning dialog strategies.IEEE Transactions on speech and audio processing, 8(1):11–23, 2000
work page 2000
-
[31]
Policy optimization of dialogue management in spoken dialogue system for out-of-domain utterances
Yuhong Xu, Peijie Huang, Jiecong Tang, Qiangjia Huang, Zhenpeng Deng, Weimou Peng, and Jiajie Lu. Policy optimization of dialogue management in spoken dialogue system for out-of-domain utterances. In2016 International Conference on Asian Language Processing (IALP), pages 10–13. IEEE, 2016
work page 2016
-
[32]
Guided dialog policy learning: Reward estimation for multi-domain task-oriented dialog
Ryuichi Takanobu, Hanlin Zhu, and Minlie Huang. Guided dialog policy learning: Reward estimation for multi-domain task-oriented dialog. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 100–110, 2019
work page 2019
-
[33]
Adversarial learning of neural user simulators for dialogue policy optimisation
Simon Keizer, Caroline Dockes, Norbert Braunschweiler, Svetlana Stoyanchev, and Rama Doddipatla. Adversarial learning of neural user simulators for dialogue policy optimisation. arXiv preprint arXiv:2306.00858, 2023
-
[34]
Micah Carroll, Rohin Shah, Mark K Ho, Tom Griffiths, Sanjit Seshia, Pieter Abbeel, and Anca Dragan. On the utility of learning about humans for human-ai coordination.Advances in neural information processing systems, 32, 2019
work page 2019
-
[35]
Wild- Chat: 1M ChatGPT interaction logs in the wild.arXiv preprint arXiv:2405.01470,
Wenting Zhao, Xiang Ren, Jack Hessel, Claire Cardie, Yejin Choi, and Yuntian Deng. Wildchat: 1m chatgpt interaction logs in the wild.arXiv preprint arXiv:2405.01470, 2024
-
[36]
Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wildbench: Benchmarking llms with challenging tasks from real users in the wild.arXiv preprint arXiv:2406.04770, 2024
-
[37]
Wildvis: Open source visualizer for million-scale chat logs in the wild
Yuntian Deng, Wenting Zhao, Jack Hessel, Xiang Ren, Claire Cardie, and Yejin Choi. Wildvis: Open source visualizer for million-scale chat logs in the wild. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: System Demonstrations, pages 497–506, 2024
work page 2024
-
[38]
Olivier Pietquin and Thierry Dutoit. A probabilistic framework for dialog simulation and optimal strategy learning.IEEE Transactions on Audio, Speech, and Language Processing, 14 (2):589–599, 2006
work page 2006
-
[39]
Alan Fern, Sriraam Natarajan, Kshitij Judah, and Prasad Tadepalli. A decision-theoretic model of assistance.Journal of Artificial Intelligence Research, 50:71–104, 2014
work page 2014
-
[40]
Christine Herlihy, Jennifer Neville, Tobias Schnabel, and Adith Swaminathan. On overcoming miscalibrated conversational priors in llm-based chatbots.arXiv preprint arXiv:2406.01633, 2024
-
[41]
What Prompts Don't Say: Understanding and Managing Underspecification in LLM Prompts
Chenyang Yang, Yike Shi, Qianou Ma, Michael Xieyang Liu, Christian Kästner, and Tong- shuang Wu. What prompts don’t say: Understanding and managing underspecification in llm prompts.arXiv preprint arXiv:2505.13360, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[42]
The communicative function of ambiguity in language.Cognition, 122(3):280–291, 2012
Steven T Piantadosi, Harry Tily, and Edward Gibson. The communicative function of ambiguity in language.Cognition, 122(3):280–291, 2012
work page 2012
-
[43]
Domain randomization for transferring deep neural networks from simulation to the real world
Josh Tobin, Rachel Fong, Alex Ray, Jonas Schneider, Wojciech Zaremba, and Pieter Abbeel. Domain randomization for transferring deep neural networks from simulation to the real world. InIEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 23–30, 2017. 13
work page 2017
-
[44]
Quantifying generalization in reinforcement learning
Karl Cobbe, Oleg Klimov, Chris Hesse, Taehoon Kim, and John Schulman. Quantifying generalization in reinforcement learning. InInternational Conference on Machine Learning (ICML), pages 1282–1289, 2019
work page 2019
-
[45]
Rubrics as Rewards: Reinforcement Learning Beyond Verifiable Domains
Anisha Gunjal, Anthony Wang, Elaine Lau, Vaskar Nath, Yunzhong He, Bing Liu, and Sean Hendryx. Rubrics as rewards: Reinforcement learning beyond verifiable domains.arXiv preprint arXiv:2507.17746, 2025
work page internal anchor Pith review arXiv 2025
-
[46]
Prometheus 2: An open source language model specialized in evaluating other language models
Seungone Kim, Juyoung Suk, Shayne Longpre, Bill Yuchen Lin, Jamin Shin, Sean Welleck, Graham Neubig, Moontae Lee, Kyungjae Lee, and Minjoon Seo. Prometheus 2: An open source language model specialized in evaluating other language models. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 4334–4353, 2024
work page 2024
-
[47]
G-eval: Nlg evaluation using gpt-4 with better human alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. InProceedings of the 2023 conference on empirical methods in natural language processing, pages 2511–2522, 2023
work page 2023
-
[48]
Asymmetric Actor Critic for Image-Based Robot Learning
Lerrel Pinto, Marcin Andrychowicz, Peter Welinder, Wojciech Zaremba, and Pieter Abbeel. Asymmetric actor critic for image-based robot learning.arXiv preprint arXiv:1710.06542, 2017
work page Pith review arXiv 2017
-
[49]
Ryan Lowe, Yi I Wu, Aviv Tamar, Jean Harb, OpenAI Pieter Abbeel, and Igor Mordatch. Multi-agent actor-critic for mixed cooperative-competitive environments.Advances in neural information processing systems, 30, 2017
work page 2017
-
[50]
Constitutional AI: Harmlessness from AI Feedback
Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback.arXiv preprint arXiv:2212.08073, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[51]
Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022
work page 2022
-
[52]
Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback
Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[53]
Quantifying the persona effect in llm simulations
Tiancheng Hu and Nigel Collier. Quantifying the persona effect in llm simulations. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10289–10307, 2024
work page 2024
-
[54]
Persona: A reproducible testbed for pluralistic alignment
Louis Castricato, Nathan Lile, Rafael Rafailov, Jan-Philipp Fränken, and Chelsea Finn. Persona: A reproducible testbed for pluralistic alignment. InProceedings of the 31st International Conference on Computational Linguistics, pages 11348–11368, 2025
work page 2025
-
[55]
Non- collaborative user simulators for tool agents.arXiv preprint arXiv:2509.23124, 2025
Jeonghoon Shim, Woojung Song, Cheyon Jin, Seungwon KooK, and Yohan Jo. Non- collaborative user simulators for tool agents.arXiv preprint arXiv:2509.23124, 2025
-
[56]
Ofer Meshi, Krisztian Balog, Sally Goldman, Avi Caciularu, Guy Tennenholtz, Jihwan Jeong, Amir Globerson, and Craig Boutilier. Convapparel: A benchmark dataset and validation frame- work for user simulators in conversational recommenders.arXiv preprint arXiv:2602.16938, 2026
-
[57]
A framework for behavioural cloning
Michael Bain and Claude Sammut. A framework for behavioural cloning. InMachine intelligence 15, pages 103–129, 1995
work page 1995
-
[58]
Algorithms for inverse reinforcement learning
Andrew Y Ng, Stuart Russell, et al. Algorithms for inverse reinforcement learning. InIcml, volume 1, page 2, 2000
work page 2000
-
[59]
Jonathan Ho and Stefano Ermon. Generative adversarial imitation learning.Advances in neural information processing systems, 29, 2016. 14
work page 2016
-
[60]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[61]
An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[62]
Albert Q. Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, Lélio Renard Lavaud, Marie-Anne Lachaux, Pierre Stock, Teven Le Scao, Thibaut Lavril, Thomas Wang, Timothée Lacroix, and William El Sayed. Mistral 7b, 2023. URL https://arxi...
work page internal anchor Pith review Pith/arXiv arXiv 2023
-
[63]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[64]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric Xing, et al. Judging llm-as-a-judge with mt- bench and chatbot arena.Advances in neural information processing systems, 36:46595–46623, 2023
work page 2023
-
[65]
Michael Janner, Justin Fu, Marvin Zhang, and Sergey Levine. When to trust your model: Model-based policy optimization.Advances in neural information processing systems, 32, 2019
work page 2019
-
[66]
Vladimir Feinberg, Alvin Wan, Ion Stoica, Michael I Jordan, Joseph E Gonzalez, and Sergey Levine. Model-based value estimation for efficient model-free reinforcement learning.arXiv preprint arXiv:1803.00101, 2018
- [67]
-
[68]
arXiv preprint arXiv:2306.00774 , year=
Silvia Terragni, Modestas Filipavicius, Nghia Khau, Bruna Guedes, André Manso, and Roland Mathis. In-context learning user simulators for task-oriented dialog systems.arXiv preprint arXiv:2306.00774, 2023
-
[69]
arXiv preprint arXiv:2309.13233 , year=
Sam Davidson, Salvatore Romeo, Raphael Shu, James Gung, Arshit Gupta, Saab Mansour, and Yi Zhang. User simulation with large language models for evaluating task-oriented dialogue. arXiv preprint arXiv:2309.13233, 2023
-
[70]
Llm-powered user simulator for recommender system
Zijian Zhang, Shuchang Liu, Ziru Liu, Rui Zhong, Qingpeng Cai, Xiangyu Zhao, Chunxu Zhang, Qidong Liu, and Peng Jiang. Llm-powered user simulator for recommender system. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 13339–13347, 2025
work page 2025
-
[71]
Llm roleplay: Simulating human- chatbot interaction
Hovhannes Tamoyan, Hendrik Schuff, and Iryna Gurevych. Llm roleplay: Simulating human- chatbot interaction. InProceedings of the Third Workshop on Social Influence in Conversations (SICon 2025), pages 1–26, 2025
work page 2025
-
[72]
Learning to simulate human dialogue
Kanishk Gandhi, Agam Bhatia, and Noah D Goodman. Learning to simulate human dialogue. arXiv preprint arXiv:2601.04436, 2026
-
[73]
Enhancing human-like responses in large language models.arXiv preprint arXiv:2501.05032, 2025
Ethem Ya˘gız Çalık and Talha Rüzgar Akku¸ s. Enhancing human-like responses in large language models.arXiv preprint arXiv:2501.05032, 2025. 15
-
[74]
Direct preference optimization: Your language model is secretly a reward model
Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023
work page 2023
-
[75]
Direct multi-turn preference optimization for language agents
Wentao Shi, Mengqi Yuan, Junkang Wu, Qifan Wang, and Fuli Feng. Direct multi-turn preference optimization for language agents. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 2312–2324, 2024
work page 2024
-
[76]
Star-gate: Teaching language models to ask clarifying questions
Chinmaya Andukuri, Jan-Philipp Fränken, Tobias Gerstenberg, and Noah D Goodman. Star- gate: Teaching language models to ask clarifying questions.arXiv preprint arXiv:2403.19154, 2024
-
[77]
Richard S Sutton. Dyna, an integrated architecture for learning, planning, and reacting.ACM Sigart Bulletin, 2(4):160–163, 1991
work page 1991
-
[78]
A reduction of imitation learning and structured prediction to no-regret online learning
Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. InProceedings of the fourteenth international conference on artificial intelligence and statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011
work page 2011
-
[79]
Kurtland Chua, Roberto Calandra, Rowan McAllister, and Sergey Levine. Deep reinforcement learning in a handful of trials using probabilistic dynamics models.Advances in neural information processing systems, 31, 2018
work page 2018
-
[80]
Bhairav Mehta, Manfred Diaz, Florian Golemo, Christopher J Pal, and Liam Paull. Active domain randomization. InConference on Robot Learning, pages 1162–1176. PMLR, 2020
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.