A Survey on LLM-based Conversational User Simulation
Pith reviewed 2026-05-08 03:16 UTC · model grok-4.3
The pith
A new taxonomy classifies LLM-based conversational user simulations by user granularity and simulation objectives.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce a novel taxonomy covering user granularity and simulation objectives. Additionally, we systematically analyze core techniques and evaluation methodologies, identifying open challenges and organizing existing work under a unified framework.
What carries the argument
The taxonomy based on user granularity (how detailed or individualized the simulated users are) and simulation objectives (the intended purpose, such as training or evaluation).
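The two-axis design can be made concrete as a classification grid: each surveyed simulator occupies one (granularity, objective) cell, and empty cells surface research gaps. A minimal sketch, with hypothetical category labels since the survey's exact axis values are not reproduced here:

```python
from dataclasses import dataclass
from enum import Enum

# Hypothetical axis values -- illustrative placeholders, not the survey's own labels.
class Granularity(Enum):
    GENERIC = "generic population"     # averaged user behavior
    GROUP = "group/persona profile"    # demographic or persona cluster
    INDIVIDUAL = "individual user"     # one specific user's history

class Objective(Enum):
    TRAINING = "training data generation"
    EVALUATION = "system evaluation"
    ROBUSTNESS = "robustness testing"

@dataclass(frozen=True)
class SimulatorEntry:
    paper: str
    granularity: Granularity
    objective: Objective

# Placing surveyed papers into taxonomy cells makes coverage gaps visible:
corpus = [
    SimulatorEntry("Paper A", Granularity.GENERIC, Objective.TRAINING),
    SimulatorEntry("Paper B", Granularity.INDIVIDUAL, Objective.EVALUATION),
]
cells = {(e.granularity, e.objective) for e in corpus}
empty = [(g, o) for g in Granularity for o in Objective if (g, o) not in cells]
print(len(empty))  # → 7  (3x3 grid minus the 2 occupied cells)
```

Each empty cell is exactly the kind of gap the review flags above, e.g. high granularity combined with robustness testing.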
If this is right
- Existing papers can be reorganized and compared more systematically using the shared taxonomy.
- New simulators can be designed with explicit choices about granularity and objective from the start.
- Evaluation protocols can be aligned to the taxonomy categories rather than developed ad hoc.
- Open challenges identified in the survey become clearer targets for follow-up research.
Where Pith is reading between the lines
- The taxonomy may accelerate progress by making it easier to spot gaps, such as simulations that combine high user granularity with specific robustness-testing objectives.
- Adoption could lead to shared benchmark suites tailored to each taxonomy cell rather than generic dialogue metrics.
- Extensions of the framework might incorporate temporal consistency across long conversations or cultural variation in user behavior.
Load-bearing premise
The taxonomy based on user granularity and simulation objectives provides a comprehensive and non-overlapping classification of all relevant LLM-based conversational user simulation work.
What would settle it
A published LLM-based user simulator that cannot be placed into any category of the proposed taxonomy without stretching or redefining the axes.
Original abstract
User simulation has long played a vital role in computer science due to its potential to support a wide range of applications. Language, as the primary medium of human communication, forms the foundation of social interaction and behavior. Consequently, simulating conversational behavior has become a key area of study. Recent advancements in large language models (LLMs) have significantly catalyzed progress in this domain by enabling high-fidelity generation of synthetic user conversation. In this paper, we survey recent advancements in LLM-based conversational user simulation. We introduce a novel taxonomy covering user granularity and simulation objectives. Additionally, we systematically analyze core techniques and evaluation methodologies. We aim to keep the research community informed of the latest advancements in conversational user simulation and to further facilitate future research by identifying open challenges and organizing existing work under a unified framework.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper surveys recent advancements in LLM-based conversational user simulation. It introduces a novel taxonomy based on user granularity and simulation objectives, systematically analyzes core techniques and evaluation methodologies, identifies open challenges, and organizes existing work under a unified framework to inform the research community.
Significance. If the taxonomy proves comprehensive and non-overlapping and the literature coverage is exhaustive, the survey could usefully consolidate a rapidly growing subfield, providing researchers with a shared vocabulary and highlighting gaps in techniques and evaluation practices. The absence of machine-checked elements or new empirical results is expected for a survey, but the value hinges on verifiable systematicity rather than post-hoc organization.
major comments (3)
- [Introduction and §2 (or equivalent methods/literature review section)] The central claim of a 'systematic analysis' and 'unified framework' organizing 'existing work' requires an explicit literature search protocol. No section describes the databases queried, search keywords, date range, inclusion/exclusion criteria, or number of papers screened. Without this, the taxonomy and subsequent technique/evaluation analysis cannot be assessed for completeness or selection bias.
- [§3 (Taxonomy section)] The taxonomy is asserted to cover user granularity and simulation objectives comprehensively with negligible overlap. The manuscript provides no inter-coder agreement metric, discussion of edge cases (e.g., multi-objective or hybrid-granularity simulations), or explicit mapping of all surveyed papers to categories. This directly affects the claim that the taxonomy partitions the literature without unclassifiable cases.
- [Evaluation methodologies section] The analysis of evaluation methodologies lacks a clear breakdown of how many papers use each method and whether the taxonomy dimensions correlate with evaluation choices. If the taxonomy is meant to organize the field, the evaluation section should include a contingency table or similar cross-tabulation showing coverage.
minor comments (2)
- [Figure 1] Figure 1 (taxonomy diagram) would benefit from explicit arrows or labels showing how the two dimensions interact, rather than a simple grid.
- [Core techniques section] Some citations in the techniques section appear to be grouped by high-level category without individual paper summaries; adding one-sentence contributions for the most influential works would improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important opportunities to improve the transparency and rigor of our survey. We address each major comment point by point below and commit to revisions that strengthen the manuscript without altering its core contributions.
Point-by-point responses
Referee: The central claim of a 'systematic analysis' and 'unified framework' organizing 'existing work' requires an explicit literature search protocol. No section describes the databases queried, search keywords, date range, inclusion/exclusion criteria, or number of papers screened. Without this, the taxonomy and subsequent technique/evaluation analysis cannot be assessed for completeness or selection bias.
Authors: We agree that an explicit literature search protocol was not detailed in the original manuscript. In the revised version, we will add a new subsection (placed in the Introduction) that fully documents the search process. This will specify the databases and sources queried (arXiv, ACL Anthology, Google Scholar, and selected conference proceedings), the search keywords and Boolean combinations employed (e.g., 'LLM user simulation', 'conversational user simulation', 'synthetic conversational agents'), the date range (primarily 2022–2024 to capture post-LLM developments), inclusion/exclusion criteria, and the counts of papers identified, screened, and included. This addition will enable readers to evaluate completeness and potential biases directly. revision: yes
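The promised screening counts follow mechanically once hits are deduplicated and filtered by the stated criteria. A sketch of such a pipeline, where the queries, hit records, and the date-range criterion are illustrative stand-ins for the protocol the authors describe:

```python
import re

# Hypothetical search hits; titles and years are illustrative only.
hits = [
    {"title": "Simulating Users with LLMs",        "year": 2023},
    {"title": "Simulating Users with LLMs",        "year": 2023},  # duplicate hit
    {"title": "A Pre-LLM User Simulator",          "year": 2019},
    {"title": "LLM-based Dialogue User Simulator", "year": 2024},
]

def include(paper):
    # Inclusion criterion from the planned subsection: post-LLM date range.
    return 2022 <= paper["year"] <= 2024

# Deduplicate by normalized title, then apply inclusion criteria.
seen, screened = set(), []
for p in hits:
    key = re.sub(r"\W+", " ", p["title"]).strip().lower()
    if key not in seen:
        seen.add(key)
        screened.append(p)
included = [p for p in screened if include(p)]
print(len(hits), len(screened), len(included))  # → 4 3 2 (identified, screened, included)
```

Reporting exactly these three counts (identified, screened, included) is what lets readers audit completeness and selection bias.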
Referee: §3 (Taxonomy section): The taxonomy is asserted to cover user granularity and simulation objectives comprehensively with negligible overlap. The manuscript provides no inter-coder agreement metric, discussion of edge cases (e.g., multi-objective or hybrid-granularity simulations), or explicit mapping of all surveyed papers to categories. This directly affects the claim that the taxonomy partitions the literature without unclassifiable cases.
Authors: We acknowledge that additional validation details would strengthen the taxonomy claims. In the revision, we will insert an explicit mapping (as a table in §3 or an appendix) that assigns every surveyed paper to its primary taxonomy categories. We will also expand the section to discuss edge cases, including multi-objective and hybrid-granularity simulations, with concrete examples of how they are classified and any boundary decisions made. While formal inter-coder agreement statistics are less common in single-team surveys, we will describe the iterative internal refinement process used to minimize overlap and ensure coverage, supported by the new mapping table. revision: yes
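The inter-coder agreement statistic the referee asks for is cheap to compute once two annotators have independently assigned papers to taxonomy cells. A minimal Cohen's kappa sketch, with hypothetical granularity labels for six papers:

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators' category assignments."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement under independent labeling with each
    # annotator's marginal category frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Two coders assigning hypothetical granularity labels to six papers:
coder1 = ["generic", "individual", "group", "group", "individual", "generic"]
coder2 = ["generic", "individual", "group", "individual", "individual", "generic"]
print(cohens_kappa(coder1, coder2))  # → 0.75
```

Even for a single-team survey, reporting kappa between two internal coders (plus the disagreement cases) would substantiate the "negligible overlap" claim more directly than a narrative description of the refinement process.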
Referee: The analysis of evaluation methodologies lacks a clear breakdown of how many papers use each method and whether the taxonomy dimensions correlate with evaluation choices. If the taxonomy is meant to organize the field, the evaluation section should include a contingency table or similar cross-tabulation showing coverage.
Authors: We agree that quantitative cross-analysis would better demonstrate the taxonomy's organizing value. In the revised evaluation methodologies section, we will add (1) explicit counts and percentages of papers using each evaluation method and (2) a contingency table (or equivalent cross-tabulation) that breaks down evaluation methods by the two taxonomy dimensions (user granularity and simulation objectives). This table will highlight coverage, potential correlations, and gaps, directly supporting the unified-framework claim. revision: yes
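The promised cross-tabulation reduces to a few lines once each surveyed paper carries (granularity, objective, evaluation-method) annotations. A sketch with hypothetical labels; the real values would come from the mapping table promised for §3:

```python
from collections import Counter
from itertools import product

# Hypothetical annotations -- illustrative only, not the survey's actual coding.
papers = [
    {"granularity": "generic",    "objective": "training",   "eval": "automatic"},
    {"granularity": "individual", "objective": "evaluation", "eval": "human"},
    {"granularity": "individual", "objective": "evaluation", "eval": "automatic"},
    {"granularity": "group",      "objective": "training",   "eval": "human"},
]

# Contingency table: taxonomy cell (granularity x objective) vs. evaluation method.
table = Counter(((p["granularity"], p["objective"]), p["eval"]) for p in papers)

cells = sorted({(p["granularity"], p["objective"]) for p in papers})
methods = sorted({p["eval"] for p in papers})
for cell, method in product(cells, methods):
    print(cell, method, table[(cell, method)])  # counts per (cell, method) pair
```

Zero counts in this table are exactly the coverage gaps the unified-framework claim should surface.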
Circularity Check
No circularity: survey taxonomy and analysis rest on external literature synthesis
Full rationale
This is a literature survey paper with no derivations, equations, fitted parameters, or predictions. The central contribution is a proposed taxonomy (user granularity × simulation objectives) plus systematic review of techniques and evaluations drawn from cited external works. No step reduces by construction to the paper's own inputs; the taxonomy is explicitly introduced as novel rather than derived from prior self-citations, and completeness claims are framed as synthesis rather than self-verifying. Standard review practices (citing prior papers) do not trigger any of the enumerated circularity patterns.