pith. sign in

arxiv: 2606.06399 · v2 · pith:YOUQZRWXnew · submitted 2026-06-04 · 💻 cs.CL

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

Pith reviewed 2026-06-28 01:23 UTC · model grok-4.3

classification 💻 cs.CL
keywords CollabSimcollaborative competencemulti-agent systemsLLM agentsCSCWsimulation frameworkcoordinationcommon ground
0
0 comments X

The pith

CollabSim is a simulation framework that measures collaborative competence of LLM agents by applying CSCW principles to controlled multi-agent text interactions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces CollabSim to fill a gap in multi-agent LLM evaluation, where systems often fail at coordination even when individual agents perform well on tasks. It draws on decades of CSCW research to define collaborative competence through four capabilities: establishing common ground, maintaining shared task understanding, balancing incentives, and repairing misalignment. The framework supports this by allowing experimenters to manipulate interaction conditions while probing agents' internal states at the action level. Experiments with four different LLMs demonstrate that the method detects how conditions affect collaboration, distinguishes performance patterns between models, and shows that agent design effects vary by task. A sympathetic reader would care because existing benchmarks emphasize solo reasoning or final task scores, leaving the coordination failures of real multi-agent deployments unexamined.

Core claim

CollabSim combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states to enable systematic analysis of agents' collaborative competence in MAS. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

What carries the argument

CollabSim, a configurable simulation framework that integrates a CSCW-derived definition of collaborative capabilities with controlled condition manipulation and internal-state probing.

If this is right

  • Evaluations can now isolate and measure specific collaborative behaviors such as common-ground establishment separately from task accuracy.
  • Performance differences between LLMs can be attributed to collaboration patterns rather than only individual reasoning ability.
  • Agent design choices can be tested for their effects on collaboration, with those effects shown to depend on the particular task.
  • Condition manipulations become usable as causal probes for how communication constraints influence multi-agent coordination.
  • Systematic comparison of models and designs becomes possible on collaboration dimensions that current task-outcome benchmarks overlook.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same simulation structure could be adapted to study human-AI mixed teams by inserting human participants into the controlled interaction loop.
  • If collaborative competence turns out to be trainable independently of task-solving skill, future LLM fine-tuning objectives might explicitly target dialogue repair and incentive alignment.
  • The framework supplies a route to quantify when adding agents improves or degrades overall performance due to coordination overhead rather than capability limits.
  • Results from CollabSim could guide the design of communication protocols that reduce misalignment without increasing token cost.

Load-bearing premise

The CSCW-derived definition of collaborative capabilities validly measures the collaborative competence of LLM agents when applied through text-based simulation.

What would settle it

Running the same set of controlled experiments and finding that the four collaborative metrics show no reliable differences across manipulated conditions or across the four LLMs, or that the metrics do not correlate with observed coordination failures in actual multi-agent task runs.

Figures

Figures reproduced from arXiv: 2606.06399 by Bingsheng Yao, Bo Sun, Dakuo Wang, Jiaju Chen, Yun Wang, Yuxuan Lu.

Figure 1
Figure 1. Figure 1: Illustrations of the four multi-agent experi [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Architecture of CollabSim. The Controller manages the action validation, state update, and mental model probing for each agent. LLM agents in collaborative settings and assess their performance across different scenarios and in￾teraction protocols (Yu et al., 2025; Sun et al., 2025; Orogat et al., 2026; Tewolde et al., 2026). MultiA￾gentBench (Zhu et al., 2025), for example, covers a range of cooperative t… view at source ↗
Figure 3
Figure 3. Figure 3: Agents’ probing confidence as the task unfolded. The first row shows the performance of persona-based [PITH_FULL_IMAGE:figures/full_fig_p015_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Inter-agent probing alignment as group size increases in Shape Factory and DayTrader, broken down by [PITH_FULL_IMAGE:figures/full_fig_p017_4.png] view at source ↗
read the original abstract

Multi-agent systems (MAS) built on large language models have shown growing promise, with their effectiveness resting on agents' ability to coordinate through text-based channels much as human teams do. Yet recent study suggests that MAS often falter not because agents lack individual task-solving ability, but because they lack collaborative competence: the capacity to establish common ground, maintain shared task understanding, balance individual and collective incentives, and repair misalignment as interaction unfolds. Decades of research in Computer-Supported Cooperative Work have characterized these requirements for human teams coordinating under constrained communication, yet existing MAS evaluations focus mainly on task outcomes or single-agent proficiency in reasoning, planning, and tool use. To enable a systematic analysis of agents' collaborative competence in MAS, we introduce CollabSim, a configurable simulation framework that combines a theory-grounded definition of collaborative capabilities, controlled manipulation of interaction conditions, and action-level probing of agents' internal states. Experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces CollabSim, a configurable simulation framework grounded in CSCW research for evaluating collaborative competence in LLM-based multi-agent systems. It defines four collaborative capabilities (establishing common ground, maintaining shared task understanding, balancing incentives, and repairing misalignment), incorporates controlled manipulation of interaction conditions, and uses action-level probing of agents' internal states. Experiments across four LLMs are presented as demonstrating that the framework captures condition effects, separates model performance patterns, and reveals task-dependent effects of agent design.

Significance. If the central claims hold after addressing validation gaps, CollabSim would offer a theory-grounded alternative to existing MAS evaluations that focus primarily on task outcomes or single-agent reasoning. The integration of CSCW concepts with controlled simulations and internal-state probing represents a methodological contribution that could support more systematic study of coordination failures in text-based MAS.

major comments (2)
  1. [Abstract / Experiments] Abstract and Experiments section: The claim that 'experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design' is presented without any reported details on experimental design, sample sizes, statistical methods, controls, or number of trials. This absence makes it impossible to assess whether the reported effects are supported by the data or distinguishable from simulation artifacts.
  2. [Methodology] Methodology section: The operationalization of the four CSCW-derived capabilities via text-based simulation and action-level probing assumes these metrics isolate collaborative competence rather than prompt sensitivity or single-agent reasoning. No ablation studies, correlation with human team baselines on identical tasks, or alternative operationalizations are described to test this assumption, which is load-bearing for the claim that CollabSim measures collaborative competence.
minor comments (2)
  1. [Abstract] The abstract states that MAS 'often falter not because agents lack individual task-solving ability, but because they lack collaborative competence,' but provides no citation or prior evidence for this characterization of recent study.
  2. [Methodology] Notation for the four collaborative capabilities is introduced without a dedicated table or figure summarizing their definitions, operationalizations, and probing methods, which would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments on our manuscript introducing CollabSim. We address each major comment below and indicate the revisions we will make.

read point-by-point responses
  1. Referee: [Abstract / Experiments] Abstract and Experiments section: The claim that 'experiments across four LLMs show that CollabSim can capture condition effects, separate model performance patterns, and reveal task-dependent effects of agent design' is presented without any reported details on experimental design, sample sizes, statistical methods, controls, or number of trials. This absence makes it impossible to assess whether the reported effects are supported by the data or distinguishable from simulation artifacts.

    Authors: While the abstract summarizes the findings at a high level without methodological specifics, the Experiments section of the manuscript details the simulation framework, the four LLMs evaluated, the collaborative tasks, and the condition manipulations. We concur that reporting sample sizes, statistical methods, controls, and trial numbers is essential for evaluating the robustness of the results. We will revise the Experiments section to include a comprehensive description of the experimental design, including these details and any statistical analyses performed. revision: yes

  2. Referee: [Methodology] Methodology section: The operationalization of the four CSCW-derived capabilities via text-based simulation and action-level probing assumes these metrics isolate collaborative competence rather than prompt sensitivity or single-agent reasoning. No ablation studies, correlation with human team baselines on identical tasks, or alternative operationalizations are described to test this assumption, which is load-bearing for the claim that CollabSim measures collaborative competence.

    Authors: The four capabilities are derived from CSCW literature and operationalized through controlled text-based interactions and internal state probing to focus on collaborative dynamics. We acknowledge that the current work does not include ablation studies or human team correlations to further validate the isolation from other factors. In the revision, we will expand the Methodology section to provide additional justification for the chosen operationalization and add a limitations subsection discussing the need for such validations in future work. revision: partial

Circularity Check

0 steps flagged

No circularity: framework introduction relies on external CSCW grounding and empirical experiments

full rationale

The paper introduces CollabSim as a new configurable simulation framework that operationalizes CSCW-derived collaborative capabilities (common ground, shared task understanding, incentive balance, misalignment repair) for LLM multi-agent systems. Central claims are demonstrated via controlled experiments across four LLMs showing capture of condition effects and model patterns; no equations, fitted parameters, predictions from subsets, or derivations are present. CSCW concepts are cited from external decades of research rather than self-citation chains or author prior work. No self-definitional mappings, ansatz smuggling, or renaming of known results occur. The methodology is self-contained against external benchmarks with no reduction of claims to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the assumption that CSCW concepts transfer to LLM agents; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption CSCW research supplies a valid theory-grounded definition of collaborative capabilities applicable to LLM agents in text-based interactions
    The abstract states the framework combines a theory-grounded definition from CSCW with controlled experiments.

pith-pipeline@v0.9.1-grok · 5733 in / 1122 out tokens · 21704 ms · 2026-06-28T01:23:32.086087+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

118 extracted references · 16 canonical work pages

  1. [1]

    Aho and Jeffrey D

    Alfred V. Aho and Jeffrey D. Ullman , title =. 1972

  2. [2]

    Publications Manual , year = "1983", publisher =

  3. [3]

    Chandra and Dexter C

    Ashok K. Chandra and Dexter C. Kozen and Larry J. Stockmeyer , year = "1981", title =. doi:10.1145/322234.322243

  4. [4]

    Scalable training of

    Andrew, Galen and Gao, Jianfeng , booktitle=. Scalable training of

  5. [5]

    Dan Gusfield , title =. 1997

  6. [6]

    Tetreault , title =

    Mohammad Sadegh Rasooli and Joel R. Tetreault , title =. Computing Research Repository , volume =. 2015 , url =

  7. [7]

    A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =

    Ando, Rie Kubota and Zhang, Tong , Issn =. A Framework for Learning Predictive Structures from Multiple Tasks and Unlabeled Data , Volume =. Journal of Machine Learning Research , Month = dec, Numpages =

  8. [8]

    arXiv preprint arXiv:2411.10109 , year=

    Generative agent simulations of 1,000 people , author=. arXiv preprint arXiv:2411.10109 , year=

  9. [9]

    , author=

    Pooling of unshared information in group decision making: Biased information sampling during discussion. , author=. Journal of personality and social psychology , volume=. 1985 , publisher=

  10. [10]

    Individual and group decision making: Current issues , volume=

    Shared mental models in expert team decision making , author=. Individual and group decision making: Current issues , volume=. 1993 , publisher=

  11. [11]

    arXiv preprint arXiv:2601.07110 , year=

    The Need for a Socially-Grounded Persona Framework for User Simulation , author=. arXiv preprint arXiv:2601.07110 , year=

  12. [12]

    Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

    Whose personae? synthetic persona experiments in llm research and pathways to transparency , author=. Proceedings of the AAAI/ACM Conference on AI, Ethics, and Society , volume=

  13. [13]

    Human--computer interaction , volume=

    Distance matters , author=. Human--computer interaction , volume=. 2000 , publisher=

  14. [14]

    , author=

    Grounding in communication. , author=. 1991 , publisher=

  15. [15]

    ACM Computing Surveys (CSUR) , volume=

    The interdisciplinary study of coordination , author=. ACM Computing Surveys (CSUR) , volume=. 1994 , publisher=

  16. [16]

    ArXiv , year=

    WebArena: A Realistic Web Environment for Building Autonomous Agents , author=. ArXiv , year=

  17. [17]

    Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems , year=

    CoAuthor: Designing a Human-AI Collaborative Writing Dataset for Exploring Language Model Capabilities , author=. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems , year=

  18. [18]

    arXiv preprint arXiv:2307.16789 , year=

    Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. arXiv preprint arXiv:2307.16789 , year=

  19. [19]

    2025 , url=

    LLM-Based Human-Agent Collaboration and Interaction Systems: A Survey , author=. 2025 , url=

  20. [20]

    In: Palmer, M., Hwa, R., Riedel, S

    Lewis, Mike and Yarats, Denis and Dauphin, Yann and Parikh, Devi and Batra, Dhruv. Deal or No Deal? End-to-End Learning of Negotiation Dialogues. Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing. 2017. doi:10.18653/v1/D17-1259

  21. [21]

    Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings

    He, He and Balakrishnan, Anusha and Eric, Mihail and Liang, Percy. Learning Symmetric Collaborative Dialogue Agents with Dynamic Knowledge Graph Embeddings. Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2017. doi:10.18653/v1/P17-1162

  22. [22]

    Chawla, J

    Chawla, Kushal and Ramirez, Jaysa and Clever, Rene and Lucas, Gale and May, Jonathan and Gratch, Jonathan. C a S i N o: A Corpus of Campsite Negotiation Dialogues for Automatic Negotiation Systems. Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies. 2021. doi:10.18...

  23. [23]

    and Verme, Manuel Del and Marty, Tom and Vazquez, David and Chapados, Nicolas and Lacoste, Alexandre , title =

    Drouin, Alexandre and Gasse, Maxime and Caccia, Massimo and Laradji, Issam H. and Verme, Manuel Del and Marty, Tom and Vazquez, David and Chapados, Nicolas and Lacoste, Alexandre , title =. Proceedings of the 41st International Conference on Machine Learning , articleno =. 2024 , publisher =

  24. [24]

    2024 , eprint=

    -bench: A Benchmark for Tool-Agent-User Interaction in Real-World Domains , author=. 2024 , eprint=

  25. [25]

    ArXiv , year=

    Agentic Reasoning and Tool Integration for LLMs via Reinforcement Learning , author=. ArXiv , year=

  26. [26]

    Proceedings of the 31st International Conference on Computational Linguistics , pages=

    A review of prominent paradigms for llm-based agents: Tool use, planning (including rag), and feedback learning , author=. Proceedings of the 31st International Conference on Computational Linguistics , pages=

  27. [27]

    ArXiv , year=

    Proactive Agent: Shifting LLM Agents from Reactive Responses to Active Assistance , author=. ArXiv , year=

  28. [28]

    ACM Transactions on Software Engineering and Methodology , volume=

    LLM-Based Multi-Agent Systems for Software Engineering: Literature Review, Vision, and the Road Ahead , author=. ACM Transactions on Software Engineering and Methodology , volume=. 2025 , publisher=

  29. [29]

    1984 , url=

    Groups: Interaction and Performance , author=. 1984 , url=

  30. [30]

    Language and speech , volume=

    The HCRC map task corpus , author=. Language and speech , volume=. 1991 , publisher=

  31. [31]

    Frontiers in neuroscience , volume=

    The speed-accuracy tradeoff: history, physiology, methodology, and behavior , author=. Frontiers in neuroscience , volume=. 2014 , publisher=

  32. [32]

    Computer Supported Cooperative Work (CSCW) , volume=

    A descriptive framework of workspace awareness for real-time groupware , author=. Computer Supported Cooperative Work (CSCW) , volume=. 2002 , publisher=

  33. [33]

    arXiv preprint arXiv:2509.18008 , year=

    Through the Lens of Human-Human Collaboration: A Configurable Research Platform for Exploring Human-Agent Collaboration , author=. arXiv preprint arXiv:2509.18008 , year=

  34. [34]

    Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

    Multiwoz-a large-scale multi-domain wizard-of-oz dataset for task-oriented dialogue modelling , author=. Proceedings of the 2018 conference on empirical methods in natural language processing , pages=

  35. [35]

    Frontiers in Artificial Intelligence , volume=

    Defining human-AI teaming the human-centered way: a scoping review and network analysis , author=. Frontiers in Artificial Intelligence , volume=. 2023 , publisher=

  36. [36]

    Large Language Model-based Human-Agent Collaboration for Complex Task Solving

    Feng, Xueyang and Chen, Zhi-Yuan and Qin, Yujia and Lin, Yankai and Chen, Xu and Liu, Zhiyuan and Wen, Ji-Rong. Large Language Model-based Human-Agent Collaboration for Complex Task Solving. Findings of the Association for Computational Linguistics: EMNLP 2024. 2024. doi:10.18653/v1/2024.findings-emnlp.72

  37. [37]

    M ulti A gent B ench : Evaluating the Collaboration and Competition of LLM agents

    Zhu, Kunlun and Du, Hongyi and Hong, Zhaochen and Yang, Xiaocheng and Guo, Shuyi and Wang, Zhe and Wang, Zhenhailong and Qian, Cheng and Tang, Robert and Ji, Heng and You, Jiaxuan. M ulti A gent B ench : Evaluating the Collaboration and Competition of LLM agents. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volu...

  38. [38]

    The eleventh international conference on learning representations , year=

    React: Synergizing reasoning and acting in language models , author=. The eleventh international conference on learning representations , year=

  39. [39]

    Advances in neural information processing systems , volume=

    Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=

  40. [40]

    arXiv preprint arXiv:2502.12110 , year=

    A-mem: Agentic memory for llm agents , author=. arXiv preprint arXiv:2502.12110 , year=

  41. [41]

    Frontiers of Computer Science , volume=

    A survey on large language model based autonomous agents , author=. Frontiers of Computer Science , volume=. 2024 , publisher=

  42. [42]

    Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

    Flowbench: Revisiting and benchmarking workflow-guided planning for llm-based agents , author=. Findings of the Association for Computational Linguistics: EMNLP 2024 , pages=

  43. [43]

    Proceedings of the 2025 ACM Conference on International Computing Education Research V

    The Effects of GitHub Copilot on Computing Students' Programming Effectiveness, Efficiency, and Processes in Brownfield Coding Tasks , author=. Proceedings of the 2025 ACM Conference on International Computing Education Research V. 1 , pages=

  44. [44]

    Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems , articleno =

    Cila, Nazli , title =. Proceedings of the 2022 CHI Conference on Human Factors in Computing Systems , articleno =. 2022 , isbn =. doi:10.1145/3491102.3517500 , abstract =

  45. [45]

    Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , articleno =

    Goyal, Nitesh and Chang, Minsuk and Terry, Michael , title =. Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , articleno =. 2024 , isbn =. doi:10.1145/3613905.3650948 , abstract =

  46. [46]

    CRAB : Cross-environment Agent Benchmark for Multimodal Language Model Agents

    Xu, Tianqi and Chen, Linyao and Wu, Dai-Jie and Chen, Yanjun and Zhang, Zecheng and Yao, Xiang and Xie, Zhiqiang and Chen, Yongchao and Liu, Shilong and Qian, Bochen and Yang, Anjie and Jin, Zhaoxuan and Deng, Jianbo and Torr, Philip and Ghanem, Bernard and Li, Guohao. CRAB : Cross-environment Agent Benchmark for Multimodal Language Model Agents. Findings...

  47. [47]

    and Huang, Jimin and Qian, Lingfei and Peng, Xueqing and Suchow, Jordan W

    Li, Haohang and Cao, Yupeng and Yu, Yangyang and Javaji, Shashidhar Reddy and Deng, Zhiyang and He, Yueru and Jiang, Yuechen and Zhu, Zining and Subbalakshmi, K.p. and Huang, Jimin and Qian, Lingfei and Peng, Xueqing and Suchow, Jordan W. and Xie, Qianqian. INVESTORBENCH : A Benchmark for Financial Decision-Making Tasks with LLM -based Agent. Proceedings ...

  48. [48]

    Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , articleno =

    Arakawa, Riku and Yakura, Hiromu and Akuzawa, Kei and Kubo, Shizuma , title =. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , articleno =. 2025 , isbn =. doi:10.1145/3706599.3706686 , abstract =

  49. [49]

    Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

    Need help? designing proactive ai assistants for programming , author=. Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems , pages=

  50. [50]

    Proceedings of the 2025 CHI conference on human factors in computing systems , pages=

    Assistance or disruption? exploring and evaluating the design and trade-offs of proactive ai programming support , author=. Proceedings of the 2025 CHI conference on human factors in computing systems , pages=

  51. [51]

    Situational awareness , pages=

    Direct measurement of situation awareness: Validity and use of SAGAT , author=. Situational awareness , pages=. 2017 , publisher=

  52. [52]

    American journal of audiology , volume=

    Implementing ecological momentary assessment in audiological research: Opportunities and challenges , author=. American journal of audiology , volume=. 2024 , publisher=

  53. [53]

    arXiv preprint arXiv:2603.10165 , year=

    OpenClaw-RL: Train Any Agent Simply by Talking , author=. arXiv preprint arXiv:2603.10165 , year=

  54. [54]

    arXiv preprint arXiv:2508.08322 , year=

    Context Engineering for Multi-Agent LLM Code Assistants Using Elicit, NotebookLM, ChatGPT, and Claude Code , author=. arXiv preprint arXiv:2508.08322 , year=

  55. [55]

    Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

    Collabstory: Multi-llm collaborative story generation and authorship analysis , author=. Findings of the Association for Computational Linguistics: NAACL 2025 , pages=

  56. [56]

    and Zhu, Shuqi and Ai, Qingyao and Zhou, Yujia and Liu, Yiqun

    Du, Bangde and Ye, Ziyi and Wu, Zhijing and Jankowska, Monika A. and Zhu, Shuqi and Ai, Qingyao and Zhou, Yujia and Liu, Yiqun. S im VBG : Simulating Individual Values by Backstory Generation. Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing. 2025. doi:10.18653/v1/2025.emnlp-main.662

  57. [57]

    P ersona G ym: Evaluating Persona Agents and LLM s

    Samuel, Vinay and Zou, Henry Peng and Zhou, Yue and Chaudhari, Shreyas and Kalyan, Ashwin and Rajpurohit, Tanmay and Deshpande, Ameet and Narasimhan, Karthik R and Murahari, Vishvak. P ersona G ym: Evaluating Persona Agents and LLM s. Findings of the Association for Computational Linguistics: EMNLP 2025. 2025. doi:10.18653/v1/2025.findings-emnlp.368

  58. [58]

    arXiv preprint arXiv:2511.00222 , year=

    Consistently simulating human personas with multi-turn reinforcement learning , author=. arXiv preprint arXiv:2511.00222 , year=

  59. [59]

    A gent RM : Enhancing Agent Generalization with Reward Modeling

    Xia, Yu and Fan, Jingru and Chen, Weize and Yan, Siyu and Cong, Xin and Zhang, Zhong and Lu, Yaxi and Lin, Yankai and Liu, Zhiyuan and Sun, Maosong. A gent RM : Enhancing Agent Generalization with Reward Modeling. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl...

  60. [60]

    Stimulated Recall: A Report on Its Use in Naturalistic Research , volume =

    Lyle, John , year =. Stimulated Recall: A Report on Its Use in Naturalistic Research , volume =. British Educational Research Journal - BR EDUC RES J , doi =

  61. [61]

    Assessment & Evaluation in Higher Education , volume=

    Assessing teamwork in undergraduate education: a measurement tool to evaluate individual teamwork skills , author=. Assessment & Evaluation in Higher Education , volume=. 2017 , publisher=

  62. [62]

    Advances in neural information processing systems , volume=

    Judging llm-as-a-judge with mt-bench and chatbot arena , author=. Advances in neural information processing systems , volume=

  63. [63]

    Academy of management review , volume=

    A temporally based framework and taxonomy of team processes , author=. Academy of management review , volume=. 2001 , publisher=

  64. [64]

    arXiv preprint arXiv:2505.16944 , year=

    Agentif: Benchmarking instruction following of large language models in agentic scenarios , author=. arXiv preprint arXiv:2505.16944 , year=

  65. [65]

    arXiv preprint arXiv:2502.00640 , year=

    Collabllm: From passive responders to active collaborators , author=. arXiv preprint arXiv:2502.00640 , year=

  66. [66]

    Navigating Rifts in Human- LLM Grounding: Study and Benchmark

    Shaikh, Omar and Mozannar, Hussein and Bansal, Gagan and Fourney, Adam and Horvitz, Eric. Navigating Rifts in Human- LLM Grounding: Study and Benchmark. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 2025. doi:10.18653/v1/2025.acl-long.1016

  67. [67]

    arXiv preprint arXiv:2602.21337 , year=

    A Benchmark to Assess Common Ground in Human-AI Collaboration , author=. arXiv preprint arXiv:2602.21337 , year=

  68. [68]

    Situational awareness , pages=

    Toward a theory of situation awareness in dynamic systems , author=. Situational awareness , pages=. 2017 , publisher=

  69. [69]

    1995 , publisher=

    A computational theory of grounding in natural language conversation , author=. 1995 , publisher=

  70. [70]

    Sentence-bert: Sentence embeddings using siamese bert-networks , author=. Proceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP) , pages=

  71. [71]

    Text summarization branches out , pages=

    Rouge: A package for automatic evaluation of summaries , author=. Text summarization branches out , pages=

  72. [72]

    The annals of mathematical statistics , pages=

    Snowball sampling , author=. The annals of mathematical statistics , pages=. 1961 , publisher=

  73. [73]

    arXiv preprint arXiv:2210.00720 , year=

    Complexity-based prompting for multi-step reasoning , author=. arXiv preprint arXiv:2210.00720 , year=

  74. [74]

    arXiv preprint arXiv:2605.11514 , year=

    FlowSteer: Prompt-Only Workflow Steering Exposes Planning-Time Vulnerabilities in Multi-Agent LLM Systems , author=. arXiv preprint arXiv:2605.11514 , year=

  75. [75]

    Advances in Neural Information Processing Systems , volume=

    Why do multi-agent llm systems fail? , author=. Advances in Neural Information Processing Systems , volume=

  76. [76]

    arXiv preprint arXiv:2604.07821 , year=

    More Capable, Less Cooperative? When LLMs Fail At Zero-Cost Collaboration , author=. arXiv preprint arXiv:2604.07821 , year=

  77. [77]

    arXiv preprint arXiv:2602.01011 , year=

    Multi-Agent Teams Hold Experts Back , author=. arXiv preprint arXiv:2602.01011 , year=

  78. [78]

    International Conference on Learning Representations , volume=

    MetaGPT: Meta programming for a multi-agent collaborative framework , author=. International Conference on Learning Representations , volume=

  79. [79]

    Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

    Chatdev: Communicative agents for software development , author=. Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

  80. [80]

    First conference on language modeling , year=

    Autogen: Enabling next-gen LLM applications via multi-agent conversations , author=. First conference on language modeling , year=

Showing first 80 references.