arxiv: 2605.20548 · v1 · pith:TBF7WYKMnew · submitted 2026-05-19 · 💻 cs.MA

What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems

Pith reviewed 2026-05-21 06:12 UTC · model grok-4.3

classification 💻 cs.MA
keywords multi-agent systems · large language models · inter-agent communication · reasoning · verification · information exchange · performance improvement · augmentation

The pith

Absence of reasoning and verification in inter-agent communication degrades multi-agent performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a systematic analysis of what information agents share in multi-agent LLM systems. It finds that the lack of reasoning and verification details in these exchanges causes significant drops in performance due to error propagation. The authors propose Category-Aware Recovery Augmentation to force inclusion of these critical information categories in communications. This approach recovers up to 86.2 percent of the cases that previously failed. A sympathetic reader would care because it points to communication content as a key lever for improving collaborative AI systems without redesigning the agents themselves.

Core claim

The paper claims that the absence of reasoning and verification in inter-agent communication significantly degrades performance. Category-Aware Recovery Augmentation enforces the presence of critical information during communication and recovers up to 86.2% of failed cases. The results highlight the key role of information quality in effective MA collaboration.

What carries the argument

Category-Aware Recovery Augmentation, which categorizes critical information such as reasoning and verification and augments inter-agent messages to ensure their inclusion.

If this is right

  • Error propagation in multi-agent systems can be mitigated by ensuring messages contain explicit reasoning.
  • Verification steps in communications help maintain the integrity of information across agent iterations.
  • Performance in collaborative tasks improves when agents exchange complete categories of information.
  • The design of communication protocols is central to the success of multi-agent LLM setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This technique could be integrated into agent frameworks to automatically check and supplement messages.
  • Similar principles might apply to improving communication in other AI systems like tool-using agents.
  • Exploring the minimal set of categories needed could lead to more efficient augmentation methods.

Load-bearing premise

The categories of critical information such as reasoning and verification are broadly applicable across different tasks and agent architectures without introducing new failure modes.

What would settle it

Applying the Category-Aware Recovery Augmentation to a different multi-agent task or architecture and measuring a recovery rate much lower than 86 percent would falsify the general applicability of the claim.
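
As a rough illustration of the check described above, the sketch below computes a recovery rate as the share of previously failed cases that succeed once augmentation is applied. The function name `recovery_rate`, the case IDs, and this exact definition are assumptions made for illustration; the paper's own metric definition may differ.

```python
def recovery_rate(baseline_failed, succeeded_after):
    """Fraction of baseline-failed cases that succeed after augmentation.

    Both arguments are collections of case IDs (hypothetical format).
    """
    failed = set(baseline_failed)
    if not failed:
        raise ValueError("no failed cases to recover")
    # Only cases that failed at baseline count toward recovery.
    recovered = failed & set(succeeded_after)
    return len(recovered) / len(failed)


# Toy data, not results from the paper.
failed_cases = ["q1", "q2", "q3", "q4"]
now_correct = ["q1", "q3", "q4", "q9"]  # q9 never failed, so it is ignored
rate = recovery_rate(failed_cases, now_correct)
print(f"recovery rate: {rate:.1%}")
```

Running the same computation on a new task or architecture and comparing the result against the reported figure is one way to carry out the test this section proposes.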

Figures

Figures reproduced from arXiv: 2605.20548 by Iftekhar Ahmed, Yong Jin Chun.

Figure 1. Methodology Overview. In our experiments, we evaluate these MA systems without additional techniques or tool integrations, enabling a fair comparison across systems while isolating the impact of inter-agent communication. All MA systems in our experiments consist of three agents and three interaction rounds, following the best practices outlined in prior MA literature that achieved strong performance while …
Figure 2. Overview of MA architectures included in the study.
Figure 3. Sample-level Task Success Change per Occlusion Category (%).
Original abstract

Large Language Models (LLMs) have enabled collaborative Multi-Agent (MA) systems, where interacting agents improve performance through diverse reasoning and iterative refinement. However, these systems remain vulnerable to error propagation, where early-stage information degrades downstream reasoning. To address this, we conduct a systematic analysis of inter-agent communication to identify which information drives MA performance. We find that the absence of reasoning and verification in inter-agent communication significantly degrades performance. Based on these insights, we propose Category-Aware Recovery Augmentation (CARA), which enforces the presence of critical information during communication. CARA recovers up to 86.2% of failed cases. Our results highlight the key role of information quality in effective MA collaboration. Our code is available at https://anonymous.4open.science/r/cara_mas

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a systematic empirical analysis of inter-agent communications in LLM-based multi-agent systems. The authors observe that the lack of explicit reasoning and verification steps in these communications leads to degraded performance due to error propagation. They introduce a Category-Aware Recovery Augmentation (CARA) method that injects these critical information categories into the communication process. Experiments show that this approach recovers up to 86.2% of cases that previously failed. The work underscores the importance of information quality for successful collaboration in such systems.

Significance. Should the findings prove robust, this paper makes a meaningful contribution by characterizing the types of information that are pivotal in multi-agent LLM interactions and offering a targeted intervention to mitigate common failure modes. The high recovery rate indicates potential practical utility for improving MA system reliability. Explicit code release aids in verifying and extending the results.

major comments (2)
  1. [Section 3] Section 3 (Communication Analysis): The identification of reasoning and verification as the primary missing categories is derived from pattern observation in the evaluated trajectories. However, the paper does not provide a quantitative measure of how frequently these categories appear or are absent across different agent setups, which is necessary to establish them as the load-bearing factors for the performance claims.
  2. [Section 5] Section 5 (Experimental Evaluation): The reported 86.2% recovery is presented in aggregate without per-task breakdowns or a control condition that injects equivalent additional information without restricting to the identified categories. This leaves open whether the gains arise from category enforcement specifically or from stronger general prompting, which is central to validating the information-quality hypothesis.
minor comments (2)
  1. [Abstract] Abstract: The claim that the technique 'recovers up to 86.2% of failed cases' would be strengthened by stating the total number of evaluated cases and the baseline failure rate for context.
  2. [Method] Method section: The description of how categories are detected and enforced during augmentation would benefit from a concise pseudocode listing or explicit decision rules to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important areas for strengthening the empirical support in our analysis of inter-agent communication. We address each major comment below and commit to revisions that will clarify the role of specific information categories without overstating current results.

  1. Referee: [Section 3] Section 3 (Communication Analysis): The identification of reasoning and verification as the primary missing categories is derived from pattern observation in the evaluated trajectories. However, the paper does not provide a quantitative measure of how frequently these categories appear or are absent across different agent setups, which is necessary to establish them as the load-bearing factors for the performance claims.

    Authors: We agree that quantitative frequency measures would provide stronger grounding for identifying reasoning and verification as load-bearing factors. The current analysis in Section 3 relies on systematic pattern observation across trajectories, but we will revise the section to include explicit counts and percentages of category presence/absence, broken down by agent setup and task type. This will be presented in a new table or figure to directly support the performance claims.
    revision: yes

  2. Referee: [Section 5] Section 5 (Experimental Evaluation): The reported 86.2% recovery is presented in aggregate without per-task breakdowns or a control condition that injects equivalent additional information without restricting to the identified categories. This leaves open whether the gains arise from category enforcement specifically or from stronger general prompting, which is central to validating the information-quality hypothesis.

    Authors: We acknowledge that the aggregate reporting of the 86.2% recovery rate limits interpretability. In the revised manuscript, we will add per-task breakdowns of the recovery rates in Section 5 to show consistency across tasks. To directly test whether gains stem from the specific categories rather than general prompting, we will also include a control condition that injects comparable amounts of additional information without category restrictions; results from this ablation will be reported alongside the main CARA results to better isolate the effect of information quality.
    revision: yes
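
The counts and percentages of category presence/absence discussed in the first response could be tabulated along the following lines. This is a minimal sketch that assumes each message has already been annotated with the categories it contains; the record layout, the setup names `debate` and `chain`, and the function `presence_table` are hypothetical and not taken from the paper.

```python
from collections import Counter, defaultdict

CATEGORIES = ("reasoning", "verification")  # the two categories named in the paper


def presence_table(messages):
    """Return {setup: {category: share of messages containing it}}."""
    totals = Counter()
    present = defaultdict(Counter)
    for msg in messages:
        setup = msg["setup"]
        totals[setup] += 1
        for cat in CATEGORIES:
            if cat in msg["categories"]:
                present[setup][cat] += 1
    return {
        setup: {cat: present[setup][cat] / n for cat in CATEGORIES}
        for setup, n in totals.items()
    }


# Toy annotations, not data from the paper.
messages = [
    {"setup": "debate", "categories": {"reasoning", "verification"}},
    {"setup": "debate", "categories": {"reasoning"}},
    {"setup": "chain", "categories": set()},
    {"setup": "chain", "categories": {"verification"}},
]
table = presence_table(messages)
for setup, shares in table.items():
    print(setup, {cat: f"{share:.0%}" for cat, share in shares.items()})
```

Grouping by task type as well would only require adding a second key to the records and to the grouping.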

Circularity Check

0 steps flagged

Empirical study derives augmentation from observed communication patterns with no reduction to inputs by construction

The paper performs a systematic empirical analysis of inter-agent messages in LLM-based multi-agent systems, identifies the absence of reasoning and verification steps as a performance-degrading factor through direct observation of trajectories, and introduces Category-Aware Recovery Augmentation as a technique motivated by those observations. The reported 86.2% recovery is an experimental outcome on previously failed cases rather than a quantity forced by fitting or redefinition. No equations, uniqueness theorems, or self-citations are invoked to make the central claim equivalent to its inputs; the derivation remains self-contained and externally falsifiable via the provided code and task evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Paper rests on standard domain assumptions about LLM agent collaboration and error propagation; introduces one new technique without new entities or fitted parameters visible in abstract.

axioms (1)
  • domain assumption: Interacting LLM agents improve performance through diverse reasoning and iterative refinement but remain vulnerable to error propagation.
    Opening premise of the abstract that frames the problem.

pith-pipeline@v0.9.0 · 5657 in / 1196 out tokens · 36499 ms · 2026-05-21T06:12:07.505010+00:00 · methodology

The paper's appendix describes two CARA steps:

  • Prompt Augmentation. The system prompt for each agent is extended with explicit instructions specifying the critical information categories that must be present in the response.
  • Response Verification. After generation, the response is checked to verify whether all required categories are present. If any are missing, the agent is re-invoked with a correction instruction for a fixed number of retries, until all critical information is included in the response.

Layer 1: Prompt Augmentation. CARA System Prompt (Initial Response) <MA...
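
The prompt augmentation and response verification steps described above can be sketched as follows. This is an illustrative approximation, not the authors' implementation: the keyword cues in `REQUIRED`, the helper names, the retry default, and the stub agent are all assumptions, and the paper's actual prompts and category detector are not reproduced here.

```python
from typing import Callable

# Assumed category cues; a naive keyword check stands in for the paper's verifier.
REQUIRED = {
    "reasoning": ("because", "therefore", "step"),
    "verification": ("verified", "checked", "confirm"),
}


def missing_categories(response: str) -> list[str]:
    """Return the required categories with no cue word in the response."""
    text = response.lower()
    return [cat for cat, cues in REQUIRED.items()
            if not any(cue in text for cue in cues)]


def augment_prompt(system_prompt: str) -> str:
    """Prompt augmentation: name the required categories in the system prompt."""
    return f"{system_prompt}\nYour response must include: {', '.join(REQUIRED)}."


def run_with_verification(agent: Callable[[str, str], str], system_prompt: str,
                          task: str, max_retries: int = 2) -> str:
    """Response verification: re-invoke the agent until nothing is missing
    or the fixed retry budget is used up."""
    prompt = augment_prompt(system_prompt)
    response = agent(prompt, task)
    for _ in range(max_retries):
        missing = missing_categories(response)
        if not missing:
            break
        correction = f"{task}\nYour previous answer lacked: {', '.join(missing)}. Add them."
        response = agent(prompt, correction)
    return response


# Stub agent for demonstration; a real system would call an LLM here.
def stub_agent(system_prompt: str, user_message: str) -> str:
    if "lacked" in user_message:
        return "The answer is 4 because 2 + 2 = 4. I checked it by counting."
    return "The answer is 4."


final = run_with_verification(stub_agent, "You are a solver.", "What is 2 + 2?")
print(final)
```

With the stub, the first reply lacks both categories, so a single correction round is issued. A keyword check is a deliberately crude stand-in; an LLM-based or annotated classifier would slot into `missing_categories` without changing the loop.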