What Do Agents Communicate? Characterizing Information Exchange in Multi-Agent Systems

Iftekhar Ahmed; Yong Jin Chun

arxiv: 2605.20548 · v1 · pith:TBF7WYKMnew · submitted 2026-05-19 · 💻 cs.MA

Pith reviewed 2026-05-21 06:12 UTC · model grok-4.3

classification 💻 cs.MA

keywords: multi-agent systems, large language models, inter-agent communication, reasoning, verification, information exchange, performance improvement, augmentation

The pith

Absence of reasoning and verification in inter-agent communication degrades multi-agent performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper conducts a systematic analysis of what information agents share in multi-agent LLM systems. It finds that the lack of reasoning and verification details in these exchanges causes significant drops in performance due to error propagation. The authors propose Category-Aware Recovery Augmentation to force inclusion of these critical information categories in communications. This approach recovers up to 86.2 percent of the cases that previously failed. A sympathetic reader would care because it points to communication content as a key lever for improving collaborative AI systems without redesigning the agents themselves.

Core claim

The paper claims that the absence of reasoning and verification in inter-agent communication significantly degrades performance. Category-Aware Recovery Augmentation enforces the presence of critical information during communication and recovers up to 86.2% of failed cases. The results highlight the key role of information quality in effective MA collaboration.

What carries the argument

Category-Aware Recovery Augmentation, which categorizes critical information such as reasoning and verification and augments inter-agent messages to ensure their inclusion.
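
The idea of checking a message for required information categories and augmenting it when one is missing can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the cue phrases in `CATEGORY_CUES`, the substring-based detector, and the wording of the appended request are hypothetical and are not taken from the paper, whose actual detection and enforcement rules are not described on this page.

```python
# Hypothetical cue phrases per category (an assumption, not the paper's detector).
CATEGORY_CUES = {
    "reasoning": ("because", "therefore", "since", "step"),
    "verification": ("verified", "checked", "confirmed", "tested"),
}


def missing_categories(message: str) -> list[str]:
    """Return the categories for which no cue phrase occurs in the message."""
    text = message.lower()
    return [
        category
        for category, cues in CATEGORY_CUES.items()
        if not any(cue in text for cue in cues)
    ]


def augment(message: str) -> str:
    """Append one explicit request per missing category; leave complete messages unchanged."""
    missing = missing_categories(message)
    if not missing:
        return message
    requests = "\n".join(
        f"[missing {category}] Please state your {category} explicitly."
        for category in missing
    )
    return f"{message}\n{requests}"


if __name__ == "__main__":
    print(augment("The answer is 42."))
    print(augment("The answer is 42 because 6 * 7 = 42; I checked it by division."))
```

A keyword check like this is deliberately crude; a real system would more plausibly use an LLM-based or learned classifier, but the control flow (detect, then enforce) would look similar.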

If this is right

Error propagation in multi-agent systems can be mitigated by ensuring messages contain explicit reasoning.
Verification steps in communications help maintain the integrity of information across agent iterations.
Performance in collaborative tasks improves when agents exchange complete categories of information.
The design of communication protocols is central to the success of multi-agent LLM setups.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This technique could be integrated into agent frameworks to automatically check and supplement messages.
Similar principles might apply to improving communication in other AI systems like tool-using agents.
Exploring the minimal set of categories needed could lead to more efficient augmentation methods.

Load-bearing premise

The categories of critical information such as reasoning and verification are broadly applicable across different tasks and agent architectures without introducing new failure modes.

What would settle it

Applying the Category-Aware Recovery Augmentation to a different multi-agent task or architecture and measuring a recovery rate much lower than 86 percent would falsify the general applicability of the claim.
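
To make such a test concrete, a recovery rate has to be computed the same way on the new task. The sketch below assumes one plausible definition: the fraction of samples that failed at baseline and succeed once augmentation is applied. The function name, the boolean-list inputs, and the toy data are hypothetical; the paper's exact metric definition is not given on this page.

```python
def recovery_rate(baseline_success: list[bool], augmented_success: list[bool]) -> float:
    """Fraction of baseline failures that succeed after augmentation.

    Assumed definition (not confirmed by the source): recovered / failed,
    where both lists are aligned sample by sample.
    """
    if len(baseline_success) != len(augmented_success):
        raise ValueError("both runs must cover the same samples")
    failed = [i for i, ok in enumerate(baseline_success) if not ok]
    if not failed:
        raise ValueError("no baseline failures to recover")
    recovered = sum(1 for i in failed if augmented_success[i])
    return recovered / len(failed)


if __name__ == "__main__":
    # Toy data: 4 baseline failures, 3 of which succeed after augmentation.
    baseline = [True, False, False, False, False]
    augmented = [True, True, True, True, False]
    print(recovery_rate(baseline, augmented))  # 0.75
```

Note that this definition ignores samples that succeed at baseline but fail after augmentation; a full evaluation would report those regressions alongside the recovery rate.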

Figures

Figures reproduced from arXiv: 2605.20548 by Iftekhar Ahmed, Yong Jin Chun.

**Figure 1.** Methodology Overview. In our experiments, we evaluate these MA systems without additional techniques or tool integrations, enabling a fair comparison across systems while isolating the impact of inter-agent communication. All MA systems in our experiments consist of three agents and three interaction rounds, following the best practices outlined in prior MA literature that achieved strong performance while …

**Figure 2.** Overview of MA architectures included in the study.

**Figure 3.** Sample-level Task Success Change per Occlusion Category (%).
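
The quantity named in the Figure 3 caption, a change in task success per occluded category, can be sketched as follows. This is an illustration only: the record layout `(category, baseline_ok, occluded_ok)` and the aggregation into a percentage-point difference are assumptions, and the paper may compute its sample-level statistic differently.

```python
from collections import defaultdict


def success_change_by_category(records):
    """Percentage-point change in success rate when each category is occluded.

    `records` is an iterable of (category, baseline_ok, occluded_ok) tuples,
    a hypothetical layout chosen for this sketch.
    """
    # Per category: [sample count, baseline successes, occluded successes]
    totals = defaultdict(lambda: [0, 0, 0])
    for category, baseline_ok, occluded_ok in records:
        entry = totals[category]
        entry[0] += 1
        entry[1] += int(baseline_ok)
        entry[2] += int(occluded_ok)
    return {
        category: 100.0 * (occluded - baseline) / count
        for category, (count, baseline, occluded) in totals.items()
    }


if __name__ == "__main__":
    toy_records = [
        ("reasoning", True, False),
        ("reasoning", True, True),
        ("verification", True, True),
        ("verification", False, False),
    ]
    print(success_change_by_category(toy_records))
```

A negative value means that masking the category lowered the success rate on those samples, which is the direction the paper reports for reasoning and verification.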

Original abstract

Large Language Models (LLMs) have enabled collaborative Multi-Agent (MA) systems, where interacting agents improve performance through diverse reasoning and iterative refinement. However, these systems remain vulnerable to error propagation, where early-stage information degrades downstream reasoning. To address this, we conduct a systematic analysis of inter-agent communication to identify which information drives MA performance. We find that the absence of reasoning and verification in inter-agent communication significantly degrades performance. Based on these insights, we propose Category-Aware Recovery Augmentation (CARA), which enforces the presence of critical information during communication. CARA recovers up to 86.2% of failed cases. Our results highlight the key role of information quality in effective MA collaboration. Our code is available at https://anonymous.4open.science/r/cara_mas

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that missing reasoning and verification in LLM agent messages hurts multi-agent performance and offers a recovery augmentation that fixes most failures in their tests, but the gains may stem from stronger prompting on narrow tasks rather than a general information-quality fix.

The key point is that this work looks at the actual content of messages passed between LLM agents and links the absence of reasoning and verification steps to clear performance drops. The authors then test a Category-Aware Recovery Augmentation that injects those elements and report recovering up to 86.2% of failed cases. The code release is a plus for anyone who wants to inspect or extend it.

Referee Report

2 major / 2 minor

Summary. The manuscript reports a systematic empirical analysis of inter-agent communications in LLM-based multi-agent systems. The authors observe that the lack of explicit reasoning and verification steps in these communications leads to degraded performance due to error propagation. They introduce a Category-Aware Recovery Augmentation (CARA) method that injects these critical information categories into the communication process. Experiments show that this approach recovers up to 86.2% of cases that previously failed. The work underscores the importance of information quality for successful collaboration in such systems.

Significance. Should the findings prove robust, this paper makes a meaningful contribution by characterizing the types of information that are pivotal in multi-agent LLM interactions and offering a targeted intervention to mitigate common failure modes. The high recovery rate indicates potential practical utility for improving MA system reliability. Explicit code release aids in verifying and extending the results.

major comments (2)

[Section 3: Communication Analysis] The identification of reasoning and verification as the primary missing categories is derived from pattern observation in the evaluated trajectories. However, the paper does not provide a quantitative measure of how frequently these categories appear or are absent across different agent setups, which is necessary to establish them as the load-bearing factors for the performance claims.
[Section 5: Experimental Evaluation] The reported 86.2% recovery is presented in aggregate without per-task breakdowns or a control condition that injects equivalent additional information without restricting to the identified categories. This leaves open whether the gains arise from category enforcement specifically or from stronger general prompting, which is central to validating the information-quality hypothesis.

minor comments (2)

[Abstract] The claim that the technique 'recovers up to 86.2% of failed cases' would be strengthened by stating the total number of evaluated cases and the baseline failure rate for context.
[Method] The description of how categories are detected and enforced during augmentation would benefit from a concise pseudocode listing or explicit decision rules to improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important areas for strengthening the empirical support in our analysis of inter-agent communication. We address each major comment below and commit to revisions that will clarify the role of specific information categories without overstating current results.

Referee: [Section 3: Communication Analysis] The identification of reasoning and verification as the primary missing categories is derived from pattern observation in the evaluated trajectories. However, the paper does not provide a quantitative measure of how frequently these categories appear or are absent across different agent setups, which is necessary to establish them as the load-bearing factors for the performance claims.

Authors: We agree that quantitative frequency measures would provide stronger grounding for identifying reasoning and verification as load-bearing factors. The current analysis in Section 3 relies on systematic pattern observation across trajectories, but we will revise the section to include explicit counts and percentages of category presence/absence, broken down by agent setup and task type. This will be presented in a new table or figure to directly support the performance claims.
Revision: yes

Referee: [Section 5: Experimental Evaluation] The reported 86.2% recovery is presented in aggregate without per-task breakdowns or a control condition that injects equivalent additional information without restricting to the identified categories. This leaves open whether the gains arise from category enforcement specifically or from stronger general prompting, which is central to validating the information-quality hypothesis.

Authors: We acknowledge that the aggregate reporting of the 86.2% recovery rate limits interpretability. In the revised manuscript, we will add per-task breakdowns of the recovery rates in Section 5 to show consistency across tasks. To directly test whether gains stem from the specific categories rather than general prompting, we will also include a control condition that injects comparable amounts of additional information without category restrictions; results from this ablation will be reported alongside the main CARA results to better isolate the effect of information quality.
Revision: yes

Circularity Check

0 steps flagged

Empirical study derives augmentation from observed communication patterns with no reduction to inputs by construction

The paper performs a systematic empirical analysis of inter-agent messages in LLM-based multi-agent systems, identifies the absence of reasoning and verification steps as a performance-degrading factor through direct observation of trajectories, and introduces Category-Aware Recovery Augmentation as a technique motivated by those observations. The reported 86.2% recovery is an experimental outcome on previously failed cases rather than a quantity forced by fitting or redefinition. No equations, uniqueness theorems, or self-citations are invoked to make the central claim equivalent to its inputs; the derivation remains self-contained and externally falsifiable via the provided code and task evaluations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper rests on standard domain assumptions about LLM agent collaboration and error propagation. It introduces one new technique, with no new entities or fitted parameters visible in the abstract.

axioms (1)

domain assumption: Interacting LLM agents improve performance through diverse reasoning and iterative refinement but remain vulnerable to error propagation.
Source: opening premise of the abstract that frames the problem.

pith-pipeline@v0.9.0 · 5657 in / 1196 out tokens · 36499 ms · 2026-05-21T06:12:07.505010+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "We propose Category-Aware Recovery Augmentation (CARA), which enforces the presence of critical information during communication. CARA recovers up to 86.2% of failed cases."

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: "Occlusion analysis to systematically mask each identified information category"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2025
Appendix

This appendix complements the main paper by providing additional experimental details, prompt templates, supplementary ...
Prompt Augmentation. The system prompt for each agent is extended with explicit instructions specifying the critical information categories that must be present in the response.

Response Verification. After generation, the response is checked to verify whether all required categories are present. If any are missing, the agent is re-invoked with a correction instruction, up to a fixed number of retries, until all critical information is included in its response.

Layer 1: Prompt Augmentation

CARA System Prompt (Initial Response)

<MA...
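The two CARA layers described in the appendix (prompt augmentation, then response verification with bounded retries) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the category names, the cue-phrase check in `missing_categories`, the `Agent` call signature, and the default of two retries are all hypothetical, and the paper may detect categories in a different way (for example, with an LLM-based classifier).

```python
from typing import Callable, Dict, List, Set

# Hypothetical category names and cue phrases, chosen for illustration only;
# the paper's actual taxonomy and detection method are not reproduced here.
CRITICAL_CATEGORIES: Dict[str, List[str]] = {
    "reasoning": ["because", "therefore", "step"],
    "verification": ["verified", "checked", "confirmed"],
}

# Assumed agent interface: (system_prompt, message) -> response text.
Agent = Callable[[str, str], str]


def augment_system_prompt(base_prompt: str, categories: List[str]) -> str:
    """Layer 1: append explicit instructions naming the required categories."""
    required = ", ".join(categories)
    return f"{base_prompt}\nYour response must include: {required}."


def missing_categories(response: str) -> Set[str]:
    """Layer 2 check: return the categories with no cue phrase in the response."""
    text = response.lower()
    return {
        name
        for name, cues in CRITICAL_CATEGORIES.items()
        if not any(cue in text for cue in cues)
    }


def respond_with_recovery(
    agent: Agent, base_prompt: str, message: str, max_retries: int = 2
) -> str:
    """Query the agent, re-invoking it while required categories are missing."""
    system_prompt = augment_system_prompt(base_prompt, list(CRITICAL_CATEGORIES))
    response = agent(system_prompt, message)
    for _ in range(max_retries):
        missing = missing_categories(response)
        if not missing:
            break  # all critical categories are present
        correction = (
            f"{message}\nYour previous response was missing: "
            f"{', '.join(sorted(missing))}. Revise it to include them."
        )
        response = agent(system_prompt, correction)
    return response
```

Bounding the loop with `max_retries` mirrors the "fixed number of retries" in the description: the agent is not re-invoked indefinitely, so a response may still lack a category once the retry budget is spent.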
