SIGMA: Skill-Incidence Graphs for Compositional Multi-Agent Design

Haoyue Liu; Kun Zeng; Siyue Chen; Siyu Zhang; Xiaoying Tang; Yuecheng Zhuo; Yu Huo; Yuquan Lu

arxiv: 2606.19758 · v1 · pith:6AL3DPS3new · submitted 2026-06-18 · 💻 cs.MA


Pith reviewed 2026-06-26 15:26 UTC · model grok-4.3

classification 💻 cs.MA

keywords multi-agent systems · compositional design · skill incidence graphs · LLM agents · graph-based MAS · task-conditioned agents · agent construction


The pith

SIGMA constructs multi-agent systems by bundling reusable skills into agents via incidence matrices rather than using fixed agent nodes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SIGMA to address the limitation that existing graph-based multi-agent designers optimize only communication topologies while treating each agent as a fixed, closed entity. SIGMA instead predicts a skill-agent incidence matrix from a task and skill library, composes agent embeddings by selecting and combining skills, and decodes a topology over the resulting agents. Skill-specific mailboxes then route messages to the assigned capabilities at runtime. This compositional approach improves average performance over the strongest topology-only baseline across six benchmarks and three LLMs, while showing smaller degradation when skill libraries contain unseen items. The central suggestion is that building agents from modular skills forms a useful additional design axis alongside topology optimization.

Core claim

Given a task and skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from the selected skills, decodes a communication topology over the constructed agents, and makes the incidence structure operational through skill-specific mailboxes that route messages to the relevant capabilities during execution. On six reasoning and coding benchmarks with three base LLMs, this yields the best average performance and improves over the strongest non-compositional baseline by 2.06, 2.36, and 1.75 points respectively, while dropping only 0.96 points on average under unseen skill libraries.
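The first three steps of the claim (incidence matrix, skill-composed agent embeddings, decoded topology) can be sketched as follows. This is a minimal illustration, not the authors' method: mean pooling over selected skills and a cosine-similarity threshold for edges are assumptions, and every function name and toy embedding is hypothetical.

```python
import math

def compose_agents(incidence, skill_emb):
    """Average the embeddings of the skills assigned to each agent slot.

    Assumed layout: incidence[k][m] == 1 means skill m belongs to agent k.
    """
    agents = []
    for row in incidence:
        chosen = [skill_emb[m] for m, bit in enumerate(row) if bit]
        if not chosen:
            raise ValueError("agent slot has no skills")
        dim = len(chosen[0])
        agents.append([sum(v[d] for v in chosen) / len(chosen) for d in range(dim)])
    return agents

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def decode_topology(agent_emb, threshold=0.5):
    """Return undirected edges (i, j), i < j, whose similarity exceeds threshold."""
    n = len(agent_emb)
    return [(i, j) for i in range(n) for j in range(i + 1, n)
            if cosine(agent_emb[i], agent_emb[j]) > threshold]

# Toy library of 3 skills with 2-d embeddings (illustrative values only).
skill_emb = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
# Two agent slots: agent 0 gets skills 0 and 2, agent 1 gets skill 1.
incidence = [[1, 0, 1], [0, 1, 0]]
agents = compose_agents(incidence, skill_emb)   # [[1.0, 0.5], [0.0, 1.0]]
edges = decode_topology(agents, threshold=0.4)  # [(0, 1)] for these toy values
```

A learned decoder would likely produce directed, task-conditioned edges; the threshold rule here only shows where the composed embeddings enter the topology step.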

What carries the argument

The skill-agent incidence matrix that selects and bundles skills into agents and enables direct operational routing via skill-specific mailboxes.
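One way to picture how an incidence matrix becomes operational at runtime is a mailbox keyed by (agent, skill). The sketch below assumes that a message tagged with a skill is delivered only to agents that hold that skill; the class, its methods, and the skill names are hypothetical and are not taken from the paper.

```python
from collections import defaultdict

class SkillMailboxes:
    """Per-agent, per-skill message queues (hypothetical structure)."""

    def __init__(self, incidence, skill_names):
        # assigned[k] is the set of skill names held by agent k.
        self.assigned = [
            {skill_names[m] for m, bit in enumerate(row) if bit}
            for row in incidence
        ]
        self.boxes = defaultdict(list)  # (agent, skill) -> list of messages

    def send(self, receiver, skill, message):
        """Deliver the message only if the receiver actually holds that skill."""
        if skill not in self.assigned[receiver]:
            return False
        self.boxes[(receiver, skill)].append(message)
        return True

    def read(self, agent, skill):
        """Return a copy of the messages waiting for this agent's skill."""
        return list(self.boxes.get((agent, skill), []))

skills = ["algebra", "code_review", "counterexample"]
incidence = [[1, 0, 1], [0, 1, 0]]  # agent 0: algebra + counterexample; agent 1: code_review
mail = SkillMailboxes(incidence, skills)
ok = mail.send(0, "counterexample", "check n = -8")          # delivered
dropped = mail.send(1, "algebra", "simplify the expression")  # agent 1 lacks the skill
```

Rejecting messages for unassigned skills is a design choice of this sketch; a real system might instead reroute them to another agent.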

If this is right

Multi-agent designs can handle tasks whose required capability mixes were absent from training data.
The incidence structure integrates directly into execution without separate post-processing steps.
Compositional node construction and topology optimization can be pursued as complementary rather than competing directions.
Robustness to changes in the underlying skill library increases because new agents are assembled on the fly.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same incidence-matrix idea could be tested in domains where capabilities are already decomposed, such as tool-use agents or modular robotics controllers.
If skill libraries grow large, the prediction step for the incidence matrix may become a new bottleneck worth optimizing separately.
The framework implicitly assumes that message routing by skill mailbox remains efficient even as the number of skills per agent increases.

Load-bearing premise

Skills from the library are modular and reusable enough that incidence-matrix combinations produce effective agent capabilities even for tasks needing previously unseen mixes.

What would settle it

A controlled test in which tasks require novel skill combinations and performance with the predicted incidence matrix is no higher than with a fixed-agent baseline that cannot recombine skills.

Figures

Figures reproduced from arXiv: 2606.19758 by Haoyue Liu, Kun Zeng, Siyue Chen, Siyu Zhang, Xiaoying Tang, Yuecheng Zhuo, Yu Huo, Yuquan Lu.

**Figure 2.** Performance plot across six benchmarks and …

**Figure 3.** Framework of SIGMA. Offline, an incidence generator is trained with deterministic pseudo-label supervision to assign reusable skills to agent slots. Online, the predicted task-specific incidence matrix is converted into skill-composed agent embeddings, which are used to decode an agent communication graph with skill-specific message routing.

**Figure 4.** Performance drop from source skill libraries …

**Figure 5.** Visualization of the accuracy and token trade …

**Figure 6.** Example pseudo-label for skill-agent inci…

**Figure 7.** Training dynamic of loss. The incidence term trains the skill-agent assignment predictor. The edge term is used only when annotated communication graphs or a shared structural prior are available. The sparsity term prevents degenerate solutions in which every slot receives too many skills or the decoded communication graph becomes overly dense.

**Figure 8.** Prompt format used for agent execution.

**Figure 9.** Example runtime prompt used by SIGMA. The system message injects the assigned skill card, profile state, and answer-format constraints, while the user message provides the concrete question and routed mailbox evidence.

**Figure 10.** Skill distribution across datasets.

**Figure 11.** A HumanEval case where SIGMA’s chain topology turns a simple but risky cube-root implementation into a negative-aware final solution.

**Figure 12.** A complete case where the final decision node repairs answer-label noise by reading the agent rationales.

**Figure 13.** A MMLU case where counterexample skills provide the decisive corrective signal and allow the final decision node to override a noisy compact-answer majority.

**Figure 14.** A MMLU case where disagreement among skill-composed agents is recoverable because the final mailbox contains enough grounded evidence for the correct answer.

**Figure 15.** A MMLU failure case where skill diversity does not help because every card shares the same false legal premise.

**Figure 16.** A MMLU failure case where three agents identify the correct answer, but the final decision node follows a minority distractor.
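The Figure 7 caption names three loss terms (incidence, edge, sparsity) but gives no formula. The sketch below shows one plausible way to combine them, assuming binary cross-entropy for the incidence and edge terms and a mean-L1 penalty for sparsity. The function names, the weights `w_edge` and `w_sparse`, and the exact form of each term are assumptions, not the paper's definition.

```python
import math

def bce(pred, target, eps=1e-7):
    """Mean binary cross-entropy over two same-shaped nested lists."""
    total, count = 0.0, 0
    for p_row, t_row in zip(pred, target):
        for p, t in zip(p_row, t_row):
            p = min(max(p, eps), 1.0 - eps)  # clamp to avoid log(0)
            total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
            count += 1
    return total / count

def l1_mean(matrix):
    """Mean absolute value of all entries; larger means denser."""
    flat = [abs(x) for row in matrix for x in row]
    return sum(flat) / len(flat)

def combined_loss(inc_pred, inc_label, edge_pred, edge_label=None,
                  w_edge=1.0, w_sparse=0.01):
    """Incidence term, plus an edge term only when edge labels exist,
    plus a sparsity penalty on both predicted matrices (assumed form)."""
    loss = bce(inc_pred, inc_label)
    if edge_label is not None:
        loss += w_edge * bce(edge_pred, edge_label)
    loss += w_sparse * (l1_mean(inc_pred) + l1_mean(edge_pred))
    return loss

# Toy probabilities and pseudo-labels (illustrative values only).
inc_pred, inc_label = [[0.9, 0.1]], [[1, 0]]
edge_pred = [[0.0, 0.5], [0.5, 0.0]]
no_edge_labels = combined_loss(inc_pred, inc_label, edge_pred)
with_edge_labels = combined_loss(inc_pred, inc_label, edge_pred,
                                 edge_label=[[0, 1], [1, 0]])
```

Making the edge term optional mirrors the caption's statement that it is used only when annotated communication graphs or a shared structural prior are available.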

Original abstract

Existing graph-based multi-agent system (MAS) designers mainly improve collaboration by optimizing communication topologies over predefined agents, roles, or groups. However, because each node remains a closed-set entity, these methods struggle to generalize to tasks that require unseen combinations of capabilities. We propose SIGMA, a skill-incidence graph framework that constructs agents as task-conditioned bundles of reusable skills. Given a task and a skill library, SIGMA predicts a skill-agent incidence matrix, composes agent node embeddings from selected skills, and decodes a communication topology over the constructed agents. During execution, skill-specific mailboxes route messages to the relevant assigned capabilities, making the incidence structure directly operational. Across six reasoning and coding benchmarks with three base LLMs, SIGMA achieves the best average performance and improves over CARD, the strongest non-compositional topology-based baseline, by 2.06, 2.36, and 1.75 points, respectively. It also shows stronger robustness to unseen skill libraries, with an average performance drop of only 0.96 points. These results suggest that compositional node construction is a complementary and important axis for multi-agent design beyond communication topology optimization. Code is available at https://anonymous.4open.science/r/SIGMA-2338/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

SIGMA's use of skill-incidence graphs to compose agents from a library is a clear step beyond pure topology optimization, but the abstract reports no ablations, so the roughly 2-point gains cannot be pinned on the composition mechanism.


The core new piece here is treating agents as bundles drawn from a shared skill library via a predicted incidence matrix, then wiring skill-specific mailboxes on top of the decoded topology. That directly targets the closed-set node problem the abstract flags in prior graph-based MAS work. The reported edge over CARD (1.75–2.36 points on average across three LLMs, plus a smaller drop on unseen skill sets) is the main empirical hook.

What the paper does cleanly is frame the problem as two orthogonal axes (node construction and topology) and then operationalize the first one. The robustness claim is also stated plainly.

The soft spot is exactly the one in the stress-test note. The abstract only shows an end-to-end win against a non-compositional baseline; there is no control that keeps the mailboxes and topology decoder but removes the incidence-based composition. Without that, or at least an ablation that swaps in random incidence matrices, the attribution stays untested. The methods section is not visible in what was supplied, so we also lack any detail on how the incidence matrix is actually predicted or regularized.
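The random-incidence control mentioned above could be built along these lines. This is a sketch under assumptions: it shuffles each agent's row so the number of skills per agent is preserved while the learned assignment is destroyed. The function name and the example matrix are hypothetical.

```python
import random

def random_incidence_like(incidence, seed=0):
    """Shuffle each agent's row so it keeps the same number of skills
    but loses any learned skill-to-agent assignment."""
    rng = random.Random(seed)
    shuffled = []
    for row in incidence:
        new_row = list(row)   # copy so the predicted matrix is untouched
        rng.shuffle(new_row)
        shuffled.append(new_row)
    return shuffled

# A made-up predicted matrix: 3 agent slots over a 4-skill library.
predicted = [[1, 0, 1, 0], [0, 1, 0, 0], [1, 1, 0, 1]]
control = random_incidence_like(predicted, seed=7)
```

Matching the per-agent skill count keeps prompt length and sparsity comparable, so any performance gap can be attributed more cleanly to which skills are chosen. A shuffled row can coincide with the original by chance, so several seeds would be needed in practice.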

This is aimed at groups already building or benchmarking LLM multi-agent systems who care about generalization across task skill profiles. A reader who wants a new design knob to try will get an idea worth testing; someone looking for a settled result will not.

If the full paper supplies the missing controls and a reproducible implementation, it is worth sending out for review. Based on the abstract alone the experimental isolation is too thin to treat the central claim as demonstrated.

Referee Report

2 major / 1 minor

Summary. The paper proposes SIGMA, a skill-incidence graph framework for multi-agent system design. Instead of optimizing communication topologies over fixed agent nodes, SIGMA predicts a task-conditioned skill-agent incidence matrix from a skill library, composes node embeddings from the selected reusable skills, decodes a topology over the resulting agents, and uses skill-specific mailboxes at runtime. Experiments across six reasoning and coding benchmarks with three base LLMs report that SIGMA outperforms the non-compositional CARD baseline by 2.06, 2.36, and 1.75 points on average and exhibits greater robustness (0.96-point average drop) when skill libraries are replaced with unseen ones. The central claim is that compositional node construction constitutes an important complementary axis to topology optimization.
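The three per-model gains quoted in the summary can be summarized with simple arithmetic. The snippet only restates the numbers from the abstract; the variable names are illustrative.

```python
# Reported gains over CARD, one per base LLM (from the abstract).
gains = [2.06, 2.36, 1.75]

low, high = min(gains), max(gains)
mean_gain = sum(gains) / len(gains)

print(f"range: {low:.2f} to {high:.2f}, mean: {mean_gain:.2f}")
# prints "range: 1.75 to 2.36, mean: 2.06"
```

The full range of the reported gains is therefore 1.75 to 2.36 points, with a mean of about 2.06.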

Significance. If the attribution of gains to incidence-based composition is substantiated, the work would usefully expand the design space for graph-based MAS beyond topology search. The public code release is a clear strength that enables direct verification and extension. The empirical framing (benchmark deltas rather than parameter-free derivations) means significance hinges on the quality of the experimental isolation of the compositional mechanism.

major comments (2)

[Abstract] Abstract: The reported 1.75–2.36 point gains over CARD are presented as end-to-end results without any ablation that disables incidence-matrix prediction and skill-composition while retaining skill-specific mailboxes and the topology decoder. This omission leaves open whether the observed improvements are driven by the claimed compositional node construction or by auxiliary pipeline components.
[Abstract] Abstract (and experimental section): The robustness claim (0.96-point average drop under unseen skill libraries) is stated without describing how the replacement libraries are sampled or whether the incidence predictor is retrained or zero-shot on the new libraries. Without these controls, it is unclear whether the smaller drop truly demonstrates superior generalization of the compositional representation.

minor comments (1)

[Abstract] The abstract states numeric improvements but supplies no statistical significance tests, variance across runs, or number of seeds; adding these details would strengthen the empirical claims.
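The kind of reporting requested here could look like the following sketch, which computes the mean and sample standard deviation of a per-seed paired difference using Python's standard `statistics` module. The helper names are hypothetical, and the scores are placeholders for illustration only, not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) for per-seed scores."""
    if len(scores) < 2:
        raise ValueError("need at least two seeds to estimate variance")
    return statistics.mean(scores), statistics.stdev(scores)

def paired_gain(method_scores, baseline_scores):
    """Mean and std of the per-seed difference (same seeds, same order)."""
    diffs = [m - b for m, b in zip(method_scores, baseline_scores)]
    return summarize_runs(diffs)

# Placeholder numbers for illustration only; they are NOT from the paper.
toy_method = [80.0, 82.0, 81.0]
toy_baseline = [79.0, 79.0, 80.0]
mean_diff, std_diff = paired_gain(toy_method, toy_baseline)
```

Reporting the paired difference rather than two separate means removes seed-level variation that both systems share.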

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on experimental isolation and controls. We address each major comment below.


Referee: [Abstract] Abstract: The reported 1.75–2.36 point gains over CARD are presented as end-to-end results without any ablation that disables incidence-matrix prediction and skill-composition while retaining skill-specific mailboxes and the topology decoder. This omission leaves open whether the observed improvements are driven by the claimed compositional node construction or by auxiliary pipeline components.

Authors: We agree that an explicit ablation disabling incidence-matrix prediction and skill composition (while retaining mailboxes and the topology decoder) would more cleanly isolate the contribution of compositional node construction. The existing CARD comparison controls for topology but does not hold the auxiliary components fixed in this way. In the revised manuscript we will add this ablation. revision: yes
Referee: [Abstract] Abstract (and experimental section): The robustness claim (0.96-point average drop under unseen skill libraries) is stated without describing how the replacement libraries are sampled or whether the incidence predictor is retrained or zero-shot on the new libraries. Without these controls, it is unclear whether the smaller drop truly demonstrates superior generalization of the compositional representation.

Authors: We will expand the experimental section to specify the sampling procedure used to construct the replacement (unseen) skill libraries and to state explicitly that the incidence predictor is evaluated zero-shot without retraining. These details will be added to support the generalization claim. revision: yes
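A zero-shot evaluation on unseen skill libraries requires a disjoint split of the library. The sketch below shows one simple way to construct such a split and to compute the resulting drop. The sampling rule, the function names, and the 30% fraction are assumptions; the abstract does not describe how the replacement libraries were built.

```python
import random

def split_skill_library(skills, unseen_fraction=0.3, seed=0):
    """Partition skills into disjoint (source, unseen) lists.

    The predictor would be trained on `source` only and evaluated,
    without retraining, on tasks that draw from `unseen`.
    """
    rng = random.Random(seed)
    shuffled = list(skills)
    rng.shuffle(shuffled)
    n_unseen = max(1, round(len(shuffled) * unseen_fraction))
    return sorted(shuffled[n_unseen:]), sorted(shuffled[:n_unseen])

def performance_drop(source_score, unseen_score):
    """Points lost when moving from the source library to the unseen one."""
    return source_score - unseen_score

# Hypothetical skill identifiers.
library = [f"skill_{i}" for i in range(10)]
source, unseen = split_skill_library(library, unseen_fraction=0.3, seed=1)
```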

Circularity Check

0 steps flagged

No circularity; empirical benchmark results only


The paper introduces the SIGMA framework for constructing agents via skill-incidence matrices and evaluates it through end-to-end experiments on six benchmarks against the CARD baseline, reporting average performance improvements of 1.75–2.36 points. No derivation chain, first-principles equations, fitted parameters renamed as predictions, or self-citation load-bearing steps appear in the provided text. All reported quantities are direct experimental outcomes on held-out tasks rather than quantities forced by construction from the same data or prior self-referential results, making the work self-contained as an empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review supplies no explicit parameter counts or formal axioms; the central claim rests on the unstated premise that skills are modular and that the incidence prediction model generalizes.

axioms (1)

Domain assumption: Skills from the library are modular and reusable across tasks.
The incidence-matrix construction presupposes that selected skills can be bundled into functional agents for novel combinations.



Reference graph

Works this paper leans on

56 extracted references · 1 canonical work pages · 1 internal anchor

[1]

Advances in neural information processing systems , volume=

CAMEL: Communicative Agents for ``Mind'' Exploration of Large Language Model Society , author=. Advances in neural information processing systems , volume=
[2]

First conference on language modeling , year=

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversations , author=. First conference on language modeling , year=
[3]

International Conference on Learning Representations , year=

MetaGPT: Meta programming for a multi-agent collaborative framework , author=. International Conference on Learning Representations , year=
[4]

Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

Chatdev: Communicative agents for software development , author=. Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=
[5]

International Conference on Learning Representations , year=

Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors , author=. International Conference on Learning Representations , year=
[6]

First Conference on Language Modeling , year=

A dynamic LLM-powered agent network for task-oriented agent collaboration , author=. First Conference on Language Modeling , year=
[7]

Forty-first International Conference on Machine Learning , year=

GPTSwarm: Language Agents as Optimizable Graphs , author=. Forty-first International Conference on Machine Learning , year=
[8]

arXiv preprint arXiv:2410.11782 , year=

G-designer: Architecting multi-agent communication topologies via graph neural networks , author=. arXiv preprint arXiv:2410.11782 , year=

arXiv
[9]

International Conference on Learning Representations , year=

Cut the crap: An economical communication pipeline for llm-based multi-agent systems , author=. International Conference on Learning Representations , year=
[11]

Advances in neural information processing systems , volume=

Toolformer: Language models can teach themselves to use tools , author=. Advances in neural information processing systems , volume=
[12]

arXiv preprint arXiv:2306.05301 , year=

Toolalpaca: Generalized tool learning for language models with 3000 simulated cases , author=. arXiv preprint arXiv:2306.05301 , year=

Pith/arXiv arXiv
[15]

Advances in Neural Information Processing Systems , volume=

Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face , author=. Advances in Neural Information Processing Systems , volume=
[16]

Proceedings of the 2023 conference on empirical methods in natural language processing , pages=

Api-bank: A comprehensive benchmark for tool-augmented llms , author=. Proceedings of the 2023 conference on empirical methods in natural language processing , pages=

2023
[18]

International Conference on Learning Representations , year=

Toolllm: Facilitating large language models to master 16000+ real-world apis , author=. International Conference on Learning Representations , year=
[20]

International Conference on Learning Representations , year=

Livecodebench: Holistic and contamination free evaluation of large language models for code , author=. International Conference on Learning Representations , year=
[24]

Proceedings of the 2015 conference on empirical methods in natural language processing , pages=

Solving general arithmetic word problems , author=. Proceedings of the 2015 conference on empirical methods in natural language processing , pages=

2015
[26]

Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=

Program induction by rationale generation: Learning to solve and explain algebraic word problems , author=. Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers) , pages=
[27]

Advances in neural information processing systems , volume=

Chain-of-thought prompting elicits reasoning in large language models , author=. Advances in neural information processing systems , volume=
[28]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Agentdropout: Dynamic agent elimination for token-efficient and high-performance llm-based multi-agent collaboration , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[29]

International Conference on Machine Learning , pages=

G-Designer: Architecting Multi-agent Communication Topologies via Graph Neural Networks , author=. International Conference on Machine Learning , pages=. 2025 , organization=

2025
[30]

Forty-first international conference on machine learning , year=

Improving factuality and reasoning in language models through multiagent debate , author=. Forty-first international conference on machine learning , year=
[31]

Proceedings of the 2024 conference on empirical methods in natural language processing , pages=

Encouraging divergent thinking in large language models through multi-agent debate , author=. Proceedings of the 2024 conference on empirical methods in natural language processing , pages=

2024
[32]

Proceedings of the 36th annual acm symposium on user interface software and technology , pages=

Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th annual acm symposium on user interface software and technology , pages=
[33]

Proceedings of the AAAI Conference on Artificial Intelligence , volume=

Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation , author=. Proceedings of the AAAI Conference on Artificial Intelligence , volume=
[37]

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, and 1 others. 2025. gpt-oss-120b & gpt-oss-20b model card. arXiv preprint arXiv:2508.10925

Pith/arXiv arXiv 2025
[38]

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, and 1 others. 2021. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374

Pith/arXiv arXiv 2021
[39]

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, and 1 others. 2024. Agentverse: Facilitating multi-agent collaboration and exploring emergent behaviors. In International Conference on Learning Representations

2024
[40]

Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, and 1 others. 2021. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168

Pith/arXiv arXiv 2021
[41]

Yilun Du, Shuang Li, Antonio Torralba, Joshua B Tenenbaum, and Igor Mordatch. 2024. Improving factuality and reasoning in language models through multiagent debate. In Forty-first international conference on machine learning

2024
[42]

Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. 2020. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300

Pith/arXiv arXiv 2020
[43]

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Jinlin Wang, Ceyao Zhang, Steven Yau, Zijuan Lin, Liyang Zhou, and 1 others. 2024. Metagpt: Meta programming for a multi-agent collaborative framework. In International Conference on Learning Representations

2024
[44]

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, and 1 others. 2024. Gpt-4o system card. arXiv preprint arXiv:2410.21276

Pith/arXiv arXiv 2024
[45]

Ehud Karpas, Omri Abend, Yonatan Belinkov, Barak Lenz, Opher Lieber, Nir Ratner, Yoav Shoham, Hofit Bata, Yoav Levine, Kevin Leyton-Brown, and 1 others. 2022. Mrkl systems: A modular, neuro-symbolic architecture that combines large language models, external knowledge sources and discrete reasoning. arXiv preprint arXiv:2205.00445

Pith/arXiv arXiv 2022
[46]

Guohao Li, Hasan Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. 2023 a . Camel: Communicative agents for ``mind'' exploration of large language model society. Advances in neural information processing systems, 36:51991--52008

2023
[47]

Minghao Li, Yingxiu Zhao, Bowen Yu, Feifan Song, Hangyu Li, Haiyang Yu, Zhoujun Li, Fei Huang, and Yongbin Li. 2023 b . Api-bank: A comprehensive benchmark for tool-augmented llms. In Proceedings of the 2023 conference on empirical methods in natural language processing, pages 3102--3116

2023
[48]

Shiyuan Li, Yixin Liu, Qingsong Wen, Chengqi Zhang, and Shirui Pan. 2026. Assemble your crew: Automatic multi-agent communication topology design via autoregressive graph generation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 23142--23150

2026
[49]

Tian Liang, Zhiwei He, Wenxiang Jiao, Xing Wang, Yan Wang, Rui Wang, Yujiu Yang, Shuming Shi, and Zhaopeng Tu. 2024. Encouraging divergent thinking in large language models through multi-agent debate. In Proceedings of the 2024 conference on empirical methods in natural language processing, pages 17889--17904

2024
[50]

Wang Ling, Dani Yogatama, Chris Dyer, and Phil Blunsom. 2017. Program induction by rationale generation: Learning to solve and explain algebraic word problems. In Proceedings of the 55th annual meeting of the association for computational linguistics (volume 1: Long papers), pages 158--167

2017
[51]

Zijun Liu, Yanzhe Zhang, Peng Li, Yang Liu, and Diyi Yang. 2024. A dynamic llm-powered agent network for task-oriented agent collaboration. In First Conference on Language Modeling

2024
[52]

Joon Sung Park, Joseph O'Brien, Carrie Jun Cai, Meredith Ringel Morris, Percy Liang, and Michael S Bernstein. 2023. Generative agents: Interactive simulacra of human behavior. In Proceedings of the 36th annual acm symposium on user interface software and technology, pages 1--22

2023
[53]

Arkil Patel, Satwik Bhattamishra, and Navin Goyal. 2021. https://doi.org/10.18653/v1/2021.naacl-main.168 Are NLP models really able to solve simple math word problems? In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 2080--2094, Online. Association for ...

work page internal anchor Pith review doi:10.18653/v1/2021.naacl-main.168 2021
[54]

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. 2023. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334

Pith/arXiv arXiv 2023
[55]

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, and 1 others. 2024. Chatdev: Communicative agents for software development. In Proceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 15174--15186

2024
[56]

Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, and 1 others. 2024. Toolllm: Facilitating large language models to master 16000+ real-world apis. In International Conference on Learning Representations

2024
[57]

Subhro Roy and Dan Roth. 2015. Solving general arithmetic word problems. In Proceedings of the 2015 conference on empirical methods in natural language processing, pages 1743--1752

2015
[58]

Timo Schick, Jane Dwivedi-Yu, Roberto Dess \` , Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. 2023. Toolformer: Language models can teach themselves to use tools. Advances in neural information processing systems, 36:68539--68551

2023
[59]

Yongliang Shen, Kaitao Song, Xu Tan, Dongsheng Li, Weiming Lu, and Yueting Zhuang. 2023. Hugginggpt: Solving ai tasks with chatgpt and its friends in hugging face. Advances in Neural Information Processing Systems, 36:38154--38180

2023
[60]

Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. 2023. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291

Pith/arXiv arXiv 2023
[61]

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, and 1 others. 2022. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems, 35:24824--24837

2022
[62]

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, and 1 others. 2024. Autogen: Enabling next-gen llm applications via multi-agent conversations. In First conference on language modeling

2024
[63]

Tongtong Wu, Yanming Li, Ziye Tang, Chen Jiang, Linhao Luo, Guilin Qi, Shirui Pan, and Gholamreza Haffari. 2026. Card: Towards conditional design of multi-agent topological structures. arXiv preprint arXiv:2603.01089

arXiv 2026
[64]

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, and 1 others. 2025. Qwen3 technical report. arXiv preprint arXiv:2505.09388

Pith/arXiv arXiv 2025
[65]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2022. React: Synergizing reasoning and acting in language models. arXiv preprint arXiv:2210.03629

Pith/arXiv arXiv 2022
[66]

Guibin Zhang, Yanwei Yue, Zhixun Li, Sukwon Yun, Guancheng Wan, Kun Wang, Dawei Cheng, Jeffrey Yu, and Tianlong Chen. 2025 a . Cut the crap: An economical communication pipeline for llm-based multi-agent systems. In International Conference on Learning Representations

2025
[67]

Guibin Zhang, Yanwei Yue, Xiangguo Sun, Guancheng Wan, Miao Yu, Junfeng Fang, Kun Wang, Tianlong Chen, and Dawei Cheng. 2025 b . G-designer: Architecting multi-agent communication topologies via graph neural networks. In International Conference on Machine Learning, pages 76678--76692. PMLR

[68]

Mingchen Zhuge, Wenyi Wang, Louis Kirsch, Francesco Faccio, Dmitrii Khizbullin, and Jürgen Schmidhuber. 2024. Gptswarm: Language agents as optimizable graphs. In Forty-first International Conference on Machine Learning

