Agents that Matter: Optimizing Multi-Agent LLMs via Removal-Based Attribution

Chris Lin; Mingyu Lu; Su-In Lee; Yushan Huang

arxiv: 2605.27621 · v1 · pith:FIRRMIVInew · submitted 2026-05-26 · 💻 cs.MA · cs.CL


Pith reviewed 2026-06-29 14:33 UTC · model grok-4.3

classification 💻 cs.MA cs.CL

keywords: multi-agent systems, agent attribution, cooperative game, leave-one-out, model replacement, LLM optimization, medical MAS


The pith

Substituting low-contribution agent models in multi-agent LLM systems improves performance up to 17% and cuts costs up to 35%.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper formalizes the attribution of contributions among agents in multi-agent LLM systems as a cooperative game defined by coalition distribution, removal protocol, and target metric. It establishes that leave-one-out removal identifies bottleneck agents as effectively as combinatorial methods but with far lower computational cost. The central practical result is that using these attribution scores to guide model replacement for low-contribution agents yields substantial gains in task performance and reductions in cost across benchmarks. The framework is also applied to audit a medical multi-agent system, showing that contributions to diagnostic accuracy and ethical behavior are often decoupled and can be improved separately by targeted interventions.

Core claim

The authors show that removal-based attribution, formalized as a cooperative game, allows identification of agents whose underlying models can be substituted to improve overall system performance by up to 17% and reduce cost by up to 35% on three benchmarks, while also demonstrating that different removal protocols induce distinct attribution games and that in a medical MAS, agent contributions to accuracy and ethics can be decoupled.

What carries the argument

Removal-based attribution as a cooperative game, with Leave-One-Out (LOO) and combinatorial variants, and model replacement on low-attribution agents.
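As a rough illustration of the two attribution families named above, the sketch below computes Leave-One-Out scores and exact Shapley values on a toy cooperative game. The agent names and the `toy_value` utility are hypothetical, and Shapley is used here only as a familiar example of a combinatorial method; this page does not say which combinatorial estimators the paper uses.

```python
from itertools import combinations
from math import factorial


def loo(agents, value):
    """Leave-One-Out: drop in the metric when a single agent is removed."""
    full = value(frozenset(agents))
    return {a: full - value(frozenset(agents) - {a}) for a in agents}


def shapley(agents, value):
    """Exact Shapley values: weighted marginal gain over all coalitions."""
    n = len(agents)
    scores = {}
    for a in agents:
        others = [b for b in agents if b != a]
        total = 0.0
        for k in range(n):  # coalition sizes 0 .. n-1
            weight = factorial(k) * factorial(n - k - 1) / factorial(n)
            for subset in combinations(others, k):
                s = frozenset(subset)
                total += weight * (value(s | {a}) - value(s))
        scores[a] = total
    return scores


def toy_value(coalition):
    """Hypothetical task-success metric for a coalition of agents."""
    score = 0.0
    if "planner" in coalition:
        score += 0.5
    if "solver" in coalition:
        score += 0.25
    if {"planner", "solver"} <= coalition:
        score += 0.25  # interaction bonus when both are present
    return score  # "critic" adds nothing in this toy game


AGENTS = ["planner", "solver", "critic"]
loo_scores = loo(AGENTS, toy_value)
shap_scores = shapley(AGENTS, toy_value)
print(loo_scores)
print(shap_scores)
```

In this toy game both methods produce the same ranking, while LOO needs only one evaluation per agent plus one for the full system, whereas the exact combinatorial computation visits every coalition.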

Load-bearing premise

The removal-based attribution scores identify agents for whom model replacement will actually deliver the performance and cost improvements.

What would settle it

Replacing the models of agents scored as low-contribution by the attribution method results in no improvement or even a decrease in task performance or an increase in cost on the benchmarks.

Figures

Figures reproduced from arXiv: 2605.27621 by Chris Lin, Mingyu Lu, Su-In Lee, Yushan Huang.

**Figure 1.** (a) Overview of removal-based attribution for multi-agent systems and (b) removal protocols. Adjacent text from the paper: "… only its underlying model capability but also its assigned role and interactions with other agents (Zhu et al., 2025). This creates a challenge for MAS evaluation and optimization: How can we identify bottleneck agents and quantify each agent’s contribution to the overall system performance? Addressing this chall…"

**Figure 2.** MAS communication topologies. Adjacent text from the paper: "… model scale or tool access. Because every role remains instantiated, the MAS communication graph is preserved. The resulting attribution measures the marginal utility of changing the model assigned to an agent: A weaker replacement estimates the contribution of the original model’s additional capacity, while a stronger replacement estimates the potential benefit of upgradin…"

**Figure 3.** Comparison of deletion AUC across differ…

**Figure 4.** Agent attribution under agent ablation (filled)…

**Figure 5.** Task success versus closed-source model token usage for bottom-…

**Figure 7.** (a) Agent attribution scores with model replacement for MedQA (x-axis) and MedEthicsQA (y-axis). (b) Performance after removing the bottom-k agents ranked by MedEthicsQA attribution. Adjacent text from the paper: "Leveraging these insights, we target agents whose original backbones degrade ethical performance while contributing little to diagnostic accuracy: Specialist, Triage, and CoT Reviewer. Replacing these three agents increase…"

**Figure 8.** Comparison of deletion AUC across differ…

**Figure 9.** Agent attribution under model replacement…

**Figure 10.** Agent attribution under model replacement…

**Figure 11.** LOO attribution rank comparison under re…

**Figure 13.** Average task success (y-axis) versus token usage (top) and cost (bottom) over 50 instances (x-axis) for bottom-k agent replacement (Qwen-122B-A10B) on WorkBench (top) and BrowseComp-Plus (bottom). Variances indicate the mean ± standard deviation across 3 runs.
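Figures 3 and 8 refer to deletion AUC, which this page does not define. The sketch below shows one common formulation, assumed here rather than taken from the paper: remove agents one at a time in order of their attribution score, record the metric after each removal, and take the normalized area under that curve. The agent names, scores, and `toy_value` utility are hypothetical.

```python
def deletion_curve(agents, scores, value, most_important_first=True):
    """Metric after cumulatively removing agents in attribution order."""
    order = sorted(agents, key=lambda a: scores[a], reverse=most_important_first)
    remaining = set(agents)
    curve = [value(frozenset(remaining))]
    for agent in order:
        remaining.discard(agent)
        curve.append(value(frozenset(remaining)))
    return curve


def normalized_auc(curve):
    """Trapezoidal area under the curve, divided by the number of removals."""
    steps = len(curve) - 1
    area = sum((curve[i] + curve[i + 1]) / 2 for i in range(steps))
    return area / steps


def toy_value(coalition):
    """Hypothetical task-success metric for a coalition of agents."""
    score = 0.0
    if "planner" in coalition:
        score += 0.5
    if "solver" in coalition:
        score += 0.25
    if {"planner", "solver"} <= coalition:
        score += 0.25
    return score


AGENTS = ["planner", "solver", "critic"]
SCORES = {"planner": 0.75, "solver": 0.5, "critic": 0.0}  # hypothetical attributions

auc_top_first = normalized_auc(deletion_curve(AGENTS, SCORES, toy_value, True))
auc_bottom_first = normalized_auc(deletion_curve(AGENTS, SCORES, toy_value, False))
print(auc_top_first, auc_bottom_first)
```

Under this formulation, a ranking that correctly puts bottleneck agents first makes the metric collapse quickly, so the most-important-first AUC is small relative to the least-important-first AUC.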

Original abstract

As multi-agent systems (MAS) become increasingly complex, identifying the contributions of individual agents is critical for system optimization. However, existing approaches lack a rigorous, unified framework for credit assignment. In this work, we formalize agent attribution as a cooperative game, parameterized by the coalition distribution, removal protocol, and target metric. Using this framework, we show that Leave-One-Out (LOO) identifies bottleneck agents as effectively as combinatorial methods, but at a fraction of the computational cost. We also demonstrate that removal protocols induce distinct games: Agent ablation isolates structural bottlenecks, whereas introspective LLM judges fail to faithfully approximate this behavior. Furthermore, to evaluate the utility of specific agent backbones, we introduce attribution via model replacement. By substituting underlying models of low-contribution agents, we improve task performance by up to 17% while reducing cost by up to 35% across three benchmarks. Finally, we apply our framework to audit a medical MAS, revealing that agent contributions to diagnostic accuracy and ethical behavior are often decoupled. By intervening on counterproductive roles, we observe an increase in ethics alignment while maintaining diagnostic accuracy. Overall, this work provides a principled approach for cost-effective MAS attribution and intervention.
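For orientation, the standard definitions behind the abstract's comparison can be written as follows. The notation is an assumption made for this note, not the paper's own: N is the set of agents and v(S) is the target metric when only the agents in coalition S are kept and the rest are removed under the chosen protocol. The Shapley value stands in for "combinatorial methods" as a familiar example.

```latex
\begin{align}
\phi_i^{\mathrm{LOO}} &= v(N) - v(N \setminus \{i\}), \\
\phi_i^{\mathrm{Shapley}} &= \sum_{S \subseteq N \setminus \{i\}}
  \frac{|S|!\,(|N|-|S|-1)!}{|N|!}\,\bigl(v(S \cup \{i\}) - v(S)\bigr).
\end{align}
```

Computing every LOO score takes |N| + 1 evaluations of v, while the exact Shapley sum ranges over all 2^{|N|} coalitions, which is the cost gap the abstract alludes to.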

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The cooperative-game framing for agent attribution is a reasonable formalization and the medical audit is a useful application, but the headline performance and cost gains from model replacement lack controls showing that the attribution scores, rather than any replacement, drive the results.


The punchline for you is that the paper formalizes agent attribution in multi-agent LLM systems as a cooperative game and uses it to guide model replacements that reportedly improve performance and cut costs, but the results don't include controls to confirm the attribution scores are responsible for those gains rather than the replacements themselves.

They define the game with parameters for how coalitions are formed, how agents are removed, and what metric is used. This lets them compare Leave-One-Out to full combinatorial approaches and show LOO finds the same bottlenecks at lower cost. Different removal protocols create different games, with ablation revealing structural issues that LLM-based judges don't capture well. The model replacement part is where they intervene by swapping in different backbones for low-attribution agents. They also run it on a medical MAS to show that diagnostic accuracy and ethical behavior can be decoupled, and intervening on the latter improves ethics without hurting accuracy.
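The replacement step described in this paragraph can be sketched as a bottom-k swap: rank agents by attribution and reassign the backbone of the k lowest-scoring ones. Everything in the example (the `Agent` record, model labels, and scores) is hypothetical and only illustrates the selection logic, not the paper's pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Agent:
    name: str
    model: str          # backbone label (hypothetical)
    attribution: float  # score from some attribution method


def replace_bottom_k(agents, k, new_model):
    """Return a copy of the system with the k lowest-attribution agents swapped."""
    ranked = sorted(agents, key=lambda a: a.attribution)
    targets = {a.name for a in ranked[:k]}
    return [
        Agent(a.name, new_model if a.name in targets else a.model, a.attribution)
        for a in agents
    ]


system = [
    Agent("planner", "closed-model", 0.40),
    Agent("solver", "closed-model", 0.20),
    Agent("critic", "closed-model", -0.02),
    Agent("retriever", "closed-model", 0.00),
]

updated = replace_bottom_k(system, k=2, new_model="open-model")
for agent in updated:
    print(agent.name, agent.model)
```

Because every role stays in place and only the model label changes, the communication graph of the system is untouched, which matches the description of model replacement in the Figure 2 excerpt.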

The formalization and the explicit comparison of protocols are the new pieces. The medical audit is a nice real-world application that demonstrates the framework's utility for auditing.

The main weakness is the lack of controls on the replacement experiments. The abstract reports the 17% and 35% numbers after substituting low-contribution agents, but without showing what happens when high-contribution or random agents are replaced instead, we can't tell if the attribution ranking is key or if any model swap would produce similar deltas. The benchmarks and datasets are also not detailed in the abstract, which makes it hard to assess the claims. If the full paper has those controls and details, that would strengthen it a lot.

This paper is for researchers and practitioners building multi-agent LLM applications who want a more principled way to attribute and optimize. A reader interested in credit assignment or system auditing would get value from the framework even if the intervention results need more validation. It deserves a serious referee because the core idea is coherent and the application is timely, though the experimental design around the gains needs scrutiny.

Referee Report

3 major / 2 minor

Summary. The paper formalizes agent attribution in multi-agent LLM systems as a cooperative game parameterized by coalition distribution, removal protocol, and target metric. It shows that Leave-One-Out (LOO) attribution matches combinatorial methods in identifying bottleneck agents at lower cost, that ablation vs. introspective LLM-judge removal protocols induce distinct games, introduces model-replacement attribution to evaluate agent backbones, reports up to 17% task-performance gains and 35% cost reductions by substituting low-contribution agents across three benchmarks, and applies the framework to a medical MAS to reveal decoupled contributions to diagnostic accuracy and ethics, enabling interventions that improve ethical alignment without harming accuracy.

Significance. If the central empirical claims hold, the work would be significant for multi-agent systems research by supplying a unified game-theoretic credit-assignment framework that directly supports cost-effective optimization and auditing. The demonstration that LOO suffices for bottleneck detection and the practical medical-MAS case study illustrate utility beyond theory. The model-replacement intervention technique offers a concrete mechanism for translating attribution scores into system improvements.

major comments (3)

[Abstract / experimental results] Abstract and experimental results sections: the headline claim that substituting low-contribution agents yields up to 17% performance and 35% cost gains is load-bearing for the utility of the attribution framework, yet no controls (replacement of high-attribution agents, random agents, or matched non-attribution baselines) are described. Without these, it remains possible that any model substitution, independent of the LOO/combinatorial scores, produces the deltas.
[Framework definition / experimental protocol] Framework and methods sections: the three benchmarks, exact coalition distributions, removal protocols, target metrics, and statistical procedures (error bars, significance tests, number of runs) are not specified. These omissions prevent verification that the reported gains are attributable to the attribution method rather than to unspecified experimental choices.
[Attribution via model replacement] Model-replacement attribution section: the claim that attribution scores correctly identify agents whose replacement improves performance rests on the untested assumption that low-attribution agents are the ones whose backbone change produces the observed deltas; a direct comparison of replacement effects conditioned on attribution rank is required to establish this.
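The controls requested in these comments could be organized along the following lines: apply the same replacement to the bottom-k, top-k, and randomly chosen agents and compare the resulting metric deltas against the unmodified system. The `evaluate` callback, the attribution scores, and the per-agent swap effects below are all hypothetical placeholders for real benchmark runs.

```python
import random


def pick_targets(scores, k, strategy, rng):
    """Choose which agents to replace under a given control strategy."""
    ranked = sorted(scores, key=scores.get)  # ascending attribution
    if strategy == "bottom":
        return set(ranked[:k])
    if strategy == "top":
        return set(ranked[-k:])
    if strategy == "random":
        return set(rng.sample(ranked, k))
    raise ValueError(f"unknown strategy: {strategy}")


def run_controls(scores, k, evaluate, n_random=20, seed=0):
    """Metric delta versus the unmodified system for each strategy."""
    rng = random.Random(seed)
    baseline = evaluate(set())
    deltas = {
        s: evaluate(pick_targets(scores, k, s, rng)) - baseline
        for s in ("bottom", "top")
    }
    random_deltas = [
        evaluate(pick_targets(scores, k, "random", rng)) - baseline
        for _ in range(n_random)
    ]
    deltas["random_mean"] = sum(random_deltas) / n_random
    return deltas


# Hypothetical attribution scores and hypothetical effect of swapping each agent.
SCORES = {"planner": 0.40, "solver": 0.20, "critic": -0.02, "retriever": 0.00}
SWAP_EFFECT = {"planner": -0.30, "solver": -0.10, "critic": 0.05, "retriever": 0.02}


def toy_evaluate(replaced):
    """Stand-in for a benchmark run with the given agents replaced."""
    return 0.6 + sum(SWAP_EFFECT[a] for a in replaced)


results = run_controls(SCORES, k=2, evaluate=toy_evaluate)
print(results)
```

If the attribution ranking carries real signal, the bottom-k delta should clearly exceed the random mean, and the top-k delta should fall below it; similar deltas across all three conditions would suggest that the swap itself, not the ranking, explains the gains.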

minor comments (2)

[Framework formalization] Notation for the cooperative-game parameters (coalition distribution, removal protocol) would benefit from a small concrete example immediately after the formal definition.
[Medical MAS case study] The medical-MAS audit would be clearer if the specific agent roles, the ethics metric, and the intervention procedure were tabulated.
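A tabulation of the kind suggested for the medical case study might pair each role's attribution on the two metrics and flag roles that hurt ethics while adding little accuracy, which is the selection rule described in the Figure 7 excerpt. The role labels, scores, and the `acc_tol` threshold below are hypothetical and are not values from the paper.

```python
def intervention_targets(acc_attr, eth_attr, acc_tol=0.01):
    """Roles with negative ethics attribution and near-zero accuracy attribution."""
    return sorted(
        role
        for role in acc_attr
        if eth_attr[role] < 0 and acc_attr[role] <= acc_tol
    )


# Hypothetical per-role attribution scores on the two target metrics.
ACCURACY_ATTR = {"agent_a": 0.12, "agent_b": 0.005, "agent_c": 0.0, "agent_d": 0.08}
ETHICS_ATTR = {"agent_a": 0.03, "agent_b": -0.04, "agent_c": -0.02, "agent_d": -0.01}

targets = intervention_targets(ACCURACY_ATTR, ETHICS_ATTR)

print(f"{'role':<10}{'accuracy':>10}{'ethics':>10}{'target':>8}")
for role in ACCURACY_ATTR:
    flag = "yes" if role in targets else "no"
    print(f"{role:<10}{ACCURACY_ATTR[role]:>10.3f}{ETHICS_ATTR[role]:>10.3f}{flag:>8}")
```

Keeping the two attribution columns side by side makes the decoupling visible: a role can sit near zero on one metric and be clearly negative on the other.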

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. The major comments identify gaps in experimental controls and protocol details that we will address through revisions to strengthen the manuscript. We respond point-by-point below.


Referee: [Abstract / experimental results] Abstract and experimental results sections: the headline claim that substituting low-contribution agents yields up to 17% performance and 35% cost gains is load-bearing for the utility of the attribution framework, yet no controls (replacement of high-attribution agents, random agents, or matched non-attribution baselines) are described. Without these, it remains possible that any model substitution, independent of the LOO/combinatorial scores, produces the deltas.

Authors: We agree that controls are required to establish that gains are attributable to the attribution method rather than generic substitution. In the revision we will add experiments replacing high-attribution agents, random agents, and matched non-attribution baselines, reporting performance and cost deltas for each condition across the benchmarks. revision: yes
Referee: [Framework definition / experimental protocol] Framework and methods sections: the three benchmarks, exact coalition distributions, removal protocols, target metrics, and statistical procedures (error bars, significance tests, number of runs) are not specified. These omissions prevent verification that the reported gains are attributable to the attribution method rather than to unspecified experimental choices.

Authors: We acknowledge these details were insufficiently specified. The revised manuscript will expand the methods section with a dedicated experimental protocol subsection that explicitly lists the three benchmarks, coalition distributions, removal protocols, target metrics, number of runs, error bar computation, and significance tests. revision: yes
Referee: [Attribution via model replacement] Model-replacement attribution section: the claim that attribution scores correctly identify agents whose replacement improves performance rests on the untested assumption that low-attribution agents are the ones whose backbone change produces the observed deltas; a direct comparison of replacement effects conditioned on attribution rank is required to establish this.

Authors: We agree that conditioning replacement effects on attribution rank is needed to validate the assumption. The revision will include a new analysis (table or plot) comparing performance and cost changes when replacing agents grouped by low, medium, and high attribution scores. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical gains measured on external benchmarks


The paper defines a cooperative-game attribution framework via coalition distribution, removal protocol, and target metric, then applies LOO/combinatorial variants and model-replacement intervention to three benchmarks. The reported 17%/35% deltas are observed task outcomes after substitution, not quantities forced by the definitions themselves or by any fitted parameter renamed as prediction. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing premises. The derivation therefore remains self-contained against the external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; cannot enumerate free parameters, axioms, or invented entities without the full manuscript.

pith-pipeline@v0.9.1-grok · 5743 in / 1101 out tokens · 33009 ms · 2026-06-29T14:33:36.834336+00:00 · methodology


Reference graph

Works this paper leans on

2 extracted references · 2 canonical work pages

[1]

Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein

Workbench: a benchmark dataset for agents in a realistic workplace setting. arXiv preprint arXiv:2405.00823. Xiangru Tang, Anni Zou, Zhuosheng Zhang, Ziming Li, Yilun Zhao, Xingyao Zhang, Arman Cohan, and Mark Gerstein. 2024. Medagents: Large language models as collaborators for zero-shot medical reasoning. In Findings of the Association for Computation...

work page arXiv 2024
[2]

calendar

Medethicsqa: A comprehensive question answering benchmark for medical ethics evaluation of LLMs. arXiv preprint arXiv:2506.22808. Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, and 1 others. 2024. Autogen: Enabling next-gen LLM applications via multi-agent conversations. In Fir...

work page arXiv 2024
