ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

Gaurav Mittal; Jintao Huang; Xiaomin Li; Yu Hu

arxiv: 2606.05548 · v1 · pith:LPNZIOKBnew · submitted 2026-06-04 · 💻 cs.SE · cs.AI

ADK Arena: Evaluating Agent Development Kits via LLM-as-a-Developer

Jintao Huang , Xiaomin Li , Gaurav Mittal , Yu Hu This is my paper

Pith reviewed 2026-06-28 00:51 UTC · model grok-4.3

classification 💻 cs.SE cs.AI

keywords Agent Development KitsLLM-as-a-DeveloperAPI usability evaluationautonomous agentsframework benchmarkingcode generationSWE-bench

0 comments

The pith

Holding the developer fixed as an LLM coding agent turns ADK framework choice into a measurable proxy for API usability and agent performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LLM-as-a-Developer, a method that replaces human coders with an LLM agent which learns each framework's API from documentation, writes code, and fixes it through a validation loop until tests pass. By keeping the developer the same across all frameworks, differences in generation effort and final agent success become direct measures of framework quality. The authors built ADK Arena to apply this to all 51 popular Python ADKs across four benchmarks. Results show generation succeeds 57 percent of the time with costs varying 5.6 times, no single framework wins overall, and performance stays similar whether the LLM has raw source, documentation, or nothing at all.

Core claim

By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. Evaluating 204 agent-benchmark pairs across 51 frameworks finds generation succeeds for 57 percent of runs with 5.6 times cost variation, no single framework dominates, and genuine framework usage stays in a narrow 28-40 percent band regardless of information source.

What carries the argument

LLM-as-a-Developer methodology: an LLM coding agent that learns each ADK API from documentation and iteratively repairs generated code in a validate-and-feedback loop until tests pass.

If this is right

Framework effectiveness can be ranked by a combination of generation cost and benchmark resolution rate.
Best single-benchmark agents from some ADKs resolve up to 80 percent of tasks and can outperform general frontier coding agents at lower cost.
Median frameworks resolve only 32 percent of tasks, showing wide variation in practical utility.
Documentation, source code, and parametric knowledge act as largely substitutable inputs rather than any one being required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The method could be reused to track how framework updates change usability over time without recruiting human testers.
If the proxy holds, teams choosing an ADK could run a short LLM evaluation on their target benchmarks before committing.
The narrow band of framework usage across information sources suggests effort to improve documentation may yield smaller gains than improving core API design.
Extending the arena to non-Python frameworks or additional benchmarks would test whether the substitutability finding generalizes.

Load-bearing premise

The LLM coding agent serves as a fair and unbiased proxy for human developers when learning and using each framework's API from documentation.

What would settle it

Human developers building the same agents with the same frameworks produce different relative success rates or cost rankings than the LLM results.

Figures

Figures reproduced from arXiv: 2606.05548 by Gaurav Mittal, Jintao Huang, Xiaomin Li, Yu Hu.

**Figure 2.** Figure 2: The ADK Arena pipeline. 1 Environment Setup: collect repos, build Docker images. 2 Agent Generation: explore docs, write code, validate and repair. 3 Benchmarking: execute in containers, score output. frameworks, analogous to controlled SE experiments that fix expertise while varying tools (Ko et al., 2004; Murphy et al., 2006). We mitigate the training-data confound through a three-condition ablation (§4.… view at source ↗

**Figure 3.** Figure 3: Inter-ADK dependency network. A Ecosystem Analysis Beyond evaluating what ADK frameworks do on benchmarks, we examine how they relate to each other and to the broader developer community. We analyze inter-framework dependencies (§A.1), which reveal supply-chain risks and hidden coupling, and downstream adoption (§A.2), which quantifies real-world usage and market concentration. A.1 Inter-ADK Dependencies M… view at source ↗

read the original abstract

The rapid proliferation of Agent Development Kits (ADKs), SDK-level frameworks for building LLM-powered autonomous agents, has outpaced any empirical understanding of how framework choice affects agent performance. We propose \textbf{LLM-as-a-Developer}, a methodology that replaces human developers with an LLM coding agent that learns each framework's API from documentation, writes agent code, and iteratively repairs it through a validate-and-feedback loop until tests pass. By holding the developer constant and varying only the framework, generation effort becomes a quantitative proxy for API usability and the resulting agents provide a controlled measure of framework effectiveness. We implement this in \textbf{ADK Arena}, a fully automated pipeline with per-framework Docker isolation, a three-level validation pipeline, and benchmark adapters for SWE-bench, $\tau^2$-bench, Terminal-Bench, and MCP-Atlas. Evaluating all 51 popular Python ADK frameworks (204 agent--benchmark pairs), we find that: (1)~generation succeeds for 57\% of runs, and its cost varies 5.6$\times$ across frameworks (\$0.6 to \$3.4 per agent), a quantitative proxy for API complexity, though cost alone does not predict success; (2)~no single framework dominates: the best single-benchmark ADK agents resolve up to 80\% of tasks and can even \emph{beat} general-purpose frontier coding agents at a fraction of the cost, yet the median framework resolves only 32\%; (3)~across information-source ablations, genuine framework usage stays within a narrow 28--40\% band (highest with raw source access and still 33\% with no reference material at all), indicating that documentation, source code, and parametric knowledge are largely substitutable rather than any one being a hard bottleneck.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper scales an LLM coding proxy across 51 ADKs and reports concrete differences in cost and success, but the proxy's link to actual human developer effort stays untested.

read the letter

The main takeaway is that this work runs a controlled LLM agent through documentation, code writing, and repair loops on 51 Python ADKs, then measures success rates and costs on four benchmarks. It finds 57% overall success, 5.6x cost spread, no dominant framework, and that documentation, source, and model knowledge are roughly interchangeable.

What stands out is the engineering: per-framework Docker isolation, three-level validation, and adapters for SWE-bench and the others. That setup lets them run the same developer proxy on everything at once, which prior smaller studies did not do at this scale. The information-source ablations are also a clean addition.

The central weakness is exactly the one the stress-test flags. The numbers are presented as proxies for API usability and framework effectiveness because the developer is held fixed, yet there is no human baseline or correlation check showing that frameworks cheap or hard for the LLM track the same way for people. LLM-specific factors like training data overlap or error tolerance could drive the results instead. The abstract gives point estimates without error bars or details on how exclusions were handled, so the quantitative claims rest on an assumption that is not yet grounded.

This paper is for people who build or select agent frameworks and want a first broad map of the space. It is worth sending to referees because the scale and automation are real contributions even if the proxy interpretation needs tighter validation in revision.

Referee Report

2 major / 2 minor

Summary. The paper proposes the LLM-as-a-Developer methodology, implemented in ADK Arena, to evaluate 51 Python ADK frameworks by having an LLM coding agent learn APIs from documentation, generate agent code, and iteratively repair it via a validate-and-feedback loop. Holding the developer constant, it treats generation effort and success as proxies for API usability and framework effectiveness, reporting 57% overall success across 204 agent-benchmark pairs on SWE-bench, τ²-bench, Terminal-Bench, and MCP-Atlas; 5.6× cost variation ($0.6–$3.4); no dominant framework (best reaches 80%, median 32%); and substitutability of information sources (framework usage 28–40% across ablations).

Significance. If the LLM proxy assumption holds, the work supplies the first large-scale, controlled empirical comparison of ADK frameworks, demonstrating substantial variability in usability and effectiveness while showing that documentation, source, and parametric knowledge are largely interchangeable. The automated pipeline and benchmark adapters could become a reusable evaluation harness for the rapidly growing ADK ecosystem.

major comments (2)

[§3 (Methodology)] §3 (Methodology) and Abstract: The central claim that generation effort and success rates serve as a quantitative proxy for API usability and framework effectiveness rests on the untested assumption that LLM behavior (prompted from docs with validate-and-repair) tracks human developer learning curves and error patterns. No human baseline comparison or correlation study is reported, so LLM-specific factors such as training-data overlap or tolerance for verbose traces could drive the observed 57% success and 5.6× cost range rather than framework properties.
[Abstract and §5 (Results)] Abstract and §5 (Results): Concrete performance numbers (57% success, 5.6× cost range, 28–40% usage band, up to 80% task resolution) are presented without reported error bars, exact definitions of the three-level validation pipeline, or details on how post-hoc framework exclusions were handled, undermining assessment of statistical robustness and reproducibility of the substitutability and non-dominance claims.

minor comments (2)

[§5 (Results)] The abstract states that cost alone does not predict success, but no scatter plot or correlation coefficient is referenced in the results section to support this observation.
Docker isolation and benchmark adapters are mentioned but lack a high-level architecture diagram or pseudocode for the validate-and-repair loop, which would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive comments. We address each major point below, agreeing where clarifications are needed and defending the methodology's design where appropriate. Revisions will focus on improving reproducibility and acknowledging limitations.

read point-by-point responses

Referee: [§3 (Methodology)] §3 (Methodology) and Abstract: The central claim that generation effort and success rates serve as a quantitative proxy for API usability and framework effectiveness rests on the untested assumption that LLM behavior (prompted from docs with validate-and-repair) tracks human developer learning curves and error patterns. No human baseline comparison or correlation study is reported, so LLM-specific factors such as training-data overlap or tolerance for verbose traces could drive the observed 57% success and 5.6× cost range rather than framework properties.

Authors: We acknowledge that the LLM-as-a-Developer methodology relies on an assumption that LLM coding behavior provides a useful proxy for relative framework usability. The design intentionally holds the 'developer' constant (same LLM, same prompting strategy) to isolate framework effects, which would be infeasible with variable human developers at this scale. While LLM-specific factors could influence absolute values, relative rankings and cost variations across frameworks should still reflect API differences. We will add an explicit limitations paragraph in the discussion section noting the lack of human correlation data and calling for future work. No change to the core claims or results is warranted, as the proxy enables the first large-scale controlled comparison. revision: partial
Referee: [Abstract and §5 (Results)] Abstract and §5 (Results): Concrete performance numbers (57% success, 5.6× cost range, 28–40% usage band, up to 80% task resolution) are presented without reported error bars, exact definitions of the three-level validation pipeline, or details on how post-hoc framework exclusions were handled, undermining assessment of statistical robustness and reproducibility of the substitutability and non-dominance claims.

Authors: We agree these details are essential for reproducibility. In the revised manuscript we will: (1) add error bars (standard error across runs) to all aggregate metrics in the abstract, §5, and figures; (2) expand §3 with a precise description and pseudocode of the three-level validation pipeline (syntax check, unit tests, benchmark execution); (3) clarify post-hoc exclusions, including the exact number of frameworks removed, the criteria (e.g., installation failures, API incompatibilities), and sensitivity analysis showing results are robust to their inclusion. These changes will directly address concerns about statistical robustness. revision: yes

standing simulated objections not resolved

Conducting a human baseline comparison or correlation study to validate the LLM proxy assumption, which would require new large-scale experiments outside the current scope.

Circularity Check

0 steps flagged

No circularity: empirical evaluation is self-contained against external benchmarks

full rationale

The paper defines an experimental methodology (LLM-as-a-Developer) that runs an LLM coding agent on each framework's documentation to produce agents, then measures generation cost, success rate, and benchmark performance directly. No equations, fitted parameters, or derivations appear in the abstract or described pipeline; generation effort is observed output, not a prediction derived from itself. The claim that effort proxies usability follows from the controlled experimental design (developer held constant) rather than reducing to a self-referential fit or self-citation. Information-source ablations report observed usage rates (28-40%) without invoking prior author work as a uniqueness theorem. The evaluation uses external benchmarks (SWE-bench, etc.) and reports raw statistics, making the central results independent of any internal loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption that an LLM can act as a constant developer proxy whose success and cost differences isolate framework effects; no free parameters or invented entities are introduced in the abstract.

axioms (1)

domain assumption An LLM coding agent that learns from documentation and repairs via validate-and-feedback is a valid constant proxy for measuring framework API usability and effectiveness.
This premise is required for generation effort and agent performance to serve as quantitative proxies for framework quality.

pith-pipeline@v0.9.1-grok · 5866 in / 1331 out tokens · 28760 ms · 2026-06-28T00:51:42.608822+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

25 extracted references · 19 linked inside Pith

[1]

MCP-Atlas: A large-scale benchmark for tool-use competency with real MCP servers.arXiv preprint arXiv:2602.00933,

Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, and Bing Liu. MCP-Atlas: A large-scale benchmark for tool-use competency with real MCP servers.arXiv preprint arXiv:2602.00933,

Pith/arXiv arXiv
[2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901,

1901
[3]

Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond´e de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

Pith/arXiv arXiv
[4]

AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors.arXiv preprint arXiv:2308.10848,

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors.arXiv preprint arXiv:2308.10848,

Pith/arXiv arXiv
[5]

Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and Ahmed E. Hassan. An empirical study of testing practices in open source AI agent frameworks and agentic applications.arXiv preprint arXiv:2509.19185,

Pith/arXiv arXiv
[6]

MetaGPT: Meta programming for a multi-agent collaborative framework.arXiv preprint arXiv:2308.00352,

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework.arXiv preprint arXiv:2308.00352,

Pith/arXiv arXiv
[7]

CAMEL: Communicative agents for “mind” exploration of large language model society.Advances in Neural Information Processing Systems, 36, 2023a

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large language model society.Advances in Neural Information Processing Systems, 36, 2023a. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Ak...

Pith/arXiv arXiv
[8]

Siddique, and Umar Farooq

Daniel Liu, Krishna Upadhyay, Vinaik Chhetri, A.B. Siddique, and Umar Farooq. A large-scale study on the development and issues of multi-agent AI systems.arXiv preprint arXiv:2601.07136,

arXiv
[9]

AgentBench: Evaluating LLMs as agents.arXiv preprint arXiv:2308.03688,

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents.arXiv preprint arXiv:2308.03688,

Pith/arXiv arXiv
[10]

Merrill, Nicholas Carlini, Alexander G

Mike A. Merrill, Nicholas Carlini, Alexander G. Shaw, Ludwig Schmidt, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868,

Pith/arXiv arXiv
[11]

GAIA: A benchmark for general AI assistants.arXiv preprint arXiv:2311.12983,

Gr´egoire Mialon, Cl´ementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants.arXiv preprint arXiv:2311.12983,

Pith/arXiv arXiv
[12]

GPT-4 technical report.arXiv preprint arXiv:2303.08774,

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774,

Pith/arXiv arXiv
[13]

Understanding multi-agent LLM frameworks: A unified benchmark and experimental analysis.arXiv preprint arXiv:2602.03128,

Abdelghny Orogat, Ana Rostam, and Essam Mansour. Understanding multi-agent LLM frameworks: A unified benchmark and experimental analysis.arXiv preprint arXiv:2602.03128,

arXiv
[14]

Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334,

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334,

Pith/arXiv arXiv
[15]

ChatDev: Communicative agents for software development.arXiv preprint arXiv:2307.07924,

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. ChatDev: Communicative agents for software development.arXiv preprint arXiv:2307.07924,

Pith/arXiv arXiv
[16]

ToolLLM: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789,

ADK Arena 15 Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789,

Pith/arXiv arXiv
[17]

Code Llama: Open foundation models for code.arXiv preprint arXiv:2308.12950,

Baptiste Rozi`ere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code.arXiv preprint arXiv:2308.12950,

Pith/arXiv arXiv
[18]

LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

Pith/arXiv arXiv
[19]

A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), 2024a

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), 2024a. Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM ag...

arXiv
[20]

AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155,

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155,

Pith/arXiv arXiv
[21]

SWE-agent: Agent-computer interfaces enable automated software engineering.arXiv preprint arXiv:2405.15793,

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering.arXiv preprint arXiv:2405.15793,

Pith/arXiv arXiv
[22]

ReAct: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629,

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629,

Pith/arXiv arXiv
[23]

τ-bench: A benchmark for tool-agent- user interaction in real-world domains.arXiv preprint arXiv:2406.12045,

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent- user interaction in real-world domains.arXiv preprint arXiv:2406.12045,

Pith/arXiv arXiv
[24]

A Ecosystem Analysis Beyond evaluating what ADK frameworksdoon benchmarks, we examine how theyrelateto each other and to the broader developer community

Hard dependency (6) Code-only import (40) Declared-only (8) Figure 3: Inter-ADK dependency network. A Ecosystem Analysis Beyond evaluating what ADK frameworksdoon benchmarks, we examine how theyrelateto each other and to the broader developer community. We analyze inter-framework dependencies (§A.1), which reveal supply-chain risks and hidden coupling, an...

2019
[25]

We extract dependency relationships from both package metadata (pyproject.toml, setup.py, requirements.txt) and source code imports across all 51 repositories

and hidden coupling invisible to package managers (Liu et al., 2022). We extract dependency relationships from both package metadata (pyproject.toml, setup.py, requirements.txt) and source code imports across all 51 repositories. An edge from frameworkAto frameworkBis classified into three categories (test and example imports are markedoptionaland exclude...

2022

[1] [1]

MCP-Atlas: A large-scale benchmark for tool-use competency with real MCP servers.arXiv preprint arXiv:2602.00933,

Chaithanya Bandi, Ben Hertzberg, Geobio Boo, Tejas Polakam, Jeff Da, Sami Hassaan, Manasi Sharma, Andrew Park, Ernesto Hernandez, Dan Rambado, Ivan Salazar, Rafael Cruz, Chetan Rane, Ben Levin, Brad Kenstler, and Bing Liu. MCP-Atlas: A large-scale benchmark for tool-use competency with real MCP servers.arXiv preprint arXiv:2602.00933,

Pith/arXiv arXiv

[2] [2]

Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901,

1901

[3] [3]

Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Pond´e de Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code.arXiv preprint arXiv:2107.03374,

Pith/arXiv arXiv

[4] [4]

AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors.arXiv preprint arXiv:2308.10848,

Weize Chen, Yusheng Su, Jingwei Zuo, Cheng Yang, Chenfei Yuan, Chi-Min Chan, Heyang Yu, Yaxi Lu, Yi-Hsin Hung, Chen Qian, et al. AgentVerse: Facilitating multi-agent collaboration and exploring emergent behaviors.arXiv preprint arXiv:2308.10848,

Pith/arXiv arXiv

[5] [5]

Mohammed Mehedi Hasan, Hao Li, Emad Fallahzadeh, Gopi Krishnan Rajbahadur, Bram Adams, and Ahmed E. Hassan. An empirical study of testing practices in open source AI agent frameworks and agentic applications.arXiv preprint arXiv:2509.19185,

Pith/arXiv arXiv

[6] [6]

MetaGPT: Meta programming for a multi-agent collaborative framework.arXiv preprint arXiv:2308.00352,

Sirui Hong, Mingchen Zhuge, Jonathan Chen, Xiawu Zheng, Yuheng Cheng, Ceyao Zhang, Jinlin Wang, Zili Wang, Steven Ka Shing Yau, Zijuan Lin, et al. MetaGPT: Meta programming for a multi-agent collaborative framework.arXiv preprint arXiv:2308.00352,

Pith/arXiv arXiv

[7] [7]

CAMEL: Communicative agents for “mind” exploration of large language model society.Advances in Neural Information Processing Systems, 36, 2023a

Guohao Li, Hasan Abed Al Kader Hammoud, Hani Itani, Dmitrii Khizbullin, and Bernard Ghanem. CAMEL: Communicative agents for “mind” exploration of large language model society.Advances in Neural Information Processing Systems, 36, 2023a. Raymond Li, Loubna Ben Allal, Yangtian Zi, Niklas Muennighoff, Denis Kocetkov, Chenghao Mou, Marc Marone, Christopher Ak...

Pith/arXiv arXiv

[8] [8]

Siddique, and Umar Farooq

Daniel Liu, Krishna Upadhyay, Vinaik Chhetri, A.B. Siddique, and Umar Farooq. A large-scale study on the development and issues of multi-agent AI systems.arXiv preprint arXiv:2601.07136,

arXiv

[9] [9]

AgentBench: Evaluating LLMs as agents.arXiv preprint arXiv:2308.03688,

Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. AgentBench: Evaluating LLMs as agents.arXiv preprint arXiv:2308.03688,

Pith/arXiv arXiv

[10] [10]

Merrill, Nicholas Carlini, Alexander G

Mike A. Merrill, Nicholas Carlini, Alexander G. Shaw, Ludwig Schmidt, et al. Terminal-Bench: Benchmarking agents on hard, realistic tasks in command line interfaces.arXiv preprint arXiv:2601.11868,

Pith/arXiv arXiv

[11] [11]

GAIA: A benchmark for general AI assistants.arXiv preprint arXiv:2311.12983,

Gr´egoire Mialon, Cl´ementine Fourrier, Craig Swift, Thomas Wolf, Yann LeCun, and Thomas Scialom. GAIA: A benchmark for general AI assistants.arXiv preprint arXiv:2311.12983,

Pith/arXiv arXiv

[12] [12]

GPT-4 technical report.arXiv preprint arXiv:2303.08774,

OpenAI. GPT-4 technical report.arXiv preprint arXiv:2303.08774,

Pith/arXiv arXiv

[13] [13]

Understanding multi-agent LLM frameworks: A unified benchmark and experimental analysis.arXiv preprint arXiv:2602.03128,

Abdelghny Orogat, Ana Rostam, and Essam Mansour. Understanding multi-agent LLM frameworks: A unified benchmark and experimental analysis.arXiv preprint arXiv:2602.03128,

arXiv

[14] [14]

Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334,

Shishir G Patil, Tianjun Zhang, Xin Wang, and Joseph E Gonzalez. Gorilla: Large language model connected with massive APIs.arXiv preprint arXiv:2305.15334,

Pith/arXiv arXiv

[15] [15]

ChatDev: Communicative agents for software development.arXiv preprint arXiv:2307.07924,

Chen Qian, Wei Liu, Hongzhang Liu, Nuo Chen, Yufan Dang, Jiahao Li, Cheng Yang, Weize Chen, Yusheng Su, Xin Cong, et al. ChatDev: Communicative agents for software development.arXiv preprint arXiv:2307.07924,

Pith/arXiv arXiv

[16] [16]

ToolLLM: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789,

ADK Arena 15 Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. ToolLLM: Facilitating large language models to master 16000+ real-world APIs.arXiv preprint arXiv:2307.16789,

Pith/arXiv arXiv

[17] [17]

Code Llama: Open foundation models for code.arXiv preprint arXiv:2308.12950,

Baptiste Rozi`ere, Jonas Gehring, Fabian Gloeckle, Sten Sootla, Itai Gat, Xiaoqing Ellen Tan, Yossi Adi, Jingyu Liu, Romain Sauvestre, Tal Remez, et al. Code Llama: Open foundation models for code.arXiv preprint arXiv:2308.12950,

Pith/arXiv arXiv

[18] [18]

LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timoth´ee Lacroix, Baptiste Rozi`ere, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models.arXiv preprint arXiv:2302.13971,

Pith/arXiv arXiv

[19] [19]

A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), 2024a

Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents.Frontiers of Computer Science, 18(6), 2024a. Xingyao Wang, Yangyi Chen, Lifan Yuan, Yizhe Zhang, Yunzhu Li, Hao Peng, and Heng Ji. Executable code actions elicit better LLM ag...

arXiv

[20] [20]

AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155,

Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Beibin Li, Erkang Zhu, Li Jiang, Xiaoyun Zhang, Shaokun Zhang, Jiale Liu, et al. AutoGen: Enabling next-gen LLM applications via multi-agent conversation.arXiv preprint arXiv:2308.08155,

Pith/arXiv arXiv

[21] [21]

SWE-agent: Agent-computer interfaces enable automated software engineering.arXiv preprint arXiv:2405.15793,

John Yang, Carlos E Jimenez, Alexander Wettig, Kilian Lieret, Shunyu Yao, Karthik Narasimhan, and Ofir Press. SWE-agent: Agent-computer interfaces enable automated software engineering.arXiv preprint arXiv:2405.15793,

Pith/arXiv arXiv

[22] [22]

ReAct: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629,

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. ReAct: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629,

Pith/arXiv arXiv

[23] [23]

τ-bench: A benchmark for tool-agent- user interaction in real-world domains.arXiv preprint arXiv:2406.12045,

Shunyu Yao, Noah Shinn, Pedram Razavi, and Karthik Narasimhan. τ-bench: A benchmark for tool-agent- user interaction in real-world domains.arXiv preprint arXiv:2406.12045,

Pith/arXiv arXiv

[24] [24]

A Ecosystem Analysis Beyond evaluating what ADK frameworksdoon benchmarks, we examine how theyrelateto each other and to the broader developer community

Hard dependency (6) Code-only import (40) Declared-only (8) Figure 3: Inter-ADK dependency network. A Ecosystem Analysis Beyond evaluating what ADK frameworksdoon benchmarks, we examine how theyrelateto each other and to the broader developer community. We analyze inter-framework dependencies (§A.1), which reveal supply-chain risks and hidden coupling, an...

2019

[25] [25]

We extract dependency relationships from both package metadata (pyproject.toml, setup.py, requirements.txt) and source code imports across all 51 repositories

and hidden coupling invisible to package managers (Liu et al., 2022). We extract dependency relationships from both package metadata (pyproject.toml, setup.py, requirements.txt) and source code imports across all 51 repositories. An edge from frameworkAto frameworkBis classified into three categories (test and example imports are markedoptionaland exclude...

2022