arXiv: 2606.20629 · v1 · pith:I2OK3RHEnew · submitted 2026-05-28 · 💻 cs.MA · cs.AI · cs.LG

Specialize Roles, Mix Deployments: Pushing the Cost-Accuracy Frontier of LLM Agent Teams

Pith reviewed 2026-06-28 23:51 UTC · model grok-4.3

classification 💻 cs.MA cs.AI cs.LG
keywords: LLM agents, multi-role teams, cost-accuracy trade-off, heterogeneous deployment, role specialization, agent benchmarks, hybrid hosting, Pareto frontier

The pith

Heterogeneous LLM agent teams with specialized roles and mixed deployments reach higher accuracy at the same cost or match top accuracy at up to 12 times lower cost than uniform teams.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests multi-role LLM agent teams where work is split among planner, executor, verifier and similar roles. It shows that assigning different models to different roles, and choosing a hosting option per role, produces teams that sit on a better cost-accuracy frontier than teams that use the same model for every role. Existing benchmarks fix the model or the configuration, so they give little help in choosing deployments that balance expense and performance. The authors supply a new benchmark, AgentCARD, that measures role assignments and deployment mixes across domains and supports adding more roles or new models over time.

Core claim

Heterogeneous teams consistently occupy the cost-accuracy frontier. They improve accuracy by up to 44% over cost-equivalent homogeneous teams, or match the strongest homogeneous team at up to 12× lower per-task cost through hybrid deployment. The best role assignment is domain-dependent: some domains are planner-bottlenecked, while others are executor-bottlenecked. The benchmark extends to workflows with additional roles such as verification and supports continual evaluation as new domains and team structures emerge.

What carries the argument

AgentCARD, a role-aware benchmark suite that combines a role-decomposed evaluation harness, a unified API and self-hosted cost model, Pareto-frontier analysis, and Shapley-based diagnostics to locate role bottlenecks.
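
To make the Pareto-frontier step concrete, here is a minimal sketch of how non-dominated team configurations can be selected from (cost, accuracy) measurements. It is not AgentCARD's code: the `TeamResult` type, the configuration names, and every number are hypothetical and chosen only for illustration.

```python
from typing import NamedTuple


class TeamResult(NamedTuple):
    name: str        # hypothetical label for a role-to-model assignment
    cost: float      # per-task cost in USD (illustrative value)
    accuracy: float  # task accuracy in [0, 1] (illustrative value)


def pareto_frontier(results: list[TeamResult]) -> list[TeamResult]:
    """Return the configurations that no other configuration dominates.

    A dominates B when A costs no more than B, is at least as accurate,
    and is strictly better on at least one of the two axes.
    """
    # Ascending cost; among equal costs, the most accurate comes first.
    ordered = sorted(results, key=lambda r: (r.cost, -r.accuracy))
    frontier = []
    best_accuracy = float("-inf")
    for r in ordered:
        # Keep a point only if it beats every cheaper-or-equal point seen so far.
        # Exact duplicates of a kept point are dropped as well.
        if r.accuracy > best_accuracy:
            frontier.append(r)
            best_accuracy = r.accuracy
    return frontier


# Made-up measurements, not results from the paper.
teams = [
    TeamResult("homog-small", 0.02, 0.41),
    TeamResult("homog-large", 0.60, 0.72),
    TeamResult("hetero-A", 0.05, 0.58),
    TeamResult("hetero-B", 0.20, 0.72),
    TeamResult("hetero-C", 0.30, 0.66),
]
frontier_names = [r.name for r in pareto_frontier(teams)]
```

In this toy data the large homogeneous team drops off the frontier because a heterogeneous team reaches the same accuracy at lower cost, which is the shape of result the paper reports.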

If this is right

  • Accuracy gains or cost savings from heterogeneous assignment vary by domain.
  • Hybrid deployment (some roles on API, others self-hosted) can match strong uniform teams at substantially lower per-task cost.
  • Some domains require stronger planners while others require stronger executors.
  • The same measurement approach can evaluate workflows that add verification or other roles.
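
The hybrid-deployment point can be illustrated with a small per-task cost calculation. This is a sketch under stated assumptions, not the paper's unified cost model, whose parameterization is not given in this summary. The class names, prices, token counts, and throughput figure are all hypothetical, and the self-hosted branch assumes fully utilized GPUs and charges input and output tokens at the same rate.

```python
from dataclasses import dataclass


@dataclass
class ApiHosting:
    usd_per_million_input: float   # illustrative price, not from the paper
    usd_per_million_output: float  # illustrative price, not from the paper

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        return (input_tokens * self.usd_per_million_input
                + output_tokens * self.usd_per_million_output) / 1_000_000


@dataclass
class SelfHosting:
    usd_per_gpu_hour: float    # illustrative rental rate
    num_gpus: int
    tokens_per_second: float   # assumed aggregate throughput at the chosen parallelism

    def cost(self, input_tokens: int, output_tokens: int) -> float:
        # Simplification: all tokens consume GPU time at the same rate,
        # and the GPUs are never idle.
        seconds = (input_tokens + output_tokens) / self.tokens_per_second
        return seconds / 3600 * self.usd_per_gpu_hour * self.num_gpus


def team_cost(roles: dict) -> float:
    """Sum per-role serving cost; each value is (hosting, input_tokens, output_tokens)."""
    return sum(hosting.cost(i, o) for hosting, i, o in roles.values())


# Hypothetical per-task token usage for a two-role team.
hybrid = {
    "planner": (ApiHosting(3.0, 15.0), 2_000, 500),
    "executor": (SelfHosting(2.0, 2, 4_000.0), 20_000, 4_000),
}
all_api = {
    "planner": (ApiHosting(3.0, 15.0), 2_000, 500),
    "executor": (ApiHosting(3.0, 15.0), 20_000, 4_000),
}
hybrid_cost = team_cost(hybrid)
all_api_cost = team_cost(all_api)
```

With these made-up numbers the token-heavy executor dominates the bill, so moving only that role to self-hosting lowers the per-task cost. Idle GPU time, which this sketch ignores, would narrow the gap.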

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Operators could monitor running tasks and reassign models to roles when a bottleneck appears.
  • Cost models will require periodic updates as API prices and hardware costs shift.
  • The approach could be extended to measure interactions among more than three roles in longer workflows.

Load-bearing premise

The role-decomposed evaluation harness accurately isolates each role's contribution: there are no large unmodeled interactions between roles, and deployment choices do not change how tasks are decomposed.
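
A small worked example shows why this premise matters for a Shapley-style diagnostic. The sketch below computes exact Shapley values for a two-role team, treating "upgrade this role from a weak to a strong model" as a player in a cooperative game. The value function and its accuracy numbers are hypothetical; the paper's actual diagnostic may be defined differently.

```python
from itertools import permutations


def shapley_values(roles, value):
    """Exact Shapley values: average each role's marginal gain over all upgrade orders."""
    totals = {role: 0.0 for role in roles}
    orderings = list(permutations(roles))
    for order in orderings:
        upgraded = frozenset()
        for role in order:
            with_role = upgraded | {role}
            totals[role] += value(with_role) - value(upgraded)
            upgraded = with_role
    return {role: total / len(orderings) for role, total in totals.items()}


# Hypothetical accuracy when the listed roles use the strong model
# and all other roles use the weak one. Not data from the paper.
accuracy = {
    frozenset(): 0.40,
    frozenset({"planner"}): 0.60,
    frozenset({"executor"}): 0.45,
    frozenset({"planner", "executor"}): 0.70,
}

phi = shapley_values(["planner", "executor"], lambda s: accuracy[s])

# Interaction term: the joint gain that neither upgrade explains alone.
interaction = (accuracy[frozenset({"planner", "executor"})]
               - accuracy[frozenset({"planner"})]
               - accuracy[frozenset({"executor"})]
               + accuracy[frozenset()])
```

Here the planner receives 0.225 and the executor 0.075 of the 0.30 total gain, and the 0.05 interaction term is split evenly between them. The attribution always sums to the total gain, so a large interaction is silently absorbed into per-role scores rather than reported, which is why the premise above carries weight.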

What would settle it

A re-run of the benchmark on the same tasks but with an explicit model of role interactions that changes the measured accuracy or cost ordering of heterogeneous versus homogeneous teams by more than a few percent.

Figures

Figures reproduced from arXiv: 2606.20629 by Edoardo Ponti, Liang Cheng, Li Dong, Luo Mai, Wenda Li, Yeqi Huang, Yinsicheng Jiang, Yufan Zhao, Zhan Lu.

Figure 1: Overview of AgentCARD. Agent teams, such as planner–executor and planner–verifier–…
Figure 2: Pairwise synergy ∆acc(p, q) in %. Lower triangle: planner p with executor q; upper triangle: reversed pair. Warmer cells indicate positive synergy, colder cells negative synergy.
[…] user-specified parallelism level using the throughput surface in Appendix A. Reported costs cover LLM serving only; tool execution and orchestration overhead are < 5% of the LLM cost and are excluded. Benchmark use cases. AgentCAR…
Figure 3: Cost–accuracy scatter plot for every (planner…
Figure 4: Shapley decomposition of role criticality under three deployment modes of weak-to-strong…

Original abstract

LLM agents are increasingly deployed as multi-role teams, where tasks are divided across specialized roles such as planner, executor, and verifier. In these systems, cost and accuracy are no longer properties of a single model: they depend on which model fills each role and where it is hosted, including API, self-hosted, and hybrid deployment. Existing agentic benchmarks typically evaluate fixed models or fixed agent configurations, and therefore offer limited guidance for cost-accuracy-optimal deployment. We introduce AgentCARD, a role-aware benchmark suite for evaluating LLM agent teams across role assignment and deployment mode. AgentCARD combines a role-decomposed evaluation harness, a unified API/self-hosted cost model, Pareto-frontier analysis, and a Shapley-based diagnostic for identifying role bottlenecks. Our evaluation shows that heterogeneous teams consistently occupy the cost-accuracy frontier. They improve accuracy by up to 44% over cost-equivalent homogeneous teams, or match the strongest homogeneous team at up to 12× lower per-task cost through hybrid deployment. We further find that the best role assignment is domain-dependent: some domains are planner-bottlenecked, while others are executor-bottlenecked. Finally, AgentCARD extends beyond planner–executor teams to workflows with additional roles such as verification, and supports continual evaluation as new domains and team structures emerge. Our code is released at: https://github.com/Auto-CAP/AgentCAP

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces AgentCARD, a role-aware benchmark suite for LLM agent teams that evaluates role assignment (planner, executor, verifier) and deployment modes (API, self-hosted, hybrid) using a role-decomposed harness, unified cost model, Pareto-frontier analysis, and Shapley diagnostics. It claims that heterogeneous teams occupy the cost-accuracy frontier, improving accuracy by up to 44% over cost-equivalent homogeneous teams or matching the strongest homogeneous team at up to 12× lower per-task cost, with domain-dependent bottlenecks (planner- vs. executor-bottlenecked). The work extends to additional roles and releases code for continual evaluation.

Significance. If the results hold under rigorous controls, the paper offers actionable guidance for cost-accuracy optimization in multi-role LLM agent deployments, a practically relevant problem as agent teams proliferate. The combination of Pareto analysis with Shapley value diagnostics for bottleneck identification is a useful methodological contribution, and the open benchmark/code supports reproducibility and extension to new domains.

major comments (2)
  1. [Abstract and Evaluation section] The central quantitative claims (up to 44% accuracy improvement and 12× cost reduction for heterogeneous teams) are presented without any description of task selection criteria, statistical significance testing, error bars, baseline definitions, or data exclusion rules. This information is required to evaluate whether the Pareto-frontier occupancy result is supported.
  2. [Role-decomposed evaluation harness, described in the methods] The harness is used to attribute performance to per-role model choice and deployment, but the manuscript provides no explicit ablation or test for cross-role interactions (e.g., whether planner quality alters executor behavior) or deployment effects on task decomposition itself. Without such a test, the Shapley diagnostics and frontier claims risk overstating specialization benefits.
minor comments (2)
  1. [Abstract] The abstract states that the code is released at a GitHub link, but the manuscript should include a brief note on the repository contents (e.g., which datasets and scripts are provided) to aid immediate reproducibility.
  2. [Cost model section] Notation for cost model and deployment modes could be clarified with a small table summarizing API vs. self-hosted vs. hybrid per role.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the rigor of our claims. We address each major point below and indicate planned revisions.

Point-by-point responses
  1. Referee: [Abstract and Evaluation section] The central quantitative claims (up to 44% accuracy improvement and 12× cost reduction for heterogeneous teams) are presented without any description of task selection criteria, statistical significance testing, error bars, baseline definitions, or data exclusion rules. This information is required to evaluate whether the Pareto-frontier occupancy result is supported.

    Authors: We agree these details are necessary for evaluating the claims. The full manuscript's Evaluation section describes the AgentCARD benchmark domains and task sampling but does not include statistical tests, error bars, or explicit exclusion rules. In the revision we will add: task selection criteria with domain coverage details; paired t-tests or Wilcoxon tests with p-values for the 44% and 12× improvements; error bars (standard error) on all Pareto plots and tables; explicit baseline definitions for homogeneous teams; and any data exclusion criteria applied. These additions will directly support the frontier occupancy result. revision: yes

  2. Referee: [Role-decomposed evaluation harness, described in the methods] The harness is used to attribute performance to per-role model choice and deployment, but the manuscript provides no explicit ablation or test for cross-role interactions (e.g., whether planner quality alters executor behavior) or deployment effects on task decomposition itself. Without such a test, the Shapley diagnostics and frontier claims risk overstating specialization benefits.

    Authors: The role-decomposed harness isolates per-role assignments by design, and Shapley values are computed over the full combinatorial space of role-model pairs to capture marginal contributions. However, we did not run dedicated ablations that vary planner quality while holding executor fixed or that measure changes in task decomposition under different deployments. We will add a targeted ablation on two domains (one planner-bottlenecked, one executor-bottlenecked) to quantify interaction effects and any decomposition shifts; if results show negligible interactions we will report them as supporting evidence for the current claims. This constitutes a partial revision as full cross-role interaction testing across all domains would require substantial additional compute. revision: partial

Circularity Check

0 steps flagged

No significant circularity: empirical benchmark with independent evaluations

full rationale

The paper introduces AgentCARD as a new role-aware benchmark suite with role-decomposed harness, cost model, Pareto analysis, and Shapley diagnostics. All central claims (heterogeneous teams on frontier, accuracy/cost gains, domain-dependent bottlenecks) rest on direct empirical measurements across models, roles, and deployments rather than any derivation, fitted parameter renamed as prediction, or self-citation chain. No equations, uniqueness theorems, or ansatzes are present that reduce to inputs by construction. The evaluation harness and diagnostics are externally falsifiable via the released code and new task data.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are stated. The unified cost model is referenced but its parameterization is not detailed.

