arxiv: 2604.09593 · v1 · submitted 2026-03-04 · 💻 cs.DC

Recognition: 2 theorem links

· Lean Theorem

Benchmarking Compound AI Applications for Hardware-Software Co-Design

Paramuth Samuthrsindh , Angel Cervantes , Varun Gohil , Gohar Irfan Chaudhry , Christina Delimitrou , Adam Belay

Authors on Pith no claims yet

Pith reviewed 2026-05-15 16:12 UTC · model grok-4.3

classification 💻 cs.DC

keywords compound AIbenchmarking suitehardware-software co-designresource efficiencyLLM applicationsdatacenter workloadscross-stack analysis

0 comments

The pith

A benchmarking suite for compound AI applications enables cross-stack analysis to derive hardware-software co-design principles for higher resource efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a benchmarking suite for compound AI applications that combine large language models, other machine learning models, external tools, and data sources. This suite examines the large configuration space across the full stack from applications and serving software down to hardware. Using measurements from the suite, the authors extract key takeaways and design principles for co-design. These principles target improved resource efficiency in datacenter deployments where such applications are growing common. A sympathetic reader would care because standardized benchmarks have been missing for this workload, making systematic exploration of performance, cost, and consumption difficult.

Core claim

We present our benchmarking suite used for cross-stack analysis of Compound AI applications. Using this, we derive key takeaways and design principles spanning several layers of the stack for hardware-software co-design to unlock higher resource-efficiency.

What carries the argument

The benchmarking suite for compound AI applications that spans applications, serving software, and hardware to analyze configuration space effects on performance and resources.

If this is right

Design principles guide optimizations at multiple stack layers to reduce resource consumption.
Cross-stack insights support better decisions on application serving and hardware choices.
Standardized evaluation helps navigate trade-offs in performance, deployment cost, and efficiency.
The suite provides a foundation for analyzing diverse use cases of compound AI workloads.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Hardware architects could use the principles to prioritize features that align with common interactions in these applications.
Extending the suite to new models or tools might reveal additional principles as workloads evolve.
Adoption could lower overall datacenter energy and operational costs if the principles generalize beyond tested cases.
Similar benchmarking approaches might apply to other emerging composite workloads beyond AI.

Load-bearing premise

The proposed suite adequately covers the large configuration space across applications, serving software, and hardware without missing critical interactions that would alter the derived design principles.

What would settle it

Testing the suite on additional compound AI applications and hardware configurations where measured resource efficiency and performance deviate from the expected design principles due to unaccounted layer interactions.

Figures

Figures reproduced from arXiv: 2604.09593 by Adam Belay, Angel Cervantes, Christina Delimitrou, Gohar Irfan Chaudhry, Paramuth Samuthrsindh, Varun Gohil.

**Figure 2.** Figure 2: Temporal resource dominance in the end-to-end [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 4.** Figure 4: System resource utilization timeline of RAG [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 3.** Figure 3: System resource utilization timeline of RAG [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 5.** Figure 5: Video-QA sensitivity to per-component GPU frequencies and impact on energy and tail latency under varying load. [PITH_FULL_IMAGE:figures/full_fig_p005_5.png] view at source ↗

**Figure 6.** Figure 6: Video-QA MM LLM GPU’s power draw at different [PITH_FULL_IMAGE:figures/full_fig_p006_6.png] view at source ↗

**Figure 7.** Figure 7: Accuracy versus 90th percentile latency trade-off in [PITH_FULL_IMAGE:figures/full_fig_p006_7.png] view at source ↗

**Figure 8.** Figure 8: Comparison of KV cache usage for OpenEvolve between the default implementation vs. prompt-reordering (optimized). [PITH_FULL_IMAGE:figures/full_fig_p008_8.png] view at source ↗

**Figure 9.** Figure 9: MM cache usage for Video-QA for random routing [PITH_FULL_IMAGE:figures/full_fig_p008_9.png] view at source ↗

**Figure 10.** Figure 10: Highlights demonstrate the segments that result in [PITH_FULL_IMAGE:figures/full_fig_p009_10.png] view at source ↗

read the original abstract

Compound AI applications, composed from interactions between Large Language Models (LLMs), Machine Learning (ML) models, external tools and data sources are quickly becoming an integral workload in datacenters. Their diverse sub-components and use-cases present a large configuration-space across the deployment stack -- ranging from applications and serving software down to hardware -- each of which may influence the application performance, deployment cost, and/or resource consumption. Despite their rapid adoption, however, the systems community lacks a standardized benchmark for analyzing this complicated design-space and guiding in system design. In this work, we present our benchmarking suite used for cross-stack analysis of Compound AI applications. Using this, we derive key takeaways and design principles spanning several layers of the stack for hardware-software co-design to unlock higher resource-efficiency.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

This paper introduces a benchmarking suite for compound AI applications to support hardware-software co-design, but the derived principles rest on whether the chosen workloads adequately sample the large configuration space.

read the letter

The main thing to know about this paper is that the authors built a benchmarking suite for compound AI applications—workloads that combine LLMs with other ML models, external tools, and data sources—and used it to extract cross-stack design principles for better hardware-software co-design in datacenters. This addresses a real gap, since the systems community has benchmarks for simpler cases but nothing standardized for these interactive, multi-component apps that are now common at scale. The paper does a solid job framing why the configuration space is large and why choices from serving software down to hardware matter for performance, cost, and resource use. If the suite includes representative workloads and metrics that reflect actual interactions, it could become a useful reference for efficiency work. The soft spots center on coverage. The takeaways depend on the selected configurations capturing key interactions without major omissions, such as how specific inference engines pair with accelerators to affect memory bandwidth or tool-calling latency. If the evaluation only tests a modest number of points without sensitivity analysis or a clear argument for why the sampling is sufficient, some principles could shift with different setups. I would check the full evaluation section to see the exact number of configurations, the workloads chosen, and any validation against production traces. This paper is for systems researchers and hardware designers focused on datacenter AI efficiency. Readers working on co-design or resource optimization for emerging AI workloads would find the suite description and principles useful, assuming the data supports the claims. It deserves peer review because a benchmark in this area fills a practical need, even if revisions are needed to strengthen the sampling arguments and make the methodology more transparent.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces a benchmarking suite for Compound AI applications composed of LLMs, ML models, external tools, and data sources. The suite supports cross-stack analysis across the application, serving software, and hardware layers in datacenters. The authors apply the suite to derive key takeaways and design principles for hardware-software co-design aimed at improving resource efficiency.

Significance. If the suite provides adequate coverage of the configuration space and the derived principles are supported by systematic evaluation, the work would address a clear gap in standardized benchmarks for these emerging workloads. Providing an open benchmarking suite is a concrete strength that enables future reproducible studies in hardware-software co-design.

major comments (2)

[§4 (Evaluation)] §4 (Evaluation): the paper reports results from a modest number of configurations without explicit arguments for coverage of the large space or sensitivity analysis. This leaves open whether omitted interactions (e.g., specific LLM inference engine + accelerator pairings on memory bandwidth or tool-calling latency) would change the reported takeaways.
[§5 (Design Principles)] §5 (Design Principles): the cross-stack design principles are presented as general but rest on the sampling performed in the evaluation; without quantitative justification of how the chosen points represent the full space, the load-bearing claim that the suite enables reliable derivation of principles is not yet demonstrated.

minor comments (1)

[Figures] Figure captions and axis labels in the evaluation figures could be expanded to clarify which layers of the stack are being varied in each plot.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on evaluation coverage and the generality of the derived design principles. We address each point below and will incorporate clarifications and additional analysis in the revised manuscript.

read point-by-point responses

Referee: §4 (Evaluation): the paper reports results from a modest number of configurations without explicit arguments for coverage of the large space or sensitivity analysis. This leaves open whether omitted interactions (e.g., specific LLM inference engine + accelerator pairings on memory bandwidth or tool-calling latency) would change the reported takeaways.

Authors: We selected the evaluated configurations to span representative dimensions of the space, including model scales (7B–70B), serving frameworks (vLLM and TensorRT-LLM), accelerators (A100/H100), and tool-call latencies drawn from production traces. To strengthen the section we will add an explicit sampling rationale together with a sensitivity study that varies batch size, memory bandwidth, and engine–accelerator pairings. The open-source suite is intentionally extensible so that omitted interactions can be explored by users; the reported results are therefore intended as illustrative demonstrations rather than exhaustive enumeration. revision: partial
Referee: §5 (Design Principles): the cross-stack design principles are presented as general but rest on the sampling performed in the evaluation; without quantitative justification of how the chosen points represent the full space, the load-bearing claim that the suite enables reliable derivation of principles is not yet demonstrated.

Authors: We will revise §5 to state the scope of the principles explicitly, include quantitative coverage metrics (ranges and distributions of model sizes, hardware specs, and latency budgets), and note that the principles reflect trends observed within the sampled subspace. This will ground the claims in the evaluation methodology while acknowledging that broader validation remains possible through the released benchmark. revision: yes

Circularity Check

0 steps flagged

No circularity: benchmarking suite and takeaways are empirically derived

full rationale

The paper introduces a benchmarking suite for Compound AI applications and uses it to perform cross-stack analysis, from which design principles are extracted. No equations, fitted parameters, self-citations, or ansatzes are described that would reduce any claimed prediction or principle back to the suite's own inputs by construction. The derivation chain consists of running the suite on selected workloads and reporting observed interactions, which is self-contained empirical work rather than a closed definitional loop. This matches the default expectation for a systems benchmarking paper.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields minimal ledger entries; the central claim rests on the domain assumption that compound AI workloads are rapidly becoming integral and that a standardized benchmark is currently missing.

axioms (1)

domain assumption Compound AI applications are quickly becoming an integral workload in datacenters with diverse sub-components that influence performance, cost, and resource consumption.
Stated directly in the abstract as the motivation for the benchmark.

pith-pipeline@v0.9.0 · 5446 in / 1041 out tokens · 37714 ms · 2026-05-15T16:12:28.719169+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We present our benchmarking suite used for cross-stack analysis of Compound AI applications... derive key takeaways and design principles spanning several layers of the stack for hardware-software co-design
IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

Takeaway: Optimizing CPU execution is just as vital as optimizing GPU execution... dynamically adjusting GPU configurations and power management strategies

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 6 internal anchors

[1]

Features | Cursor – The AI Code Editor

2025. Features | Cursor – The AI Code Editor. https://cursor.com/ features. Accessed: 19 August 2025

work page 2025
[2]

Shubham Agrawal, Adeola Adesoba, Dhruv Nandakumar, Katherine Huang, and Vignesh Srinivasakumar. 2024. Build an Agentic Video Workflow with Video Search and Summarization. NVIDIA Developer Blog. Available at: https://developer.nvidia.com/blog/build-an-agentic- video-workflow-with-video-search-and-summarization/

work page 2024
[3]

Gohar Irfan Chaudhry, Esha Choukse, Íñigo Goiri, Rodrigo Fonseca, Adam Belay, and Ricardo Bianchini. 2025. Towards Resource-Efficient Compound AI Systems. InProceedings of the 2025 Workshop on Hot Topics in Operating Systems(Banff, AB, Canada)(HotOS ’25). Asso- ciation for Computing Machinery, New York, NY, USA, 218–224. https://doi.org/10.1145/3713082.3730377

work page doi:10.1145/3713082.3730377 2025
[4]

Gohar Irfan Chaudhry, Esha Choukse, Haoran Qiu, Íñigo Goiri, Ro- drigo Fonseca, Adam Belay, and Ricardo Bianchini. 2025. Murakkab: Resource-Efficient Agentic Workflow Orchestration in Cloud Plat- forms. arXiv:2508.18298 [cs.MA] https://arxiv.org/abs/2508.18298

work page arXiv 2025
[5]

Esha Choukse, Brijesh Warrier, Scot Heath, Luz Belmont, April Zhao, Hassan Ali Khan, Brian Harry, Matthew Kappel, Russell J. Hewett, Kushal Datta, Yu Pei, Caroline Lichtenberger, John Siegler, David Lukofsky, Zaid Kahn, Gurpreet Sahota, Andy Sullivan, Charles Fred- erick, Hien Thai, Rebecca Naughton, Daniel Jurnove, Justin Harp, Reid Carper, Nithish Mahal...

work page arXiv 2025
[6]

The ml. energy benchmark: Toward automated inference energy measurement and optimization,

Jae-Won Chung, Jiachen Liu, Jeff J. Ma, Ruofan Wu, Oh Jun Kweon, Yux- uan Xia, Zhiyu Wu, and Mosharaf Chowdhury. 2025. The ML.ENERGY Benchmark: Toward Automated Inference Energy Measurement and Optimization. arXiv:2505.06371 [cs.LG] https://arxiv.org/abs/2505. 06371

work page arXiv 2025
[7]

NVIDIA Corporation. 2022. NVIDIA H100 Tensor Core GPU Datasheet. https://resources.nvidia.com/en-us-tensor-core/nvidia- tensor-core-gpu-datasheet

work page 2022
[8]

NVIDIA Corporation. 2024. NVIDIA Volta V100 GPU Datasheet. https://images.nvidia.com/content/technologies/volta/pdf/437317- Volta-V100-DS-NV-US-WEB.pdf

work page 2024
[9]

Ling Dai, Yuan-Hao Jiang, Yuanyuan Chen, Zinuo Guo, Tian-Yi Liu, and Xiaobao Shao. 2024. Agent4EDU: Advancing AI for Education with Agentic Workflows. InProceedings of the 2024 3rd International Conference on Artificial Intelligence and Education. 180–185

work page 2024
[10]

Yu Gan, Yanqi Zhang, Dailun Cheng, Ankitha Shetty, Priyal Rathi, Nayan Katarki, Ariana Bruno, Justin Hu, Brian Ritchken, Brendon Jack- son, Kelvin Hu, Meghna Pancholi, Yuan He, Brett Clancy, Chris Colen, Fukang Wen, Catherine Leung, Siyuan Wang, Leon Zaruvinsky, Mateo Espinosa, Rick Lin, Zhongling Liu, Jake Padilla, and Christina Delim- itrou. 2019. An Op...

work page doi:10.1145/3297858.3304013 2019
[11]

Bogdan Georgiev, Javier Gómez-Serrano, Terence Tao, and Adam Zsolt Wagner. 2025. Mathematical exploration and discovery at scale. arXiv:2511.02864 [cs.NE] https://arxiv.org/abs/2511.02864

work page arXiv 2025
[12]

Sagar Goyal, Eti Rastogi, Sree Prasanna Rajagopal, Dong Yuan, Fen Zhao, Jai Chintagunta, Gautam Naik, and Jeff Ward. 2024. HealAI: A healthcare LLM for effective medical documentation. InProceedings of the 17th ACM International Conference on Web Search and Data Mining. 1167–1168

work page 2024
[13]

John L. Henning. 2006. SPEC CPU2006 benchmark descriptions. SIGARCH Comput. Archit. News34, 4 (Sept. 2006), 1–17. https: //doi.org/10.1145/1186736.1186737

work page doi:10.1145/1186736.1186737 2006
[14]

DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines

Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. 2023. DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines. arXiv:2310.03714 [cs.CL] https: //arxiv.org/abs/2310.03714

work page internal anchor Pith review Pith/arXiv arXiv 2023
[15]

Efficient Memory Management for Large Language Model Serving with PagedAttention

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. arXiv:2309.06180 [cs.LG] https: //arxiv.org/abs/2309.06180

work page internal anchor Pith review Pith/arXiv arXiv 2023
[16]

Gonzalez, Hao Zhang, and Ion Sto- ica

Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph E. Gonzalez, Hao Zhang, and Ion Sto- ica. 2023. Efficient Memory Management for Large Language Model Serving with PagedAttention. InProceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles

work page 2023
[17]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval- Augmented Generation for Knowledge-Intensive NLP Tasks.CoRR abs/2005.11401 (2020). arXiv:2005.11401 https://arxiv.org/abs/2005. 11401

work page internal anchor Pith review Pith/arXiv arXiv 2020
[18]

Linux man-pages project. 2025. madvise(2) — give advice about use of memory. Linux manual page. https://man7.org/linux/man-pages/ man2/madvise.2.html Linux man-pages 6.16, accessed 2026-02-05

work page 2025
[19]

John, Tsug- uchika Tabaru, Carole-Jean Wu, Lingjie Xu, Masafumi Yamazaki, Cliff Young, and Matei Zaharia

Peter Mattson, Christine Cheng, Cody Coleman, Greg Diamos, Paulius Micikevicius, David Patterson, Hanlin Tang, Gu-Yeon Wei, Peter Bailis, Victor Bittorf, David Brooks, Dehao Chen, Debojyoti Dutta, Udit Gupta, Kim Hazelwood, Andrew Hock, Xinyuan Huang, Atsushi Ike, Bill Jia, Daniel Kang, David Kanter, Naveen Kumar, Jeffery Liao, Guokai Ma, Deepak Narayanan...

work page arXiv 2020
[20]

Alexander Novikov, Ngân V˜u, Marvin Eisenberger, Emilien Dupont, Po-Sen Huang, Adam Zsolt Wagner, Sergey Shirobokov, Borislav Kozlovskii, Francisco J. R. Ruiz, Abbas Mehrabian, M. Pawan Ku- mar, Abigail See, Swarat Chaudhuri, George Holland, Alex Davies, Sebastian Nowozin, Pushmeet Kohli, and Matej Balog. 2025. Al- phaEvolve: A coding agent for scientific...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[21]

NVIDIA. 2024. NVIDIA A100 Tensor Core GPU Datasheet. https://www.nvidia.com/content/dam/en-zz/Solutions/Data-Center/ a100/pdf/nvidia-a100-datasheet-us-nvidia-1758950-r4-web.pdf

work page 2024
[22]

KernelBench: Can LLMs Write Efficient GPU Kernels?

Anne Ouyang, Simon Guo, Simran Arora, Alex L. Zhang, William Hu, Christopher Ré, and Azalia Mirhoseini. 2025. KernelBench: Can LLMs Write Efficient GPU Kernels?. InInternational Conference on Machine Learning. https://doi.org/10.48550/arXiv.2502.10517

work page internal anchor Pith review doi:10.48550/arxiv.2502.10517 2025
[23]

Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Anselm Levskaya, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. 2022. Efficiently Scaling Transformer Inference. arXiv:2211.05102 [cs.LG] https://arxiv.org/abs/2211.05102

work page arXiv 2022
[24]

2026.Qwen3-Coder-Next Technical Report

Qwen Team. 2026.Qwen3-Coder-Next Technical Report. Technical Re- port. https://github.com/QwenLM/Qwen3-Coder/blob/main/qwen3_ coder_next_tech_report.pdf Accessed: 2026-02-04

work page 2026
[25]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2022. Robust Speech Recognition via Large-Scale Weak Supervision. https://openai.com/research/whisper. OpenAI

work page 2022
[26]

Francisco Romero, Johann Hauswald, Aditi Partap, Daniel Kang, Matei Zaharia, and Christos Kozyrakis. 2022. Optimizing Video Analytics with Declarative Model Relationships.Proc. VLDB Endow.16, 3 (Nov. 2022), 447–460. https://doi.org/10.14778/3570690.3570695

work page doi:10.14778/3570690.3570695 2022
[27]

2025.OpenEvolve: an open-source evolution- ary coding agent

Asankhaya Sharma. 2025.OpenEvolve: an open-source evolution- ary coding agent. https://github.com/algorithmicsuperintelligence/ openevolve

work page 2025
[28]

Jovan Stojkovic, Chaojie Zhang, Íñigo Goiri, Esha Choukse, Haoran Qiu, Rodrigo Fonseca, Josep Torrellas, and Ricardo Bianchini. 2025. TAPAS: Thermal- and Power-Aware Scheduling for LLM Inference in Cloud Platforms. arXiv:2501.02600 [cs.DC] https://arxiv.org/abs/2501. 02600

work page arXiv 2025
[29]

Yunlong Tang, Jing Bi, Siting Xu, Luchuan Song, Susan Liang, Teng Wang, Daoan Zhang, Jie An, Jingyang Lin, Rongyi Zhu, Ali Vosoughi, Chao Huang, Zeliang Zhang, Pinxin Liu, Mingqian Feng, Feng Zheng, Jianguo Zhang, Ping Luo, Jiebo Luo, and Chenliang Xu

work page
[30]

arXiv:2312.17432 [cs.CV] https://arxiv.org/abs/2312.17432

Video Understanding with Large Language Models: A Survey. arXiv:2312.17432 [cs.CV] https://arxiv.org/abs/2312.17432

work page arXiv
[31]

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexan- dre Ramé, Morgane Rivière, Louis Rouillard, Thomas Mesnard, Ge- offrey Cideron, Jean bastien Grill, Sabela Ramos, Edouard Yvinec, Michelle Casbon, Etienne Pot, Ivo Penchev, Gaël Liu, Francesco Visin, Kathleen Kenealy, Lucas...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[32]

2026.Vast.ai GPU Rental Pricing

Vast.ai. 2026.Vast.ai GPU Rental Pricing. https://vast.ai/pricing

work page 2026
[33]

Anjiang Wei, Tianran Sun, Yogesh Seenichamy, Hang Song, Anne Ouyang, Azalia Mirhoseini, Ke Wang, and Alex Aiken. 2025. Astra: A Multi-Agent System for GPU Kernel Performance Optimization. arXiv:2509.07506 [cs.DC] https://arxiv.org/abs/2509.07506

work page arXiv 2025
[34]

Matei Zaharia, Omar Khattab, Lingjiao Chen, Jared Quincy Davis, Heather Miller, Chris Potts, James Zou, Michael Carbin, Jonathan Frankle, Naveen Rao, and Ali Ghodsi. 2024. The Shift from Models to Compound AI Systems. https://bair.berkeley.edu/blog/2024/02/18/ compound-ai-systems/

work page 2024
[35]

Lu Zhang, Tiancheng Zhao, Heting Ying, Yibo Ma, and Kyu- song Lee. 2024. OmAgent: A Multi-modal Agent Framework for Complex Video Understanding with Task Divide-and-Conquer. arXiv:2406.16620 [cs.CL] https://arxiv.org/abs/2406.16620

work page arXiv 2024
[36]

Yaolun Zhang, Yinxu Pan, Yudong Wang, and Jie Cai. 2024. PyBench: Evaluating LLM agent on various real-world coding tasks.arXiv preprint arXiv:2407.16732(2024). 11

work page arXiv 2024