pith. machine review for the scientific record.

arxiv: 2604.25724 · v1 · submitted 2026-04-28 · 💻 cs.AI

Recognition: unknown

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

Srikanta Prasad S V, Utkarsh Arora

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords compound AI systems · inference architecture · production deployment · tail latency · throughput · cost savings · agentic AI · autoscaling

The pith

A modular inference architecture for compound AI systems cuts P95 tail latency by more than 50 percent, raises throughput by up to 3.9 times, and trims costs by 30 to 40 percent in live production use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper describes a production deployment of a modular inference platform built to serve compound AI systems, which combine multiple models, retrievers, and tools to carry out complex tasks such as running autonomous agents. The authors show how serverless execution, combined with dynamic autoscaling and MLOps pipelines, lets these systems handle concurrent, heterogeneous model calls without the delays common in earlier static setups. Real-world measurements from enterprise deployments show clear gains in latency, throughput, and cost. These findings matter because many current AI applications rely on multi-step agent workflows that place new demands on inference infrastructure.
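To make the concurrency claim concrete, here is a minimal sketch (ours, not the paper's code) of how one agent step can fan out to heterogeneous model endpoints in parallel with Python's asyncio; the endpoint names and the call_model helper are hypothetical stand-ins for serverless invocations.

    import asyncio
    import random
    import time

    async def call_model(endpoint: str, payload: str) -> str:
        # Stand-in for a network call to a serverless model endpoint;
        # per-endpoint latency varies to mimic heterogeneous models.
        await asyncio.sleep(random.uniform(0.05, 0.3))
        return f"{endpoint}: handled '{payload[:24]}'"

    async def agent_step(query: str) -> list[str]:
        # Fan out to a retriever, a planner model, and a tool selector
        # concurrently instead of invoking them one after another.
        return await asyncio.gather(
            call_model("retriever", query),
            call_model("planner-llm", query),
            call_model("tool-selector", query),
        )

    start = time.perf_counter()
    print(asyncio.run(agent_step("summarize open support tickets")))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")

Run sequentially, the three calls would take the sum of their latencies; gathered, the step takes roughly as long as the slowest call, which is the intuition behind the fan-out gains claimed here.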

Core claim

The central claim is that a platform-agnostic modular architecture using serverless execution, dynamic autoscaling, and MLOps pipelines enables compound AI systems to scale model invocations efficiently. In production it delivers more than 50 percent lower P95 tail latency, up to 3.9 times higher throughput, and 30 to 40 percent cost reduction versus prior static deployments. The work also isolates compound-system-specific issues such as multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics that appear when serving agentic workloads.
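The cold-start claim is easiest to see with a toy model. The snippet below (our illustration; the stage names and timings are invented, and the real system is far more involved) shows why cold starts cascade in a sequential agent pipeline: each cold stage adds its full spin-up penalty to end-to-end latency.

    # Warm per-stage service times in seconds (invented for illustration).
    STAGES = {"planner": 0.12, "retriever": 0.04, "generator": 0.30}
    COLD_START = 2.0  # assumed container spin-up penalty in seconds

    def pipeline_latency(cold_stages: set[str]) -> float:
        # Stages run sequentially, so every cold start is paid in full.
        return sum(t + (COLD_START if name in cold_stages else 0.0)
                   for name, t in STAGES.items())

    print(f"all warm: {pipeline_latency(set()):.2f}s")
    print(f"one cold: {pipeline_latency({'planner'}):.2f}s")
    print(f"all cold: {pipeline_latency(set(STAGES)):.2f}s")

Keeping even one stage warm caps the worst case, which is why warm pools and predictive scaling matter more for compound pipelines than for single-model serving.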

What carries the argument

The modular, platform-agnostic inference architecture that combines serverless execution, dynamic autoscaling, and MLOps pipelines to serve concurrent heterogeneous model invocations in compound workflows.

If this is right

  • Model invocations within agent workflows can scale in parallel without manual intervention.
  • Bursty multi-agent workloads become manageable through automatic resource adjustment (see the autoscaling sketch after this list).
  • Rapid iteration on individual models can occur without disrupting overall system availability.
  • Unique overheads such as fan-out across models and cascading cold starts receive targeted mitigation.
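
As a rough illustration of the second point, here is a hedged sketch of a target-tracking autoscaling policy of the kind such a platform might use; the target, bounds, and sampling loop are assumptions, not the production configuration.

    import math

    TARGET_CONCURRENCY_PER_REPLICA = 8  # assumed tuning knob
    MIN_REPLICAS, MAX_REPLICAS = 1, 64  # assumed fleet bounds

    def desired_replicas(in_flight: int) -> int:
        # Size the fleet so each replica serves roughly the target
        # number of concurrent requests, clamped to avoid thrash.
        want = math.ceil(in_flight / TARGET_CONCURRENCY_PER_REPLICA)
        return max(MIN_REPLICAS, min(MAX_REPLICAS, want))

    # A bursty multi-agent trace: in-flight requests per interval.
    for in_flight in [10, 40, 160, 90, 20]:
        print(f"in-flight={in_flight:4d} -> replicas={desired_replicas(in_flight)}")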

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same dynamic scaling approach could extend to other multi-component AI applications beyond agents once similar production data becomes available.
  • Analysis of heterogeneous scaling dynamics may guide capacity planning for future systems that compose even more models and tools.
  • Enterprises considering agentic AI may find that shifting from static to modular inference reduces both operational risk and total cost of ownership as workload volume grows.

Load-bearing premise

The observed reductions in latency, gains in throughput, and cost savings are caused by the new modular architecture rather than by differences in workload, hardware, or other unmentioned changes between the old and new deployments.

What would settle it

Run the identical compound AI workloads on both the prior static infrastructure and the new modular architecture under matched conditions and measure the resulting P95 latency, throughput, and operating cost.
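
A minimal version of that settling experiment could look like the sketch below (ours; the replay harness and the two send functions are stand-ins for HTTP clients against the static and modular endpoints): replay one recorded workload against each backend and compare P95 latency and throughput.

    import random
    import time

    def p95(latencies: list[float]) -> float:
        s = sorted(latencies)
        return s[int(0.95 * (len(s) - 1))]

    def replay(workload: list[str], send) -> tuple[float, float]:
        latencies, t0 = [], time.perf_counter()
        for request in workload:
            start = time.perf_counter()
            send(request)  # issue the recorded request to one backend
            latencies.append(time.perf_counter() - start)
        wall = time.perf_counter() - t0
        return p95(latencies), len(workload) / wall  # (P95 s, req/s)

    # Simulated backends; a real run would POST logged requests to
    # the prior static deployment and the new modular one.
    static_send = lambda r: time.sleep(random.uniform(0.02, 0.20))
    modular_send = lambda r: time.sleep(random.uniform(0.01, 0.08))

    workload = [f"req-{i}" for i in range(200)]
    for name, send in [("static", static_send), ("modular", modular_send)]:
        tail, rps = replay(workload, send)
        print(f"{name:8s} P95={tail*1000:6.1f} ms  throughput={rps:5.1f} req/s")

Cost would come from billing data over the same window; the key discipline is that workload, hardware, and model versions stay fixed across the two runs.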

Figures

Figures reproduced from arXiv: 2604.25724 by Srikanta Prasad S V, Utkarsh Arora.

Figure 1. Cognitive orchestration in the Atlas Reasoning Engine: the Planner Agent decomposes user queries; the Tool Selector … (caption truncated at source)
Original abstract

Modern enterprise AI applications increasingly rely on compound AI systems - architectures that compose multiple models, retrievers, and tools to accomplish complex tasks. Deploying such systems in production demands inference infrastructure that can efficiently serve concurrent, heterogeneous model invocations while maintaining cost-effectiveness and low latency. This paper presents a production deployment study of a modular, platform-agnostic inference architecture developed at Salesforce to support compound AI use cases including Agentforce (autonomous AI agents) and ApexGuru (AI-powered code analysis). The system integrates serverless execution, dynamic autoscaling, and MLOps pipelines to deliver consistent low-latency inference across multi-component agent workflows. We report production results demonstrating over 50% reduction in tail latency (P95), up to 3.9x throughput improvement, and 30 to 40% cost savings compared to prior static deployments. We further present a novel analysis of compound-system-specific challenges including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics that emerge uniquely when serving agentic workloads. Through detailed case studies and operational lessons, we illustrate how the architecture enables compound AI systems to scale model invocations in parallel, handle bursty multi-agent workloads, and support rapid model iteration - capabilities essential for operationalizing agentic AI at enterprise scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a production deployment study of a modular, platform-agnostic inference architecture for compound AI systems at Salesforce, supporting use cases such as Agentforce and ApexGuru. It integrates serverless execution, dynamic autoscaling, and MLOps pipelines, and reports empirical production results including over 50% reduction in P95 tail latency, up to 3.9x throughput improvement, and 30-40% cost savings compared to prior static deployments. The work also analyzes compound-system-specific challenges including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics, illustrated through case studies and operational lessons for scaling agentic AI workloads.

Significance. If the reported gains can be causally attributed to the architecture with appropriate controls, this would represent a valuable contribution as one of the few detailed production studies on inference infrastructure for multi-component agent workflows. The emphasis on real-world challenges like cold-start propagation in heterogeneous setups and the practical lessons for enterprise deployment provide actionable insights that could guide future systems design in applied AI.

major comments (1)
  1. [Abstract and Results] Abstract and production results description: The central claims of >50% P95 latency reduction, 3.9x throughput improvement, and 30-40% cost savings are presented as resulting from the modular/serverless/dynamic architecture versus 'prior static deployments,' but the manuscript provides no details on comparison methodology, including workload equivalence, hardware matching, traffic pattern controls, model version consistency, or use of A/B testing. This is load-bearing for the attribution of improvements to the proposed architecture rather than confounding factors.
minor comments (2)
  1. [Abstract] Abstract: The summary of results omits any reference to error bars, statistical tests, data exclusion criteria, or sample sizes, reducing clarity on the robustness of the numeric claims.
  2. [Throughout] Terminology: Ensure consistent definition of terms such as 'compound AI systems' and 'fan-out overhead' at first use, and consider adding a table summarizing the key challenges and mitigations for reader accessibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of methodological transparency in attributing the reported performance gains. We have revised the manuscript to address the concern directly.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and production results description: The central claims of >50% P95 latency reduction, 3.9x throughput improvement, and 30-40% cost savings are presented as resulting from the modular/serverless/dynamic architecture versus 'prior static deployments,' but the manuscript provides no details on comparison methodology, including workload equivalence, hardware matching, traffic pattern controls, model version consistency, or use of A/B testing. This is load-bearing for the attribution of improvements to the proposed architecture rather than confounding factors.

    Authors: We agree that the original manuscript lacked explicit details on the comparison methodology, which weakens the strength of the causal claims. The reported improvements reflect before-and-after measurements taken from the same production environment at Salesforce, using request logs to match workload composition, volume, and traffic patterns as closely as possible. Hardware instances were drawn from the same fleet, and model versions were held constant across the transition window. A controlled A/B test was not performed because the deployment occurred in a live enterprise setting where service disruption had to be minimized. We have added a new subsection (4.2) titled 'Production Comparison Methodology' that documents these controls, including the use of historical log replay for workload equivalence verification and explicit checks for hardware and model consistency. This revision directly addresses the referee's point while preserving the production-study nature of the work. revision: yes
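
The log-replay equivalence check the rebuttal invokes can be made concrete; the sketch below (our construction, not the authors' code) compares the per-model call mix of two logging windows with a total-variation distance, one simple way to verify workload equivalence.

    from collections import Counter

    def call_mix(log: list[str]) -> dict[str, float]:
        counts = Counter(log)
        total = sum(counts.values())
        return {model: c / total for model, c in counts.items()}

    def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    # Invented call logs for the before and after measurement windows.
    before = ["planner"] * 40 + ["retriever"] * 35 + ["generator"] * 25
    after = ["planner"] * 42 + ["retriever"] * 33 + ["generator"] * 25
    print(f"TV distance = {total_variation(call_mix(before), call_mix(after)):.3f}")

A distance near zero supports, but does not prove, that the two windows exercised the system with comparable traffic.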

Circularity Check

0 steps flagged

No circularity: empirical production measurements with no derivations or self-referential reductions

full rationale

The paper is a production deployment study reporting observed metrics (P95 latency reduction, throughput gains, cost savings) from a before/after comparison of static vs. modular inference architectures. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central claims are direct empirical observations rather than any chain that reduces to its own inputs by construction. The noted weakness (lack of controlled isolation for causality) is a validity concern, not circularity. This is a standard non-circular empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available for review; no free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5528 in / 1090 out tokens · 48508 ms · 2026-05-07T16:14:40.160806+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1] M. Zaharia et al., “The Shift from Models to Compound AI Systems,” Berkeley AI Research Blog, Feb. 2024.
  2. [2] T. Brown et al., “Language Models are Few-Shot Learners,” in Proc. NeurIPS, vol. 33, pp. 1877–1901, 2020.
  3. [3] S. Suri et al., “A Blueprint Architecture of Compound AI Systems for Enterprise,” arXiv:2406.00584, 2024.
  4. [4] S. Prasad, U. Arora, et al., “Scalable Inference Architectures for Operationalizing AI/ML in Enterprise Systems,” in Proc. IEEE AIMLSYS, Oct. 2025.
  5. [5] O. Khattab et al., “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines,” arXiv:2310.03714, 2023.
  6. [6] J. Dean et al., “Large Scale Distributed Deep Networks,” in Proc. NeurIPS, vol. 25, pp. 1223–1231, 2012.
  7. [7] D. Narayanan et al., “PipeDream: Generalized Pipeline Parallelism for DNN Training,” in Proc. 27th ACM SOSP, pp. 1–15, 2019.
  8. [8] Z. Jia, M. Zaharia, and A. Aiken, “Beyond Data and Model Parallelism for Deep Neural Networks,” in Proc. 3rd MLSys, 2020.
  9. [9] S. Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models,” in Proc. SC, 2020.
  10. [10] K. Sreedharan et al., “Scalable Inference Serving Systems for Deep Learning Models,” in Proc. 19th USENIX NSDI, pp. 135–149, 2022.
  11. [11] S. R. Kotini et al., “Improved ML Model Deployment Using Amazon SageMaker Inference Recommender,” AWS Machine Learning Blog, 2023.
  12. [12] R. Jadav and R. Aggarwal, “How AWS SageMaker Inference Components Save AI Inference Costs by Up to 8x,” AWS Machine Learning Blog, 2025.
  13. [13] S. Prasad, U. Arora, et al., “How Amazon Bedrock Custom Model Import Streamlined LLM Deployment for Salesforce,” AWS Machine Learning Blog, Oct. 2025.
  14. [14] W. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” in Proc. 29th ACM SOSP, 2023.
  15. [15] G. Xiao et al., “Efficient Streaming Language Models with Attention Sinks,” arXiv:2309.17453, 2023.
  16. [16] K. Santhanam et al., “ALTO: An Efficient Network Orchestrator for Compound AI Systems,” in Proc. 4th Workshop on ML and Systems, pp. 117–125, 2024.
  17. [17] L. Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,” arXiv:2312.07104, 2023.