pith. machine review for the scientific record.

arxiv: 2604.25724 · v1 · submitted 2026-04-28 · 💻 cs.AI

Recognition: unknown

Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study

Srikanta Prasad S V, Utkarsh Arora

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 16:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords compound AI systems · inference architecture · production deployment · tail latency · throughput · cost savings · agentic AI · autoscaling

The pith

A modular inference architecture for compound AI systems cuts P95 tail latency by more than 50 percent, raises throughput by up to 3.9 times, and trims costs by 30 to 40 percent in live production use.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper describes a production deployment of a modular inference platform built to serve compound AI systems, which combine multiple models, retrievers, and tools to carry out complex tasks such as running autonomous agents. The authors show how serverless execution, combined with dynamic autoscaling and MLOps pipelines, lets these systems handle concurrent, heterogeneous model calls without the delays common in earlier static setups. Real-world measurements from enterprise deployments show clear gains in latency, throughput, and cost. These findings matter because many current AI applications rely on multi-step agent workflows that place new demands on inference infrastructure.
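To make the concurrency claim concrete, here is a minimal sketch (ours, not the paper's code) of how one agent step can fan out to heterogeneous model endpoints in parallel with Python's asyncio; the endpoint names and the call_model helper are hypothetical stand-ins for serverless invocations.

    import asyncio
    import random
    import time

    async def call_model(endpoint: str, payload: str) -> str:
        # Stand-in for a network call to a serverless model endpoint;
        # per-endpoint latency varies to mimic heterogeneous models.
        await asyncio.sleep(random.uniform(0.05, 0.3))
        return f"{endpoint}: handled '{payload[:24]}'"

    async def agent_step(query: str) -> list[str]:
        # Fan out to a retriever, a planner model, and a tool selector
        # concurrently instead of invoking them one after another.
        return await asyncio.gather(
            call_model("retriever", query),
            call_model("planner-llm", query),
            call_model("tool-selector", query),
        )

    start = time.perf_counter()
    print(asyncio.run(agent_step("summarize open support tickets")))
    print(f"elapsed: {time.perf_counter() - start:.2f}s")

Run sequentially, the three calls would take the sum of their latencies; gathered, the step takes roughly as long as the slowest call, which is the intuition behind the fan-out gains claimed here.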

Core claim

The central claim is that a platform-agnostic modular architecture using serverless execution, dynamic autoscaling, and MLOps pipelines enables compound AI systems to scale model invocations efficiently. In production it delivers more than 50 percent lower P95 tail latency, up to 3.9 times higher throughput, and 30 to 40 percent cost reduction versus prior static deployments. The work also isolates compound-system-specific issues such as multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics that appear when serving agentic workloads.
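The cold-start claim is easiest to see with a toy model. The snippet below (our illustration; the stage names and timings are invented, and the real system is far more involved) shows why cold starts cascade in a sequential agent pipeline: each cold stage adds its full spin-up penalty to end-to-end latency.

    # Warm per-stage service times in seconds (invented for illustration).
    STAGES = {"planner": 0.12, "retriever": 0.04, "generator": 0.30}
    COLD_START = 2.0  # assumed container spin-up penalty in seconds

    def pipeline_latency(cold_stages: set[str]) -> float:
        # Stages run sequentially, so every cold start is paid in full.
        return sum(t + (COLD_START if name in cold_stages else 0.0)
                   for name, t in STAGES.items())

    print(f"all warm: {pipeline_latency(set()):.2f}s")
    print(f"one cold: {pipeline_latency({'planner'}):.2f}s")
    print(f"all cold: {pipeline_latency(set(STAGES)):.2f}s")

Keeping even one stage warm caps the worst case, which is why warm pools and predictive scaling matter more for compound pipelines than for single-model serving.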

What carries the argument

The modular, platform-agnostic inference architecture that combines serverless execution, dynamic autoscaling, and MLOps pipelines to serve concurrent heterogeneous model invocations in compound workflows.

If this is right

  • Model invocations within agent workflows can scale in parallel without manual intervention.
  • Bursty multi-agent workloads become manageable through automatic resource adjustment (see the autoscaling sketch after this list).
  • Rapid iteration on individual models can occur without disrupting overall system availability.
  • Unique overheads such as fan-out across models and cascading cold starts receive targeted mitigation.
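
As a rough illustration of the second point, here is a hedged sketch of a target-tracking autoscaling policy of the kind such a platform might use; the target, bounds, and sampling loop are assumptions, not the production configuration.

    import math

    TARGET_CONCURRENCY_PER_REPLICA = 8  # assumed tuning knob
    MIN_REPLICAS, MAX_REPLICAS = 1, 64  # assumed fleet bounds

    def desired_replicas(in_flight: int) -> int:
        # Size the fleet so each replica serves roughly the target
        # number of concurrent requests, clamped to avoid thrash.
        want = math.ceil(in_flight / TARGET_CONCURRENCY_PER_REPLICA)
        return max(MIN_REPLICAS, min(MAX_REPLICAS, want))

    # A bursty multi-agent trace: in-flight requests per interval.
    for in_flight in [10, 40, 160, 90, 20]:
        print(f"in-flight={in_flight:4d} -> replicas={desired_replicas(in_flight)}")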

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same dynamic scaling approach could extend to other multi-component AI applications beyond agents once similar production data becomes available.
  • Analysis of heterogeneous scaling dynamics may guide capacity planning for future systems that compose even more models and tools.
  • Enterprises considering agentic AI may find that shifting from static to modular inference reduces both operational risk and total cost of ownership as workload volume grows.

Load-bearing premise

The observed reductions in latency, gains in throughput, and cost savings are caused by the new modular architecture rather than by differences in workload, hardware, or other unmentioned changes between the old and new deployments.

What would settle it

Run the identical compound AI workloads on both the prior static infrastructure and the new modular architecture under matched conditions and measure the resulting P95 latency, throughput, and operating cost.
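
A minimal version of that settling experiment could look like the sketch below (ours; the replay harness and the two send functions are stand-ins for HTTP clients against the static and modular endpoints): replay one recorded workload against each backend and compare P95 latency and throughput.

    import random
    import time

    def p95(latencies: list[float]) -> float:
        s = sorted(latencies)
        return s[int(0.95 * (len(s) - 1))]

    def replay(workload: list[str], send) -> tuple[float, float]:
        latencies, t0 = [], time.perf_counter()
        for request in workload:
            start = time.perf_counter()
            send(request)  # issue the recorded request to one backend
            latencies.append(time.perf_counter() - start)
        wall = time.perf_counter() - t0
        return p95(latencies), len(workload) / wall  # (P95 s, req/s)

    # Simulated backends; a real run would POST logged requests to
    # the prior static deployment and the new modular one.
    static_send = lambda r: time.sleep(random.uniform(0.02, 0.20))
    modular_send = lambda r: time.sleep(random.uniform(0.01, 0.08))

    workload = [f"req-{i}" for i in range(200)]
    for name, send in [("static", static_send), ("modular", modular_send)]:
        tail, rps = replay(workload, send)
        print(f"{name:8s} P95={tail*1000:6.1f} ms  throughput={rps:5.1f} req/s")

Cost would come from billing data over the same window; the key discipline is that workload, hardware, and model versions stay fixed across the two runs.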

Figures

Figures reproduced from arXiv: 2604.25724 by Srikanta Prasad S V, Utkarsh Arora.

Figure 1. Cognitive orchestration in the Atlas Reasoning Engine: the Planner Agent decomposes user queries; the Tool Selector … (caption truncated at source)
Original abstract

Modern enterprise AI applications increasingly rely on compound AI systems - architectures that compose multiple models, retrievers, and tools to accomplish complex tasks. Deploying such systems in production demands inference infrastructure that can efficiently serve concurrent, heterogeneous model invocations while maintaining cost-effectiveness and low latency. This paper presents a production deployment study of a modular, platform-agnostic inference architecture developed at Salesforce to support compound AI use cases including Agentforce (autonomous AI agents) and ApexGuru (AI-powered code analysis). The system integrates serverless execution, dynamic autoscaling, and MLOps pipelines to deliver consistent low-latency inference across multi-component agent workflows. We report production results demonstrating over 50% reduction in tail latency (P95), up to 3.9x throughput improvement, and 30 to 40% cost savings compared to prior static deployments. We further present a novel analysis of compound-system-specific challenges including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics that emerge uniquely when serving agentic workloads. Through detailed case studies and operational lessons, we illustrate how the architecture enables compound AI systems to scale model invocations in parallel, handle bursty multi-agent workloads, and support rapid model iteration - capabilities essential for operationalizing agentic AI at enterprise scale.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a production deployment study of a modular, platform-agnostic inference architecture for compound AI systems at Salesforce, supporting use cases such as Agentforce and ApexGuru. It integrates serverless execution, dynamic autoscaling, and MLOps pipelines, and reports empirical production results including over 50% reduction in P95 tail latency, up to 3.9x throughput improvement, and 30-40% cost savings compared to prior static deployments. The work also analyzes compound-system-specific challenges including multi-model fan-out overhead, cascading cold-start propagation, and heterogeneous scaling dynamics, illustrated through case studies and operational lessons for scaling agentic AI workloads.

Significance. If the reported gains can be causally attributed to the architecture with appropriate controls, this would represent a valuable contribution as one of the few detailed production studies on inference infrastructure for multi-component agent workflows. The emphasis on real-world challenges like cold-start propagation in heterogeneous setups and the practical lessons for enterprise deployment provide actionable insights that could guide future systems design in applied AI.

major comments (1)
  1. [Abstract and Results] Abstract and production results description: The central claims of >50% P95 latency reduction, 3.9x throughput improvement, and 30-40% cost savings are presented as resulting from the modular/serverless/dynamic architecture versus 'prior static deployments,' but the manuscript provides no details on comparison methodology, including workload equivalence, hardware matching, traffic pattern controls, model version consistency, or use of A/B testing. This is load-bearing for the attribution of improvements to the proposed architecture rather than confounding factors.
minor comments (2)
  1. [Abstract] Abstract: The summary of results omits any reference to error bars, statistical tests, data exclusion criteria, or sample sizes, reducing clarity on the robustness of the numeric claims.
  2. [Throughout] Terminology: Ensure consistent definition of terms such as 'compound AI systems' and 'fan-out overhead' at first use, and consider adding a table summarizing the key challenges and mitigations for reader accessibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for highlighting the importance of methodological transparency in attributing the reported performance gains. We have revised the manuscript to address the concern directly.

read point-by-point responses
  1. Referee: [Abstract and Results] Abstract and production results description: The central claims of >50% P95 latency reduction, 3.9x throughput improvement, and 30-40% cost savings are presented as resulting from the modular/serverless/dynamic architecture versus 'prior static deployments,' but the manuscript provides no details on comparison methodology, including workload equivalence, hardware matching, traffic pattern controls, model version consistency, or use of A/B testing. This is load-bearing for the attribution of improvements to the proposed architecture rather than confounding factors.

    Authors: We agree that the original manuscript lacked explicit details on the comparison methodology, which weakens the strength of the causal claims. The reported improvements reflect before-and-after measurements taken from the same production environment at Salesforce, using request logs to match workload composition, volume, and traffic patterns as closely as possible. Hardware instances were drawn from the same fleet, and model versions were held constant across the transition window. A controlled A/B test was not performed because the deployment occurred in a live enterprise setting where service disruption had to be minimized. We have added a new subsection (4.2) titled 'Production Comparison Methodology' that documents these controls, including the use of historical log replay for workload equivalence verification and explicit checks for hardware and model consistency. This revision directly addresses the referee's point while preserving the production-study nature of the work. revision: yes
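
The log-replay equivalence check the rebuttal invokes can be made concrete; the sketch below (our construction, not the authors' code) compares the per-model call mix of two logging windows with a total-variation distance, one simple way to verify workload equivalence.

    from collections import Counter

    def call_mix(log: list[str]) -> dict[str, float]:
        counts = Counter(log)
        total = sum(counts.values())
        return {model: c / total for model, c in counts.items()}

    def total_variation(p: dict[str, float], q: dict[str, float]) -> float:
        keys = set(p) | set(q)
        return 0.5 * sum(abs(p.get(k, 0.0) - q.get(k, 0.0)) for k in keys)

    # Invented call logs for the before and after measurement windows.
    before = ["planner"] * 40 + ["retriever"] * 35 + ["generator"] * 25
    after = ["planner"] * 42 + ["retriever"] * 33 + ["generator"] * 25
    print(f"TV distance = {total_variation(call_mix(before), call_mix(after)):.3f}")

A distance near zero supports, but does not prove, that the two windows exercised the system with comparable traffic.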

Circularity Check

0 steps flagged

No circularity: empirical production measurements with no derivations or self-referential reductions

full rationale

The paper is a production deployment study reporting observed metrics (P95 latency reduction, throughput gains, cost savings) from a before/after comparison of static vs. modular inference architectures. No mathematical derivations, fitted parameters renamed as predictions, or load-bearing self-citations appear in the abstract or described structure. The central claims are direct empirical observations rather than any chain that reduces to its own inputs by construction. The noted weakness (lack of controlled isolation for causality) is a validity concern, not circularity. This is a standard non-circular empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available for review; no free parameters, axioms, or invented entities can be extracted from the provided text.

pith-pipeline@v0.9.0 · 5528 in / 1090 out tokens · 48508 ms · 2026-05-07T16:14:40.160806+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 4 canonical work pages · 3 internal anchors

  1. [1] M. Zaharia et al., “The Shift from Models to Compound AI Systems,” Berkeley AI Research Blog, Feb. 2024.
  2. [2] T. Brown et al., “Language Models are Few-Shot Learners,” in Proc. NeurIPS, vol. 33, pp. 1877–1901, 2020.
  3. [3] S. Suri et al., “A Blueprint Architecture of Compound AI Systems for Enterprise,” arXiv:2406.00584, 2024.
  4. [4] S. Prasad, U. Arora, et al., “Scalable Inference Architectures for Operationalizing AI/ML in Enterprise Systems,” in Proc. IEEE AIMLSYS, Oct. 2025.
  5. [5] O. Khattab et al., “DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines,” arXiv:2310.03714, 2023.
  6. [6] J. Dean et al., “Large Scale Distributed Deep Networks,” in Proc. NeurIPS, vol. 25, pp. 1223–1231, 2012.
  7. [7] D. Narayanan et al., “PipeDream: Generalized Pipeline Parallelism for DNN Training,” in Proc. 27th ACM SOSP, pp. 1–15, 2019.
  8. [8] Z. Jia, M. Zaharia, and A. Aiken, “Beyond Data and Model Parallelism for Deep Neural Networks,” in Proc. 3rd MLSys, 2020.
  9. [9] S. Rajbhandari et al., “ZeRO: Memory Optimizations Toward Training Trillion Parameter Models,” in Proc. SC, 2020.
  10. [10] K. Sreedharan et al., “Scalable Inference Serving Systems for Deep Learning Models,” in Proc. 19th USENIX NSDI, pp. 135–149, 2022.
  11. [11] S. R. Kotini et al., “Improved ML Model Deployment Using Amazon SageMaker Inference Recommender,” AWS Machine Learning Blog, 2023.
  12. [12] R. Jadav and R. Aggarwal, “How AWS SageMaker Inference Components Save AI Inference Costs by Up to 8x,” AWS Machine Learning Blog, 2025.
  13. [13] S. Prasad, U. Arora, et al., “How Amazon Bedrock Custom Model Import Streamlined LLM Deployment for Salesforce,” AWS Machine Learning Blog, Oct. 2025.
  14. [14] W. Kwon et al., “Efficient Memory Management for Large Language Model Serving with PagedAttention,” in Proc. 29th ACM SOSP, 2023.
  15. [15] G. Xiao et al., “Efficient Streaming Language Models with Attention Sinks,” arXiv:2309.17453, 2023.
  16. [16] K. Santhanam et al., “ALTO: An Efficient Network Orchestrator for Compound AI Systems,” in Proc. 4th Workshop on ML and Systems, pp. 117–125, 2024.
  17. [17] L. Zheng et al., “SGLang: Efficient Execution of Structured Language Model Programs,” arXiv:2312.07104, 2023.