Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective
Agentic AI serving shifts monolithic LLM inference toward autonomous problem-solvers that can plan, call tools, reason, and adapt on the fly. Because of their diverse task-execution needs, such serving relies heavily on heterogeneous CPU-GPU systems, where most of the external tools responsible for agentic capability either run on or are orchestrated by the CPU. To develop a deeper understanding of the CPU's role, this paper characterizes and analyzes the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and choose representative workloads that capture its algorithmic diversity. We then perform a runtime characterization of these workloads, analyzing end-to-end latency and throughput on two different hardware systems to isolate their respective architectural bottlenecks. Based on these insights, we present two scheduling optimizations: (1) CPU-Aware Overlapped Micro-Batching (COMB) for homogeneous agentic workloads and (2) Mixed Agentic Scheduling (MAS) for heterogeneous ones. Specifically, these methods improve concurrent CPU-GPU utilization while reducing skewed resource allocation under heterogeneous execution. Experimental evaluations on the two hardware systems demonstrate that COMB yields up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open-loop load. Additionally, under heterogeneous open-loop load, MAS reduces the total latency of minority request types by up to 2.37x/2.49x at the P50/P90 percentiles.
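The core idea behind COMB, as the abstract describes it, is to keep the CPU and GPU busy concurrently rather than serializing inference and tool execution. A minimal illustrative sketch (not the paper's implementation; `gpu_infer`, `cpu_tools`, and the micro-batch size are hypothetical stand-ins) could pipeline the two stages so the CPU runs tool calls for one micro-batch while the GPU infers the next:

```python
import time
from concurrent.futures import ThreadPoolExecutor

def gpu_infer(micro_batch):
    # Stand-in for GPU-side LLM inference on a micro-batch of requests.
    time.sleep(0.05)
    return [f"plan-{r}" for r in micro_batch]

def cpu_tools(plans):
    # Stand-in for CPU-side tool execution / orchestration.
    time.sleep(0.05)
    return [p + ":done" for p in plans]

def comb(requests, mb_size=2):
    """Overlap CPU tool execution for micro-batch i-1 with GPU inference
    for micro-batch i, instead of running the two stages back-to-back."""
    micro_batches = [requests[i:i + mb_size]
                     for i in range(0, len(requests), mb_size)]
    results, pending = [], None
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        for mb in micro_batches:
            plans = gpu_infer(mb)                 # GPU stage (current batch)
            if pending is not None:
                results.extend(pending.result())  # drain previous CPU stage
            pending = cpu_pool.submit(cpu_tools, plans)  # CPU stage overlaps
        if pending is not None:
            results.extend(pending.result())
    return results

print(comb([1, 2, 3, 4]))
```

With two micro-batches of two requests each, the second GPU stage runs while the first micro-batch's tool calls execute on the CPU, roughly halving the serialized CPU time on the critical path.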
Forward citations
Cited by 4 Pith papers
- SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison: SPEC CPU2026 increases instruction volume and memory footprint while shifting pressure to instruction-cache bottlenecks; 4-5 workload subsets per group preserve 96.4-99.9% of full-suite behavior and show complementary...
- KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving: KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.
- LARA: Validation-Driven Agentic Supercomputer Workflows for Atomistic Modeling: LARA-HPC introduces a validation-first agentic system with dry-run verification and multi-phase refinement that improves robustness of AI-generated DFT workflows on HPC systems.