pith. machine review for the scientific record.

arxiv: 2511.00739 · v3 · submitted 2025-11-01 · 💻 cs.AI · cs.LG · cs.MA

Recognition: unknown

Towards Understanding, Analyzing, and Optimizing Agentic AI Execution: A CPU-Centric Perspective

Authors on Pith: no claims yet
classification 💻 cs.AI · cs.LG · cs.MA
keywords agentic · execution · heterogeneous · latency · workloads · bottlenecks · homogeneous · systems
read the original abstract

Agentic AI serving transforms monolithic LLM-based inference into autonomous problem-solvers that can plan, call tools, perform reasoning, and adapt on the fly. Due to diverse task execution needs, such serving relies heavily on heterogeneous CPU-GPU systems, with the majority of the external tools responsible for agentic capability either running on, or orchestrated by, the CPU. To develop a deeper understanding of the CPU's role, this paper characterizes and analyzes the system bottlenecks introduced by agentic AI workloads from a largely overlooked CPU-centric perspective. We first present a compile-time characterization of agentic AI execution and choose representative workloads that capture its algorithmic diversity. We then perform a runtime characterization of the representative workloads, analyzing end-to-end latency and throughput on two different hardware systems to isolate their respective architectural bottlenecks. Based on these insights, we finally present two scheduling optimizations, namely 1. CPU-Aware Overlapped Micro-Batching (COMB) and 2. Mixed Agentic Scheduling (MAS), for homogeneous and heterogeneous agentic workloads, respectively. Specifically, these methods improve concurrent CPU-GPU utilization while reducing skewed resource allocation under heterogeneous execution. Experimental evaluations on the two hardware systems demonstrate the efficacy of COMB, yielding up to 1.7x lower P50 latency in standalone homogeneous workload execution and up to 3.9x/1.8x lower service/total latency under homogeneous open-loop load. Additionally, under heterogeneous open-loop load, MAS reduces the total latency for the minority request type by up to 2.37x/2.49x at the P50/P90 percentiles.
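The core idea behind COMB, as described in the abstract, is to keep the CPU and GPU busy concurrently by pipelining micro-batches: while the GPU runs the LLM step for one micro-batch, the CPU executes tool calls for the previous one. The paper does not publish an implementation, so the sketch below is purely illustrative; `gpu_llm_step`, `cpu_tool_step`, and their sleep-based timings are hypothetical stand-ins, not the authors' code.

```python
from concurrent.futures import ThreadPoolExecutor
import time

# Hypothetical stand-ins for the two phases of one agentic step:
# a GPU-bound LLM decode and a CPU-bound tool call. The durations
# are illustrative only.
def gpu_llm_step(micro_batch):
    time.sleep(0.05)  # stands in for GPU inference on the micro-batch
    return [f"plan-{r}" for r in micro_batch]

def cpu_tool_step(plans):
    time.sleep(0.05)  # stands in for CPU-side tool execution
    return [f"result-{p}" for p in plans]

def serve_overlapped(requests, micro_batch_size=2):
    """COMB-style sketch: split the batch into micro-batches and
    pipeline them so that the CPU tool phase of micro-batch i
    overlaps the GPU phase of micro-batch i+1."""
    micro_batches = [requests[i:i + micro_batch_size]
                     for i in range(0, len(requests), micro_batch_size)]
    results = []
    with ThreadPoolExecutor(max_workers=1) as cpu_pool:
        pending = None  # in-flight CPU work for the previous micro-batch
        for mb in micro_batches:
            plans = gpu_llm_step(mb)  # GPU phase for this micro-batch
            if pending is not None:
                results.extend(pending.result())  # drain previous CPU phase
            pending = cpu_pool.submit(cpu_tool_step, plans)
        if pending is not None:
            results.extend(pending.result())
    return results
```

With two phases of equal cost, this pipelining hides most of the CPU tool time behind GPU execution, which is the intuition behind the reported P50 latency reductions; the real scheduler must additionally handle variable-length tool calls and open-loop arrivals.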

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read Pith papers without signing in.

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

    cs.AR 2026-05 unverdicted novelty 7.0

    SPEC CPU2026 increases instruction volume and memory footprint while shifting pressure to instruction-cache bottlenecks; 4-5 workload subsets per group preserve 96.4-99.9% of full-suite behavior and show complementary...

  2. SPEC CPU2026: Characterization, Representativeness, and Cross-Suite Comparison

    cs.AR 2026-05 unverdicted novelty 6.0

    SPEC CPU2026 raises instruction volume and memory demands while shifting pressure to instruction caches; 4-5 workload subsets per group preserve 96.4-99.9% of full-suite microarchitectural behavior and better approxim...

  3. KAIROS: Stateful, Context-Aware Power-Efficient Agentic Inference Serving

    cs.DC 2026-04 unverdicted novelty 6.0

    KAIROS reduces power by 27% on average (up to 39.8%) for agentic AI inference by using long-lived context to jointly manage GPU frequency, concurrency, and request routing across instances.

  4. LARA: Validation-Driven Agentic Supercomputer Workflows for Atomistic Modeling

    physics.comp-ph 2026-04 unverdicted novelty 4.0

    LARA-HPC introduces a validation-first agentic system with dry-run verification and multi-phase refinement that improves robustness of AI-generated DFT workflows on HPC systems.