Apiserve: Efficient api support for large-language model inferencing

Infercept: Efficient intercept support for augmented large language model inference · 2024 · arXiv 2402.01869

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

read on arXiv browse 6 citing papers

citation-role summary

background 1 baseline 1

citation-polarity summary

background 1 baseline 1

representative citing papers

A Policy-Driven Runtime Layer for Agentic LLM Serving

cs.AI · 2026-05-26 · unverdicted · novelty 7.0

Introduces a three-tier architecture with an agent runtime layer and four primitives for agent-aware policies in LLM serving, validated on KV caching via CacheSage showing 13-37pp hit-rate gains on five workloads.

Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI

cs.OS · 2026-05-30 · unverdicted · novelty 6.0

MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.

SGLang: Efficient Execution of Structured Language Model Programs

cs.AI · 2023-12-12 · conditional · novelty 6.0

SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.

Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving

cs.AR · 2026-06-01 · unverdicted · novelty 5.0

AsymCache combines Multi-Segment Attention, position-aware eviction, and adaptive chunking to cut TTFT by up to 2.03x and TPOT by up to 1.71x versus recent baselines in LLM serving.

Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion

cs.LG · 2026-04-27 · unverdicted · novelty 5.0

Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.

AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent

cs.LG · 2026-04-07 · unverdicted · novelty 5.0

AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-optimal accuracy on benchmarks.

citing papers explorer

Showing 6 of 6 citing papers.

A Policy-Driven Runtime Layer for Agentic LLM Serving cs.AI · 2026-05-26 · unverdicted · none · ref 1
Introduces a three-tier architecture with an agent runtime layer and four primitives for agent-aware policies in LLM serving, validated on KV caching via CacheSage showing 13-37pp hit-rate gains on five workloads.
Idleness is Relative: Exploiting Tool-Call Idle Windows for Offloading in Agentic Systems with MORI cs.OS · 2026-05-30 · unverdicted · none · ref 1
MORI improves throughput 20-71% and TTFT 18-43% over baselines by ranking programs on a continuous idleness spectrum and shifting the GPU-CPU boundary to match capacity in agentic LLM serving.
SGLang: Efficient Execution of Structured Language Model Programs cs.AI · 2023-12-12 · conditional · none · ref 1
SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.
Multi-Segment Attention: Enabling Efficient KV-Cache Management for Faster Large Language Model Serving cs.AR · 2026-06-01 · unverdicted · none · ref 4
AsymCache combines Multi-Segment Attention, position-aware eviction, and adaptive chunking to cut TTFT by up to 2.03x and TPOT by up to 1.71x versus recent baselines in LLM serving.
Diffusion Templates: A Unified Plugin Framework for Controllable Diffusion cs.LG · 2026-04-27 · unverdicted · none · ref 1
Diffusion Templates is a unified plugin framework that allows injecting various controllable capabilities into diffusion models through a standardized interface.
AgentOpt v0.1 Technical Report: Client-Side Optimization for LLM-Based Agent cs.LG · 2026-04-07 · unverdicted · none · ref 1
AgentOpt introduces a framework-agnostic package that uses algorithms like UCB-E to find cost-effective model assignments in multi-step LLM agent pipelines, cutting evaluation budgets by 62-76% while maintaining near-optimal accuracy on benchmarks.

Apiserve: Efficient api support for large-language model inferencing

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer