SGLang: Efficient Execution of Structured Language Model Programs

Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody Hao Yu · 2023 · cs.AI · arXiv 2312.07104

31 Pith papers cite this work. Polarity classification is still indexing.

abstract

Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang
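
The abstract names RadixAttention only at a high level. The underlying idea, reusing KV cache across generation calls that share a prompt prefix, can be illustrated with a toy prefix tree. This is a minimal sketch and not SGLang's implementation: `PrefixCache`, `match_prefix`, and `tokens_to_compute` are hypothetical names, token ids stand in for the cached KV tensors, a real radix tree would also merge single-child chains, and eviction is left out.

```python
from dataclasses import dataclass, field


@dataclass
class Node:
    # Maps a token id to the child node reached by appending that token.
    children: dict = field(default_factory=dict)


class PrefixCache:
    """Toy token-level prefix tree (hypothetical, for illustration only)."""

    def __init__(self):
        self.root = Node()

    def match_prefix(self, tokens):
        """Return how many leading tokens already have cached entries."""
        node, matched = self.root, 0
        for tok in tokens:
            child = node.children.get(tok)
            if child is None:
                break
            node, matched = child, matched + 1
        return matched

    def insert(self, tokens):
        """Record that cache entries now exist for this token sequence."""
        node = self.root
        for tok in tokens:
            node = node.children.setdefault(tok, Node())


def tokens_to_compute(cache, tokens):
    """Count tokens that still need a forward pass, then register the request."""
    reused = cache.match_prefix(tokens)
    cache.insert(tokens)
    return len(tokens) - reused


if __name__ == "__main__":
    cache = PrefixCache()
    shared_prompt = [1, 2, 3]
    print(tokens_to_compute(cache, shared_prompt + [4]))     # first call: nothing cached yet
    print(tokens_to_compute(cache, shared_prompt + [9, 9]))  # second call reuses the shared prefix
```

In this sketch the second request only pays for the tokens after the shared prefix, which is the effect the abstract attributes to KV cache reuse in multi-call programs such as few-shot prompting or multi-turn chat.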

representative citing papers

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?

cs.AI · 2026-05-07 · unverdicted · novelty 8.0

VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.

NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding

cs.DC · 2026-05-12 · unverdicted · novelty 7.0

NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.

Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents

cs.CL · 2026-05-11 · unverdicted · novelty 7.0

A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.

Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference

cs.DC · 2026-05-11 · unverdicted · novelty 7.0

EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.

Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation

cs.DC · 2026-05-08 · unverdicted · novelty 7.0

Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.

When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs

cs.PF · 2026-05-04 · unverdicted · novelty 7.0 · 2 refs

Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.

Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

cs.LG · 2026-04-17 · unverdicted · novelty 7.0

Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models

cs.AI · 2026-04-16 · unverdicted · novelty 7.0

TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.

Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs

cs.AR · 2026-04-15 · unverdicted · novelty 7.0

Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.

CodeComp: Structural KV Cache Compression for Agentic Coding

cs.CL · 2026-04-11 · unverdicted · novelty 7.0

CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context patch quality.

Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers

cs.LG · 2026-05-13 · unverdicted · novelty 6.0

Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.

KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

cs.AR · 2026-05-10 · unverdicted · novelty 6.0

KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading

cs.LG · 2026-05-07 · unverdicted · novelty 6.0

VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.

RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts

cs.LG · 2026-04-28 · unverdicted · novelty 6.0

RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.

When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis

cs.AI · 2026-04-17 · unverdicted · novelty 6.0

LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.

ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems

cs.OS · 2026-04-13 · unverdicted · novelty 6.0

ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.

Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale

cs.CL · 2026-04-13 · unverdicted · novelty 6.0

Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.

MEMENTO: Teaching LLMs to Manage Their Own Context

cs.AI · 2026-04-10 · unverdicted · novelty 6.0

MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.

Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization

cs.CL · 2026-03-30 · unverdicted · novelty 6.0

Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA.

Large Language Monkeys: Scaling Inference Compute with Repeated Sampling

cs.LG · 2024-07-31 · unverdicted · novelty 6.0

Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.

An Executable Benchmarking Suite for Tool-Using Agents

cs.SE · 2026-05-10 · unverdicted · novelty 5.0

The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.

How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

cs.LG · 2026-05-07 · unverdicted · novelty 5.0

Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.

Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use

cs.CR · 2026-05-06 · unverdicted · novelty 5.0

A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.

VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models

cs.CV · 2026-05-05 · unverdicted · novelty 5.0

Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.

citing papers explorer

Showing 31 of 31 citing papers.

VibeServe: Can AI Agents Build Bespoke LLM Serving Systems? cs.AI · 2026-05-07 · unverdicted · none · ref 84 · internal anchor
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual models, workloads, or hardware.
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding cs.DC · 2026-05-12 · unverdicted · none · ref 18 · internal anchor
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents cs.CL · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference cs.DC · 2026-05-11 · unverdicted · none · ref 34 · internal anchor
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a fixed-membership baseline in steady state.
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation cs.DC · 2026-05-08 · unverdicted · none · ref 41 · internal anchor
Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs cs.PF · 2026-05-04 · unverdicted · none · ref 14 · 2 links · internal anchor
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and throughput gains.
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving cs.LG · 2026-04-17 · unverdicted · none · ref 38 · internal anchor
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models cs.AI · 2026-04-16 · unverdicted · none · ref 19 · internal anchor
TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.
Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs cs.AR · 2026-04-15 · unverdicted · none · ref 26 · internal anchor
Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on AMD MI350 hardware.
CodeComp: Structural KV Cache Compression for Agentic Coding cs.CL · 2026-04-11 · unverdicted · none · ref 17 · internal anchor
CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context patch quality.
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers cs.LG · 2026-05-13 · unverdicted · none · ref 15 · internal anchor
Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving cs.AR · 2026-05-10 · unverdicted · none · ref 47 · internal anchor
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading cs.LG · 2026-05-07 · unverdicted · none · ref 59 · internal anchor
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts cs.LG · 2026-04-28 · unverdicted · none · ref 14 · internal anchor
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kernel and 1.30x end-to-end speedups with 0.93% mean regret after brief profiling.
When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis cs.AI · 2026-04-17 · unverdicted · none · ref 36 · internal anchor
LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems cs.OS · 2026-04-13 · unverdicted · none · ref 17 · internal anchor
ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale cs.CL · 2026-04-13 · unverdicted · none · ref 18 · internal anchor
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
MEMENTO: Teaching LLMs to Manage Their Own Context cs.AI · 2026-04-10 · unverdicted · none · ref 39 · internal anchor
MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization cs.CL · 2026-03-30 · unverdicted · none · ref 34 · internal anchor
Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to MetaX MACA.
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling cs.LG · 2024-07-31 · unverdicted · none · ref 69 · internal anchor
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
An Executable Benchmarking Suite for Tool-Using Agents cs.SE · 2026-05-10 · unverdicted · none · ref 18 · internal anchor
The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment cs.LG · 2026-05-07 · unverdicted · none · ref 13 · internal anchor
Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.
Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use cs.CR · 2026-05-06 · unverdicted · none · ref 45 · internal anchor
A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.
VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models cs.CV · 2026-05-05 · unverdicted · none · ref 30 · internal anchor
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data cs.DC · 2026-05-01 · unverdicted · none · ref 53 · internal anchor
SURGE achieves fixed-batch throughput for GPU embedding generation on 800M texts across 40k partitions using 12.6x less memory, 68x faster time-to-first-output, and fault tolerance via a streaming two-threshold policy with an analytical cost model accurate to 2%.
EdgeFM: Efficient Edge Inference for Vision-Language Models cs.CV · 2026-04-30 · unverdicted · none · ref 14 · internal anchor
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to-end deployment on Horizon Journey hardware.
LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation cs.CL · 2026-04-22 · unverdicted · none · ref 15 · internal anchor
Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention cs.DC · 2026-04-18 · unverdicted · none · ref 49 · internal anchor
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, plus up to 1.85x prefill speedup and 1.37x/1.77x speedups with magnitude pruning and
enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways cs.CR · 2026-04-18 · unverdicted · none · ref 17 · 2 links · internal anchor
enclawed is a sector-neutral hardening framework for AI gateways providing signed modules, audit trails, peer attestation, and a 356-case test suite for regulated deployments.
Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study cs.AI · 2026-04-28 · unverdicted · none · ref 17 · internal anchor
A deployed modular inference architecture for compound AI systems cut tail latency over 50%, boosted throughput up to 3.9x, and reduced costs 30-40% while handling multi-model agent workloads.
A Survey on Efficient Inference for Large Language Models cs.CL · 2024-04-22 · accept · none · ref 52 · internal anchor
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
