SGLang: Efficient Execution of Structured Language Model Programs
Pith reviewed 2026-05-12 08:14 UTC · model grok-4.3
The pith
SGLang speeds up execution of structured language model programs by reusing computation across calls and accelerating structured decoding.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat.
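To make the frontend half of that claim concrete, the sketch below shows roughly what a structured program using the generation and parallelism primitives looks like. It is a hedged illustration based on the primitives the paper and the public repository name (gen for generation, fork for parallel branches); exact function names, signatures, and backend setup may differ across versions.

```python
# Hedged sketch of an SGLang-style program using the generation and parallelism
# primitives the paper names (gen, fork). Names, signatures, and backend setup
# follow the public repository but may differ across versions.
import sglang as sgl

# Assumes a locally running SGLang server; the endpoint is illustrative.
sgl.set_default_backend(sgl.RuntimeEndpoint("http://localhost:30000"))

@sgl.function
def two_tips(s, topic):
    s += "Give two short tips about " + topic + ".\n"
    forks = s.fork(2)  # parallel branches the runtime can batch together
    for i, f in enumerate(forks):
        f += f"Tip {i + 1}:" + sgl.gen(f"tip_{i}", max_tokens=64, stop="\n")
    # Join the branches back into the main prompt state.
    s += "Tip 1:" + forks[0]["tip_0"] + "\nTip 2:" + forks[1]["tip_1"] + "\n"
    s += "Summary:" + sgl.gen("summary", max_tokens=64)

state = two_tips.run(topic="sleep")
print(state["summary"])
```

The point of the example is that the runtime, not the user, decides how the forked branches are batched and how their shared prompt prefix is cached.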
What carries the argument
RadixAttention for KV cache reuse across related prompts and compressed finite state machines for efficient structured output decoding.
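A toy illustration of the KV cache reuse idea: the sketch below builds a plain token-level prefix tree (standing in for the paper's radix tree, without eviction or paging) and looks up the longest cached prefix of an incoming request. The class and handle names are invented for illustration; this is not the paper's implementation.

```python
# Toy sketch of prefix matching over token IDs, illustrating how a runtime
# could find the longest cached prefix of a new request and reuse its KV cache
# instead of recomputing it. Not the paper's RadixAttention implementation.
from typing import Dict, List, Tuple


class RadixNode:
    def __init__(self) -> None:
        self.children: Dict[int, "RadixNode"] = {}  # next token id -> child
        self.kv_handle: object = None  # placeholder for cached KV tensors


class PrefixCache:
    def __init__(self) -> None:
        self.root = RadixNode()

    def insert(self, tokens: List[int], kv_handle: object) -> None:
        """Record that KV cache exists for this token prefix."""
        node = self.root
        for t in tokens:
            node = node.children.setdefault(t, RadixNode())
        node.kv_handle = kv_handle

    def match_prefix(self, tokens: List[int]) -> Tuple[int, object]:
        """Return (length of longest cached prefix, its KV handle)."""
        node, best_len, best_handle = self.root, 0, None
        for i, t in enumerate(tokens):
            if t not in node.children:
                break
            node = node.children[t]
            if node.kv_handle is not None:
                best_len, best_handle = i + 1, node.kv_handle
        return best_len, best_handle


cache = PrefixCache()
cache.insert([1, 2, 3, 4], kv_handle="kv:shared system + few-shot prompt")
hit_len, handle = cache.match_prefix([1, 2, 3, 4, 9, 9])
print(hit_len, handle)  # 4 -> only the last two tokens need prefill
```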
Load-bearing premise
The optimizations deliver consistent throughput gains across diverse models and workloads without introducing accuracy loss or excessive overhead.
What would settle it
Running the same benchmarks on a new model, or on a workload with irregular control flow, and finding no throughput improvement or degraded outputs relative to baseline systems.
read the original abstract
Large language models (LLMs) are increasingly used for complex tasks that require multiple generation calls, advanced prompting techniques, control flow, and structured inputs/outputs. However, efficient systems are lacking for programming and executing these applications. We introduce SGLang, a system for efficient execution of complex language model programs. SGLang consists of a frontend language and a runtime. The frontend simplifies programming with primitives for generation and parallelism control. The runtime accelerates execution with novel optimizations like RadixAttention for KV cache reuse and compressed finite state machines for faster structured output decoding. Experiments show that SGLang achieves up to 6.4x higher throughput compared to state-of-the-art inference systems on various large language and multi-modal models on tasks including agent control, logical reasoning, few-shot learning benchmarks, JSON decoding, retrieval-augmented generation pipelines, and multi-turn chat. The code is publicly available at https://github.com/sgl-project/sglang
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SGLang, a system for efficient execution of structured language model programs consisting of a frontend language with primitives for generation and parallelism control, and a runtime that incorporates novel optimizations including RadixAttention for KV cache reuse and compressed finite state machines for structured output decoding. Experiments across various LLMs and multi-modal models on tasks such as agent control, logical reasoning, few-shot learning, JSON decoding, RAG pipelines, and multi-turn chat report up to 6.4x higher throughput compared to state-of-the-art inference systems, with the code released publicly.
Significance. If the reported throughput gains hold under scrutiny, this work is significant because it directly addresses the growing need for efficient systems to handle complex, multi-step LLM programs involving control flow and structured I/O, areas where current inference engines fall short. The concrete optimizations and open-source implementation provide a practical foundation for improving performance in agentic and structured generation workloads, with potential to influence future inference system designs.
major comments (2)
- [Experiments] Experiments section: the central throughput claims (up to 6.4x) are presented without reported error bars, number of repeated runs, or statistical tests, which weakens the ability to assess whether the gains from RadixAttention and compressed FSMs are robust across hardware and workload variations.
- [Runtime] Runtime section on compressed FSMs: while the paper states that outputs match reference decoders, the description does not provide sufficient algorithmic detail (e.g., compression algorithm or state reduction rules) to verify that the optimization preserves correctness for all edge cases in structured generation tasks.
minor comments (3)
- [Abstract] The abstract and introduction would benefit from a clearer distinction between the contributions of the frontend language versus the runtime optimizations.
- [Figures] Figure captions for throughput plots should explicitly list the exact models, batch sizes, and hardware used in each comparison to improve reproducibility.
- [Related Work] A few citations to related work on KV cache management (e.g., vLLM's PagedAttention) appear to be missing or under-cited in the related work section.
Simulated Author's Rebuttal
We thank the referee for their positive summary, significance assessment, and recommendation for minor revision. The feedback on experimental reporting and algorithmic details is constructive, and we address both major comments point by point below.
read point-by-point responses
-
Referee: Experiments section: the central throughput claims (up to 6.4x) are presented without reported error bars, number of repeated runs, or statistical tests, which weakens the ability to assess whether the gains from RadixAttention and compressed FSMs are robust across hardware and workload variations.
Authors: We agree that the absence of error bars, run counts, and statistical details limits assessment of robustness. In the revised manuscript, we will add error bars computed from five independent runs per configuration, report the mean and standard deviation, and include a short paragraph discussing observed variability across hardware and workloads. This will be incorporated into the Experiments section and relevant figures. revision: yes
-
Referee: Runtime section on compressed FSMs: while the paper states that outputs match reference decoders, the description does not provide sufficient algorithmic detail (e.g., compression algorithm or state reduction rules) to verify that the optimization preserves correctness for all edge cases in structured generation tasks.
Authors: We acknowledge that the current description of the compressed finite state machine optimization lacks sufficient algorithmic detail. We will expand the Runtime section with a precise description of the compression algorithm, the state reduction rules, and a proof sketch showing equivalence to the uncompressed FSM. We will also add pseudocode and a discussion of edge-case handling (e.g., nested structures, optional fields, and regex constraints) to allow verification of correctness. revision: yes
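To give a flavor of the kind of detail being requested, the toy sketch below illustrates the general idea of compressing runs of single-choice transitions in a constrained-decoding finite state machine, so that a whole forced span is emitted in one step rather than one model call per symbol. The transition table and function are invented for illustration and make no claim about the paper's actual compression algorithm or state-reduction rules.

```python
# Toy sketch: in a constrained-decoding FSM, a chain of states that each allow
# exactly one next symbol is "forced" -- the decoder can emit the whole run in
# one step. Illustration of the idea only, not the paper's algorithm.
from typing import Dict

# FSM for the JSON fragment '{"name": "' followed by free text; each state maps
# a legal next character to the next state, and -1 means free generation.
transitions: Dict[int, Dict[str, int]] = {
    0: {"{": 1},
    1: {'"': 2},
    2: {"n": 3},
    3: {"a": 4},
    4: {"m": 5},
    5: {"e": 6},
    6: {'"': 7},
    7: {":": 8},
    8: {" ": 9},
    9: {'"': -1},
}


def compress_forced_run(state: int) -> str:
    """Follow single-choice transitions and return the forced string."""
    forced = []
    while state != -1 and len(transitions.get(state, {})) == 1:
        ch, nxt = next(iter(transitions[state].items()))
        forced.append(ch)
        state = nxt
    return "".join(forced)


print(compress_forced_run(0))  # '{"name": "' -- emitted in one decoding step
```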
Circularity Check
No significant circularity
full rationale
The paper describes a concrete runtime system (SGLang) with frontend primitives and two optimizations (RadixAttention for KV-cache reuse, compressed FSMs for structured decoding). All performance claims are empirical measurements of throughput on external workloads against external baselines (vLLM and others). No equations, fitted parameters, predictions, or first-principles derivations appear; the reported speedups are direct outcomes of the implemented code and benchmark runs, not reductions to self-referential inputs.
Axiom & Free-Parameter Ledger
invented entities (2)
-
RadixAttention
no independent evidence
-
compressed finite state machines
no independent evidence
Forward citations
Cited by 34 Pith papers
-
VibeServe: Can AI Agents Build Bespoke LLM Serving Systems?
VibeServe demonstrates that AI agents can synthesize bespoke LLM serving systems end-to-end, remaining competitive with vLLM in standard settings while outperforming it in six non-standard scenarios involving unusual ...
-
Knowledge Packs: Zero-Token Knowledge Delivery via KV Cache Injection
Knowledge Packs deliver knowledge via pre-computed KV caches with exact equivalence under causal masking, achieving zero divergences on tested questions and enabling value-based steering without training.
-
NCCLZ: Compression-Enabled GPU Collectives with Decoupled Quantization and Entropy Coding
NCCLZ decouples quantization and entropy coding across NCCL stack layers to enable overlapped compression, delivering up to 9.65x speedup over plain NCCL on scientific and training workloads.
-
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
-
Surviving Partial Rank Failures in Wide Expert-Parallel MoE Inference
EEP makes wide expert-parallel MoE serving survive single-rank failures with an 11s recovery pause, 8s reintegration pause, and throughput restored to 95% of pre-fault level within 52s while staying within 4.4% of a f...
-
Dooly: Configuration-Agnostic, Redundancy-Aware Profiling for LLM Inference Simulation
Dooly reduces LLM inference profiling costs by 56.4% via configuration-agnostic taint-based labeling and selective database reuse, delivering simulation accuracy within 5% MAPE for TTFT and 8% for TPOT across 12 models.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLMs function as heterogeneous, time-varying services rather than uniform model artifacts, with concentrated demand, decoupled supply and adoption, and measurable gains from task-aware routing.
-
When Is the Same Model Not the Same Service? A Measurement Study of Hosted Open-Weight LLM APIs
Hosted open-weight LLM APIs function as time-varying heterogeneous services rather than fixed model artifacts, with demand concentrated, supply-use mismatches, and task-specific routing yielding major cost and through...
-
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
-
TrigReason: Trigger-Based Collaboration between Small and Large Reasoning Models
TrigReason matches large reasoning model accuracy on math and science benchmarks by delegating most steps to small models and intervening selectively on three triggers, cutting latency by 43.9% and cost by 73.3%.
-
Fleet: Hierarchical Task-based Abstraction for Megakernels on Multi-Die GPUs
Fleet adds a Chiplet-task level to GPU task models, enabling per-chiplet scheduling and cooperative cache reuse in persistent megakernels, yielding 1.3-1.5x lower LLM decode latency and up to 37% less HBM traffic on A...
-
CodeComp: Structural KV Cache Compression for Agentic Coding
CodeComp uses Joern-extracted Code Property Graph priors for training-free structural KV cache compression, outperforming attention-only baselines on bug localization and code generation while matching full-context pa...
-
Attention Once Is All You Need: Efficient Streaming Inference with Stateful Transformers
Stateful sessions with incremental KV cache and flash queries allow O(|q|) latency in streaming transformer inference, delivering up to 5.9x speedup over conventional engines while preserving full attention.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
VisMMOE: Exploiting Visual-Expert Affinity for Efficient Visual-Language MoE Offloading
VisMMoE exploits visual-expert affinity via token pruning to achieve up to 2.68x faster VL-MoE inference on memory-constrained hardware while keeping accuracy competitive.
-
RaMP: Runtime-Aware Megakernel Polymorphism for Mixture-of-Experts
RaMP uses a hardware-derived performance region analysis and a four-parameter wave cost model to select optimal polymorphic kernel configurations for MoE inference from runtime expert histograms, delivering 1.22x kern...
-
When Agents Go Quiet: Output Generation Capacity and Format-Cost Separation for LLM Document Synthesis
LLM agents avoid output stalling and reduce generation tokens by 48-72% via deferred template rendering guided by Output Generation Capacity and a Format-Cost Separation Theorem.
-
ProbeLogits: Kernel-Level LLM Inference Primitives for AI-Native Operating Systems
ProbeLogits performs single-pass logit reading inside the kernel to classify LLM agent actions as safe or dangerous, reaching 97-99% block rates on HarmBench and F1 parity or better than Llama Guard 3 at 2.5x lower latency.
-
Relax: An Asynchronous Reinforcement Learning Engine for Omni-Modal Post-Training at Scale
Relax is a new RL training engine with omni-native design and async execution that delivers up to 2x speedups over baselines like veRL while converging to equivalent reward levels on Qwen3 models.
-
MEMENTO: Teaching LLMs to Manage Their Own Context
MEMENTO trains LLMs to segment reasoning into blocks, generate mementos as dense summaries, and reason forward using only mementos and KV states, cutting peak KV cache by ~2.5x while preserving benchmark accuracy.
-
Kernel-Smith: A Unified Recipe for Evolutionary Kernel Optimization
Kernel-Smith combines evolutionary search with RL post-training to generate optimized GPU kernels, achieving SOTA speedups on KernelBench that beat Gemini-3.0-pro and Claude-4.6-opus on NVIDIA Triton and generalize to...
-
Large Language Monkeys: Scaling Inference Compute with Repeated Sampling
Repeated sampling scales problem coverage log-linearly with sample count, improving SWE-bench Lite performance from 15.9% to 56% using 250 samples.
-
An Executable Benchmarking Suite for Tool-Using Agents
The paper delivers a unified executable benchmarking suite for tool-using agents that enforces a shared evidence-admission contract across web, code, and micro-task environments.
-
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.
-
Securing the Agent: Vendor-Neutral, Multitenant Enterprise Retrieval and Tool Use
A server-side architecture with policy-aware ingestion and ABAC-based retrieval gating prevents cross-tenant data leakage in multitenant enterprise RAG and agent systems.
-
VLMaxxing through FrameMogging Training-Free Anti-Recomputation for Video Vision-Language Models
Training-free adaptive reuse of stable visual state in video VLMs reduces follow-up latency by 15-36x on Qwen2.5-VL while preserving correctness on VideoMME, with smaller first-query speedups via pruning.
-
SURGE: SuperBatch Unified Resource-efficient GPU Encoding for Heterogeneous Partitioned Data
SURGE achieves fixed-batch throughput for GPU embedding generation on 800M texts across 40k partitions using 12.6x less memory, 68x faster time-to-first-output, and fault tolerance via a streaming two-threshold policy...
-
EdgeFM: Efficient Edge Inference for Vision-Language Models
EdgeFM is an agent-driven framework that strips non-essential features from VLMs and packages reusable optimized kernels, achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin while enabling first end-to...
-
LLM StructCore: Schema-Guided Reasoning Condensation and Deterministic Compilation
Two-stage Schema-Guided Reasoning with LLM condensation and deterministic compilation achieves macro-F1 of 0.63 on dyspnea CRF filling task and is language-agnostic.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
-
enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways
enclawed is a sector-neutral hardening framework for AI gateways providing signed modules, audit trails, peer attestation, and a 356-case test suite for regulated deployments.
-
enclawed: A Configurable, Sector-Neutral Hardening Framework for Single-User AI Assistant Gateways
enclawed is a two-flavor hardening framework for OpenClaw AI gateways that supplies attestable trust, strict allowlists, FIPS crypto assertion, DLP signals, and a 204-case test suite for regulated-industry deployments.
-
Scalable Inference Architectures for Compound AI Systems: A Production Deployment Study
A deployed modular inference architecture for compound AI systems cut tail latency over 50%, boosted throughput up to 3.9x, and reduced costs 30-40% while handling multi-model agent workloads.
-
A Survey on Efficient Inference for Large Language Models
The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.
Reference graph
Works this paper leans on
-
[1]
Apiserve: Efficient api support for large-language model inferencing
Reyna Abhyankar, Zijian He, Vikranth Srivatsa, Hao Zhang, and Yiying Zhang. Apiserve: Efficient api support for large-language model inferencing. arXiv preprint arXiv:2402.01869, 2024
-
[2]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022
-
[3]
Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale
Reza Yazdani Aminabadi, Samyam Rajbhandari, Ammar Ahmad Awan, Cheng Li, Du Li, Elton Zheng, Olatunji Ruwase, Shaden Smith, Minjia Zhang, Jeff Rasley, et al. Deepspeed-inference: enabling efficient inference of transformer models at unprecedented scale. In SC22: International Conference for High Performance Computing, Networking, Storage and Analysis, pag...
-
[4]
Prompting is programming: A query language for large language models
Luca Beurer-Kellner, Marc Fischer, and Martin Vechev. Prompting is programming: A query language for large language models. Proceedings of the ACM on Programming Languages, 7(PLDI):1946–1969, 2023
-
[5]
Language models are few-shot learners
Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020
-
[6]
Sparks of Artificial General Intelligence: Early experiments with GPT-4
Sébastien Bubeck, Varun Chandrasekaran, Ronen Eldan, Johannes Gehrke, Eric Horvitz, Ece Kamar, Peter Lee, Yin Tat Lee, Yuanzhi Li, Scott Lundberg, et al. Sparks of artificial general intelligence: Early experiments with gpt-4. arXiv preprint arXiv:2303.12712, 2023
-
[7]
Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality
Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yonghao Zhuang, Joseph E. Gonzalez, Ion Stoica, and Eric P. Xing. Vicuna: An open-source chatbot impressing gpt-4 with 90%* chatgpt quality, March 2023
-
[8]
Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference
Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024
-
[9]
PaLM: Scaling Language Modeling with Pathways
Aakanksha Chowdhery, Sharan Narang, Jacob Devlin, Maarten Bosma, Gaurav Mishra, Adam Roberts, Paul Barham, Hyung Won Chung, Charles Sutton, Sebastian Gehrmann, et al. Palm: Scaling language modeling with pathways. arXiv preprint arXiv:2204.02311, 2022
-
[10]
Flashattention: Fast and memory-efficient exact attention with io-awareness
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022
-
[11]
Model tells you what to discard: Adaptive kv cache compression for llms
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. In The Twelfth International Conference on Learning Representations, 2023
-
[12]
Prompt cache: Modular attention reuse for low-latency inference
In Gim, Guojun Chen, Seung-seob Lee, Nikhil Sarda, Anurag Khandelwal, and Lin Zhong. Prompt cache: Modular attention reuse for low-latency inference. arXiv preprint arXiv:2311.04934, 2023
-
[13]
A guidance language for controlling large language models
Guidance AI. A guidance language for controlling large language models. https://github.com/guidance-ai/guidance. Accessed: 2023-11
-
[14]
Measuring massive multitask language understanding
Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020
-
[15]
Kvquant: Towards 10 million context length llm inference with kv cache quantization
Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024
-
[16]
Text generation inference
Hugging Face. Text generation inference. https://github.com/huggingface/text-generation-inference. Accessed: 2023-11
-
[17]
Mixtral of experts
Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024
-
[18]
Hydragen: High-throughput llm inference with shared prefixes
Jordan Juravsky, Bradley Brown, Ryan Ehrlich, Daniel Y Fu, Christopher Ré, and Azalia Mirhoseini. Hydragen: High-throughput llm inference with shared prefixes. arXiv preprint arXiv:2402.05099, 2024
-
[19]
Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm
Hao Kang, Qingru Zhang, Souvik Kundu, Geonhwa Jeong, Zaoxing Liu, Tushar Krishna, and Tuo Zhao. Gear: An efficient kv cache compression recipe for near-lossless generative inference of llm. arXiv preprint arXiv:2403.05527, 2024
-
[20]
DSPy: Compiling Declarative Language Model Calls into Self-Improving Pipelines
Omar Khattab, Arnav Singhvi, Paridhi Maheshwari, Zhiyuan Zhang, Keshav Santhanam, Sri Vardhamanan, Saiful Haq, Ashutosh Sharma, Thomas T. Joshi, Hanna Moazam, Heather Miller, Matei Zaharia, and Christopher Potts. Dspy: Compiling declarative language model calls into self-improving pipelines. arXiv preprint arXiv:2310.03714, 2023
-
[21]
An llm compiler for parallel function calling
Sehoon Kim, Suhong Moon, Ryan Tabrizi, Nicholas Lee, Michael W Mahoney, Kurt Keutzer, and Amir Gholami. An llm compiler for parallel function calling. arXiv preprint arXiv:2312.04511, 2023
-
[22]
Validating large language models with relm
Michael Kuchnik, Virginia Smith, and George Amvrosiadis. Validating large language models with relm. Proceedings of Machine Learning and Systems, 5, 2023
-
[23]
Efficient memory management for large language model serving with pagedattention
Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023
- [24]
-
[25]
Competition-level code generation with alphacode
Yujia Li, David Choi, Junyoung Chung, Nate Kushman, Julian Schrittwieser, Rémi Leblond, Tom Eccles, James Keeling, Felix Gimeno, Agustin Dal Lago, et al. Competition-level code generation with alphacode. Science, 378(6624):1092–1097, 2022
-
[26]
AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Xingyu Dang, and Song Han. Awq: Activation-aware weight quantization for llm compression and acceleration. arXiv preprint arXiv:2306.00978, 2023
-
[27]
Improved Baselines with Visual Instruction Tuning
Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. arXiv preprint arXiv:2310.03744, 2023
-
[28]
Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
Haotian Liu, Chunyuan Li, Yuheng Li, Bo Li, Yuanhan Zhang, Sheng Shen, and Yong Jae Lee. Llava-next: Improved reasoning, ocr, and world knowledge, January 2024
-
[29]
Optimizing llm queries in relational workloads
Shu Liu, Asim Biswal, Audrey Cheng, Xiangxi Mo, Shiyi Cao, Joseph E Gonzalez, Ion Stoica, and Matei Zaharia. Optimizing llm queries in relational workloads. arXiv preprint arXiv:2403.05821, 2024
-
[30]
Prompting Frameworks for Large Language Models: A Survey
Xiaoxia Liu, Jingyi Wang, Jun Sun, Xiaohan Yuan, Guoliang Dong, Peng Di, Wenhai Wang, and Dongxia Wang. Prompting frameworks for large language models: A survey. arXiv preprint arXiv:2311.12785, 2023
-
[31]
Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024
-
[32]
KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache
Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024
-
[33]
Skeleton-of-thought: Prompting LLMs for efficient parallel generation
Xuefei Ning, Zinan Lin, Zixuan Zhou, Zifu Wang, Huazhong Yang, and Yu Wang. Skeleton-of-thought: Prompting LLMs for efficient parallel generation. In The Twelfth International Conference on Learning Representations, 2024
-
[34]
TensorRT-LLM
NVIDIA. TensorRT-LLM. https://github.com/NVIDIA/TensorRT-LLM. Accessed: 2023-11
- [35]
-
[36]
Generative agents: Interactive simulacra of human behavior
Joon Sung Park, Joseph C. O’Brien, Carrie J. Cai, Meredith Ringel Morris, Percy Liang, and Michael S. Bernstein. Generative agents: Interactive simulacra of human behavior. In the 36th Annual ACM Symposium on User Interface Software and Technology (UIST ’23), New York, NY, USA, 2023. Association for Computing Machinery
-
[37]
Pytorch: An imperative style, high-performance deep learning library
Adam Paszke, Sam Gross, Francisco Massa, Adam Lerer, James Bradbury, Gregory Chanan, Trevor Killeen, Zeming Lin, Natalia Gimelshein, Luca Antiga, et al. Pytorch: An imperative style, high-performance deep learning library. Advances in neural information processing systems, 32, 2019
-
[38]
Gorilla: Large Language Model Connected with Massive APIs
Shishir G. Patil, Tianjun Zhang, Xin Wang, and Joseph E. Gonzalez. Gorilla: Large language model connected with massive apis. arXiv preprint arXiv:2305.15334, 2023
-
[39]
Efficiently scaling transformer inference
Reiner Pope, Sholto Douglas, Aakanksha Chowdhery, Jacob Devlin, James Bradbury, Jonathan Heek, Kefan Xiao, Shivani Agrawal, and Jeff Dean. Efficiently scaling transformer inference. Proceedings of Machine Learning and Systems, 5, 2023
-
[40]
Branch-solve-merge improves large language model evaluation and generation
Swarnadeep Saha, Omer Levy, Asli Celikyilmaz, Mohit Bansal, Jason Weston, and Xian Li. Branch-solve-merge improves large language model evaluation and generation. arXiv preprint arXiv:2310.15123, 2023
-
[41]
Toolformer: Language Models Can Teach Themselves to Use Tools
Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools. arXiv preprint arXiv:2302.04761, 2023
-
[42]
Fairness in serving large language models
Ying Sheng, Shiyi Cao, Dacheng Li, Banghua Zhu, Zhuohan Li, Danyang Zhuo, Joseph E Gonzalez, and Ion Stoica. Fairness in serving large language models. arXiv preprint arXiv:2401.00588, 2023
-
[43]
Flexgen: high-throughput generative inference of large language models with a single gpu
Ying Sheng, Lianmin Zheng, Binhang Yuan, Zhuohan Li, Max Ryabinin, Beidi Chen, Percy Liang, Christopher Ré, Ion Stoica, and Ce Zhang. Flexgen: high-throughput generative inference of large language models with a single gpu. In International Conference on Machine Learning, pages 31094–31116. PMLR, 2023
-
[44]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053, 2019
-
[45]
Preble: Efficient distributed prompt scheduling for llm serving
Vikranth Srivatsa, Zijian He, Reyna Abhyankar, Dongming Li, and Yiying Zhang. Preble: Efficient distributed prompt scheduling for llm serving. 2024
-
[46]
Cognitive architectures for language agents
Theodore Sumers, Shunyu Yao, Karthik Narasimhan, and Thomas Griffiths. Cognitive architectures for language agents. Transactions on Machine Learning Research, 2023
-
[47]
Gemini: A Family of Highly Capable Multimodal Models
Gemini Team, Rohan Anil, Sebastian Borgeaud, Yonghui Wu, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023
-
[48]
Triton: an intermediate language and compiler for tiled neural network computations
Philippe Tillet, Hsiang-Tsung Kung, and David Cox. Triton: an intermediate language and compiler for tiled neural network computations. In Proceedings of the 3rd ACM SIGPLAN International Workshop on Machine Learning and Programming Languages, pages 10–19, 2019
-
[49]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023
-
[50]
Fast, high-fidelity llm decoding with regex constraints
Vivien Tran-Thien. Fast, high-fidelity llm decoding with regex constraints, 2024
-
[51]
Attention is all you need
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in neural information processing systems, 30, 2017
-
[52]
Voyager: An Open-Ended Embodied Agent with Large Language Models
Guanzhi Wang, Yuqi Xie, Yunfan Jiang, Ajay Mandlekar, Chaowei Xiao, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Voyager: An open-ended embodied agent with large language models. arXiv preprint arXiv:2305.16291, 2023
-
[53]
Self-consistency improves chain of thought reasoning in language models
Xuezhi Wang, Jason Wei, Dale Schuurmans, Quoc V Le, Ed H Chi, Sharan Narang, Aakanksha Chowdhery, and Denny Zhou. Self-consistency improves chain of thought reasoning in language models. In The Eleventh International Conference on Learning Representations, 2022
-
[54]
Efficient guided generation for large language models
Brandon T Willard and Rémi Louf. Efficient guided generation for large language models. arXiv preprint arXiv:2307.09702, 2023
-
[55]
AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation
Qingyun Wu, Gagan Bansal, Jieyu Zhang, Yiran Wu, Shaokun Zhang, Erkang Zhu, Beibin Li, Li Jiang, Xiaoyun Zhang, and Chi Wang. Autogen: Enabling next-gen llm applications via multi-agent conversation framework. arXiv preprint arXiv:2308.08155, 2023
-
[56]
Tree of Thoughts: Deliberate Problem Solving with Large Language Models
Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Thomas L Griffiths, Yuan Cao, and Karthik Narasimhan. Tree of thoughts: Deliberate problem solving with large language models. arXiv preprint arXiv:2305.10601, 2023
-
[57]
React: Synergizing reasoning and acting in language models
Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik R Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models. In The Eleventh International Conference on Learning Representations, 2022
-
[58]
Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition
Lu Ye, Ze Tao, Yong Huang, and Yang Li. Chunkattention: Efficient self-attention with prefix-aware kv cache and two-phase partition. arXiv preprint arXiv:2402.15220, 2024
-
[59]
Accelerating self-attentions for llm serving with flashinfer, February 2024
Zihao Ye, Lequn Chen, Ruihang Lai, Yilong Zhao, Size Zheng, Junru Shao, Bohan Hou, Hongyi Jin, Yifei Zuo, Liangsheng Yin, Tianqi Chen, and Luis Ceze. Accelerating self-attentions for llm serving with flashinfer, February 2024
-
[60]
Orca: A distributed serving system for Transformer-Based generative models
Gyeong-In Yu, Joo Seong Jeong, Geon-Woo Kim, Soojeong Kim, and Byung-Gon Chun. Orca: A distributed serving system for Transformer-Based generative models. In 16th USENIX Symposium on Operating Systems Design and Implementation (OSDI 22), pages 521–538, 2022
-
[61]
Hellaswag: Can a machine really finish your sentence?
Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019
-
[62]
Llava-next: A strong zero-shot video understanding model
Yuanhan Zhang, Bo Li, Haotian Liu, Yong Jae Lee, Liangke Gui, Di Fu, Jiashi Feng, Ziwei Liu, and Chunyuan Li. Llava-next: A strong zero-shot video understanding model, April 2024