pith. machine review for the scientific record.

arxiv: 2602.06932 · v3 · submitted 2026-02-06 · 💻 cs.LG

Recognition: no theorem link

When RL Meets Adaptive Speculative Training: A Unified Training-Serving System


Pith reviewed 2026-05-16 06:30 UTC · model grok-4.3

classification 💻 cs.LG
keywords speculative decoding · reinforcement learning · LLM inference · adaptive systems · training-serving integration · day-zero deployment

The pith

Aurora unifies speculator training and serving for speculative decoding using asynchronous reinforcement learning from live traces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Speculative decoding accelerates large language model inference, but a speculator trained offline takes time to prepare and goes stale when the target model or the traffic changes. Aurora addresses this by training the speculator continuously during serving, through an RL process that treats accepted tokens as positive signal and rejected proposals as negative signal. The system pairs an inference server with a training server so that updated speculators can be hot-swapped without interrupting service. This allows day-0 deployment on newly released models and fast adaptation to traffic shifts; the abstract reports a 1.5x day-0 speedup and a further 1.25x over a well-trained but static speculator. A sympathetic reader would care because it turns speculative decoding into a self-improving service rather than a static optimization.
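The serve, train, and hot-swap cycle described above can be pictured with a minimal sketch. Everything here is hypothetical and is not taken from the paper: the names `SpeculatorSlot`, `serve_and_train`, `train_step`, and `update_interval` are illustrative, the weights are a stand-in value, and the loop is a single-threaded simulation of what the paper runs as two concurrent servers.

```python
import threading
from collections import deque


class SpeculatorSlot:
    """Holds the speculator used for serving; reads and swaps are atomic."""

    def __init__(self, weights):
        self._lock = threading.Lock()
        self._weights = weights
        self.version = 0

    def get(self):
        # The serving side reads a consistent (weights, version) pair.
        with self._lock:
            return self._weights, self.version

    def swap(self, new_weights):
        # The training side publishes new weights without stopping serving.
        with self._lock:
            self._weights = new_weights
            self.version += 1
            return self.version


def serve_and_train(requests, slot, train_step, update_interval):
    """Simulate serving requests while refreshing the speculator periodically.

    ASSUMPTION: a real deployment runs serving and training concurrently;
    this loop interleaves them only to keep the sketch deterministic.
    """
    buffer = deque()      # serving traces waiting to be trained on
    versions_seen = []    # speculator version used for each request
    for i, request in enumerate(requests, start=1):
        weights, version = slot.get()
        versions_seen.append(version)
        buffer.append((request, version))
        if i % update_interval == 0:
            new_weights = train_step(weights, list(buffer))
            buffer.clear()
            slot.swap(new_weights)
    return versions_seen
```

With `update_interval=2` and six requests, each pair of requests is served by one speculator version before the next version is swapped in. A larger interval means fewer swaps and staler weights, which is the trade-off the paper's Figure 5 studies.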

Core claim

By reframing speculator learning as asynchronous RL where accepted tokens give positive feedback and rejected proposals give implicit negative feedback, Aurora enables a unified training-serving system that supports day-0 deployment and continuous adaptation to user traffic shifts.

What carries the argument

Asynchronous reinforcement learning loop that learns the speculator directly from live inference traces of accepted and rejected tokens.
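As a rough illustration of how one verification step yields both kinds of signal, the sketch below splits a drafted sequence into its accepted prefix and the first rejected proposal. It assumes greedy verification (a draft token is accepted only if it equals the target model's token at that position), which is a simplification; the paper's verifier may use sampling-based or tree-based verification. The function name `split_trace` is hypothetical.

```python
def split_trace(draft_tokens, target_tokens):
    """Split a draft into (accepted prefix, first rejected proposal).

    ASSUMPTION: greedy matching against the target model's tokens.
    Returns the accepted tokens and, if a mismatch occurred, a pair
    (proposed_token, target_token) for the first rejected position.
    """
    accepted = []
    rejected = None
    for proposed, expected in zip(draft_tokens, target_tokens):
        if proposed == expected:
            accepted.append(proposed)   # positive feedback
        else:
            rejected = (proposed, expected)  # implicit negative feedback
            break
    return accepted, rejected
```

For example, a draft `[5, 7, 9, 2]` checked against target tokens `[5, 7, 3, 2]` gives the accepted prefix `[5, 7]` and the rejected pair `(9, 3)`; tokens after the first mismatch are discarded.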

If this is right

  • Immediate 1.5x speedup on frontier models upon deployment without prior offline training.
  • Additional 1.25x speedup when adapting to shifts in user traffic compared to static speculators.
  • Prevention of performance degradation due to domain drift in the target model.
  • Hot-swapped speculator updates that maintain continuous service availability.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar RL feedback mechanisms could be applied to other serving optimizations like dynamic batching or KV cache management.
  • The emphasis on end-to-end speedup over acceptance rate alone may encourage rethinking evaluation metrics for acceleration techniques.
  • Deploying on more models could reveal how well the RL signals generalize across different architectures and sizes.

Load-bearing premise

Feedback signals from accepted and rejected tokens during live inference are clean enough to train the speculator stably without excessive tuning.

What would settle it

Running the system on a frontier model and observing either no speedup gain over a static baseline or frequent serving interruptions due to training instability.

Figures

Figures reproduced from arXiv: 2602.06932 by Avner May, Ben Athiwaratkun, Ce Zhang, Chenfeng Xu, Fengxiang Bie, Jisen Li, Junxiong Wang, Percy Liang, Qingyang Wu, Shuaiwen Leon Song, Sri Yanamandra, Tri Dao, Xiaoxia Wu, Yineng Zhang, Yinghui Liu, Yubo Wang, Zelei Shao, Zhongzhu Zhou.

Figure 1. Aurora. A unified training–serving framework for online speculative training with asynchronous, RL-style updates. A production inference server performs speculative decoding with a fixed target (verifier) and a lightweight draft model (speculator), accepting or rejecting proposed tokens during verification. Serving traces, including both accepted and rejected prefixes, are streamed into a data buffer and tra…
Figure 2. Illustration of the Tree Attention mecha…
Figure 3. Mixed streams. Day-0 adaptation of an untrained speculator. (a) The acceptance length starts at one and rapidly increases, converging with the pretrained baseline. (b) The per-request throughput, defined as (T_input + T_output) / t_request, where T_input and T_output are the input and output token counts and t_request is the end-to-end latency, initially suffers but recovers as the speculator adapts, demonstrating …
Figure 4. Ordered streams. Day-0 adaptation of an untrained speculator. (a) The acceptance length starts at one and rapidly increases, converging and sometimes even surpassing the pretrained baseline. (b) The throughput (see definition in Section B.1) initially suffers but recovers as the speculator adapts, demonstrating the effectiveness of the serve-to-train flywheel. (c) Continuing fine-tuning on top of the trained…
Figure 5. A Study of Speculator Asynchronization Policy. More frequent policy refresh improves post-shift adaptation (higher acceptance length) but can reduce serving throughput due to synchronization overhead. A moderately lazy schedule (e.g., Trained w Async 48) provides a strong Pareto point, preserving throughput while retaining most of the adaptation benefit. To quantify this, we sweep the policy update interva…
Figure 6. Moving-average speculative decoding acceptance length over inference requests for Qwen-8B
Figure 7. Moving-average speculative decoding accept length over inference requests for Llama-3.1-…
Figure 8. Aurora increases speculative accept length and boosts throughput, with larger speedups at…
Figure 9. MiniMax M2.1. Top: accepted draft length over time. Bottom: per-request throughput over time. Aurora (Scratch) increases acceptance length to 2.8 and translates it into 1.45× throughput (BS4) gains over the no speculation baseline
Figure 10. Qwen3-Coder-Next. Top: accepted draft length over time. Bottom: per-request throughput over time. We discard the first 1,000 warm-up steps, since hybrid deployments exhibit transient throughput instability during initialization. Despite this variability, Aurora (Scratch) raises the mean accepted draft length to 3 and delivers a 1.21× throughput improvement over the no-speculation baseline (averaged over t…
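The two quantities plotted in these figures can be computed as in the sketch below. The throughput formula follows the definition quoted in the Figure 3 caption; the moving-average window and the function names are assumptions made for illustration, since the captions do not state the window the paper uses.

```python
from collections import deque


def per_request_throughput(t_input, t_output, t_request):
    """Tokens per second for one request: (T_input + T_output) / t_request."""
    return (t_input + t_output) / t_request


def moving_average(values, window):
    """Trailing moving average; early points average over what is available.

    ASSUMPTION: a simple trailing window, chosen only for illustration.
    """
    averaged, recent, total = [], deque(), 0.0
    for value in values:
        recent.append(value)
        total += value
        if len(recent) > window:
            total -= recent.popleft()
        averaged.append(total / len(recent))
    return averaged
```

A request with 100 input tokens, 300 output tokens, and 2.0 seconds of end-to-end latency has a per-request throughput of 200 tokens per second under this definition.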
Original abstract

Speculative decoding can significantly accelerate LLM serving, yet most deployments today disentangle speculator training from serving, treating speculator training as a standalone offline modeling problem. We show that this decoupled formulation introduces substantial deployment and adaptation lag: (1) high time-to-serve, since a speculator must be trained offline for a considerable period before deployment; (2) delayed utility feedback, since the true end-to-end decoding speedup is only known after training and cannot be inferred reliably from acceptance rate alone due to model-architecture and system-level overheads; and (3) domain-drift degradation, as the target model is repurposed to new domains and the speculator becomes stale and less effective. To address these issues, we present Aurora, a unified training-serving system that closes the loop by continuously learning a speculator directly from live inference traces. Aurora reframes online speculator learning as an asynchronous reinforcement-learning problem: accepted tokens provide positive feedback, while rejected speculator proposals provide implicit negative feedback that we exploit to improve sample efficiency. Our design integrates an SGLang-based inference server with an asynchronous training server, enabling hot-swapped speculator updates without service interruption. Crucially, Aurora supports day-0 deployment: a speculator can be served immediately and rapidly adapted to live traffic, improving system performance while providing immediate utility feedback. Across experiments, Aurora achieves a 1.5x day-0 speedup on recently released frontier models (e.g., MiniMax M2.1 229B and Qwen3-Coder-Next 80B). Aurora also adapts effectively to distribution shifts in user traffic, delivering an additional 1.25x speedup over a well-trained but static speculator on widely used models (e.g., Qwen3 and Llama3).
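To see why acceptance rate alone does not determine end-to-end speedup, it helps to look at the standard idealized cost model from the speculative decoding literature (Leviathan et al., 2023). This is not Aurora's measurement method, and it ignores the system-level overheads the abstract mentions; it assumes each draft token is accepted independently with probability `alpha`, that `gamma` tokens are drafted per step, and that one draft forward pass costs `cost_ratio` times one target forward pass.

```python
def expected_tokens_per_step(alpha, gamma):
    """Expected tokens produced per target-model verification step."""
    if alpha == 1.0:
        return gamma + 1
    return (1 - alpha ** (gamma + 1)) / (1 - alpha)


def estimated_speedup(alpha, gamma, cost_ratio):
    """Idealized wall-time speedup over plain autoregressive decoding.

    ASSUMPTION: i.i.d. acceptance and a fixed draft/target cost ratio;
    real deployments add batching, scheduling, and memory overheads.
    """
    step_cost = gamma * cost_ratio + 1  # gamma draft passes + 1 target pass
    return expected_tokens_per_step(alpha, gamma) / step_cost
```

With the same acceptance rate of 0.8 and 4 drafted tokens, a cost ratio of 0.05 gives an estimate of roughly 2.8x while a cost ratio of 0.3 gives roughly 1.5x, and a low acceptance rate with an expensive draft model can fall below 1x. The same acceptance statistics can therefore map to quite different speedups, which is the motivation for measuring utility directly during serving.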

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents Aurora, a unified training-serving system for speculative decoding that closes the loop between inference and speculator training via asynchronous RL. Accepted tokens supply positive feedback while rejected proposals supply implicit negative feedback; the system integrates an SGLang inference server with an asynchronous trainer to enable hot-swapped updates without downtime. The central claims are day-0 deployment with immediate 1.5x speedup on frontier models (MiniMax M2.1 229B, Qwen3-Coder-Next 80B) and an additional 1.25x speedup over static speculators under traffic distribution shifts (Qwen3, Llama3).

Significance. If the empirical results hold under rigorous controls, the work would be significant for production LLM serving: it directly tackles the deployment lag, stale-speculator degradation, and lack of end-to-end utility feedback that currently separate training from serving. The RL framing of online speculator adaptation from live traces is a concrete step toward self-improving inference systems and could influence future designs that treat serving traces as primary training data.

major comments (2)
  1. [Abstract and Experiments] The reported 1.5x day-0 and 1.25x adaptation speedups are presented without any description of experimental setup, number of runs, variance, baseline speculators (including their training data and hyperparameters), or controls for system-level overheads; these omissions are load-bearing because the central claims rest entirely on the magnitude and reliability of the measured speedups.
  2. [§3, RL formulation] The asynchronous RL loop is described as using accepted tokens for positive reward and rejected proposals for implicit negative feedback, yet no reward shaping, advantage normalization, clipping, or handling of the confounding factors (target-model uncertainty, temperature, overhead) is specified; without these, policy-gradient instability is a plausible risk that would contradict the day-0 and “no extensive tuning” assertions.
minor comments (2)
  1. [§3] Notation for the reward signal and the asynchronous update protocol could be formalized with explicit equations to improve reproducibility.
  2. [Figures and Tables] Figure captions and table headers should explicitly state the models, sequence lengths, and hardware used for each speedup measurement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate the revisions planned for the next version of the manuscript.

Point-by-point responses
  1. Referee: [Abstract and Experiments] The reported 1.5x day-0 and 1.25x adaptation speedups are presented without any description of experimental setup, number of runs, variance, baseline speculators (including their training data and hyperparameters), or controls for system-level overheads; these omissions are load-bearing because the central claims rest entirely on the magnitude and reliability of the measured speedups.

    Authors: We agree that the current version of the abstract and Experiments section omits key methodological details. In the revised manuscript we will expand the Experiments section with: (i) a full description of the experimental setup (hardware, SGLang integration, traffic generation, and measurement protocol); (ii) results averaged over five independent runs together with standard deviations; (iii) explicit specifications of all baseline speculators, including the offline training data, hyperparameters, and training duration used for each; and (iv) additional measurements and ablations that isolate system-level overheads (training latency, hot-swap cost, and inference-server contention). These additions will directly substantiate the reported speedups. revision: yes

  2. Referee: [§3, RL formulation] The asynchronous RL loop is described as using accepted tokens for positive reward and rejected proposals for implicit negative feedback, yet no reward shaping, advantage normalization, clipping, or handling of the confounding factors (target-model uncertainty, temperature, overhead) is specified; without these, policy-gradient instability is a plausible risk that would contradict the day-0 and “no extensive tuning” assertions.

    Authors: We acknowledge that §3 currently provides only a high-level description of the reward signal. In the revision we will add a dedicated paragraph that specifies: accepted tokens receive a reward of +1; each rejected proposal receives a length-scaled negative reward of -0.5; advantage is computed with an exponential-moving-average baseline; and no PPO-style clipping is applied because updates remain infrequent and asynchronous. We will also discuss how confounding factors are handled: target-model uncertainty and temperature are absorbed implicitly by optimizing directly on live end-to-end speedup rather than acceptance rate, while system overhead is captured in the same utility signal. The resulting formulation remains simple enough to support day-0 deployment without offline hyper-parameter search, consistent with the empirical stability we observe. revision: yes
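The reward scheme named in the simulated response above (+1 per accepted token, a length-scaled -0.5 for a rejected proposal, and an exponential-moving-average baseline) could look roughly like the following sketch. The response does not say how the length scaling is applied, so this sketch assumes the penalty is scaled by the fraction of the draft that was rejected; that reading, the decay value, and the names `trace_reward` and `EmaBaseline` are all hypothetical.

```python
def trace_reward(num_accepted, num_rejected, draft_len):
    """Reward for one verification step.

    ASSUMPTION: "length-scaled" is read as scaling the -0.5 penalty by the
    rejected fraction of the draft; the source does not define the scaling.
    """
    penalty = -0.5 * (num_rejected / draft_len) if draft_len else 0.0
    return 1.0 * num_accepted + penalty


class EmaBaseline:
    """Exponential-moving-average baseline for computing advantages."""

    def __init__(self, decay=0.9):
        self.decay = decay   # ASSUMPTION: decay value chosen for illustration
        self.value = None

    def advantage(self, reward):
        # The first reward initializes the baseline, so its advantage is zero.
        if self.value is None:
            self.value = reward
        adv = reward - self.value  # compare against the pre-update baseline
        self.value = self.decay * self.value + (1 - self.decay) * reward
        return adv
```

For a draft of 4 tokens with 3 accepted and 1 rejected, this reading gives a reward of 3 - 0.5 * (1/4) = 2.875. No clipping is applied, matching the response's statement that PPO-style clipping is omitted.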

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper presents Aurora as an empirical unified training-serving system that reframes speculator adaptation as an asynchronous RL problem using live inference traces, with accepted tokens as positive feedback and rejected proposals as implicit negative feedback. All central claims (1.5x day-0 speedup on frontier models and 1.25x adaptation to traffic shifts) are grounded in experimental measurements on specific models rather than any closed-form derivation, self-referential equations, or load-bearing self-citations. No mathematical steps are shown that define a quantity in terms of itself or rename a fitted input as a prediction; the RL formulation applies standard policy-gradient ideas to the speculative decoding setting without reducing the reported speedups to the inputs by construction. The system description remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based on abstract only; no explicit free parameters, axioms, or invented entities are detailed. The approach implicitly relies on standard RL assumptions (reward from acceptance/rejection) and system integration assumptions that are not enumerated.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SpecBlock: Block-Iterative Speculative Decoding with Dynamic Tree Drafting

    cs.CL · 2026-05 · unverdicted · novelty 6.0

    SpecBlock achieves 8-19% higher speedup than EAGLE-3 in LLM speculative decoding by using repeated block expansions with hidden-state inheritance, a dynamic rank head, and a valid-prefix training mask.
