PipeSD: An Efficient Cloud-Edge Collaborative Pipeline Inference Framework with Speculative Decoding
Pith reviewed 2026-05-15 03:07 UTC · model grok-4.3
The pith
PipeSD speeds up cloud-edge LLM inference 1.16x-2.16x by pipelining token batches and triggering verification flexibly.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PipeSD overlaps token generation and communication through a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner; the resulting framework, implemented with llama-cpp-python, PyTorch, and FastAPI, delivers 1.16x-2.16x speedup and 14.3%-25.3% lower energy use compared with state-of-the-art baselines across four scenarios and two draft-target model pairs.
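To see why overlapping matters, consider a toy timing model. The sketch below is not PipeSD's scheduler; the batch sizes, per-token draft time, and per-batch upload time are hypothetical placeholders chosen only to illustrate the gain from overlapping generation with communication.

```python
# Toy timing model: sequential vs. pipelined draft-token batching.
# All numbers are hypothetical placeholders, not measurements from the paper.

def sequential_time(batches, gen_per_token, comm_per_batch):
    """Generate a batch, then transmit it, with no overlap."""
    return sum(n * gen_per_token + comm_per_batch for n in batches)

def pipelined_time(batches, gen_per_token, comm_per_batch):
    """Transmission of batch i overlaps generation of batch i+1."""
    gen_done, comm_done = 0.0, 0.0
    for n in batches:
        gen_done += n * gen_per_token          # edge keeps drafting
        send_start = max(gen_done, comm_done)  # uplink must be free
        comm_done = send_start + comm_per_batch
    return comm_done

if __name__ == "__main__":
    batches = [4, 4, 4, 4]   # hypothetical token-batch sizes
    gen_per_token = 0.030    # seconds per drafted token (placeholder)
    comm_per_batch = 0.080   # seconds per batch upload (placeholder)
    seq = sequential_time(batches, gen_per_token, comm_per_batch)
    pipe = pipelined_time(batches, gen_per_token, comm_per_batch)
    print(f"sequential {seq:.3f}s  pipelined {pipe:.3f}s  speedup {seq / pipe:.2f}x")
```

With these placeholder numbers the pipelined schedule finishes in about 0.56 s versus 0.80 s sequentially, roughly a 1.4x gain from overlap alone.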
What carries the argument
- A token-batch pipeline scheduler optimized by dynamic programming, paired with dual-threshold NAV triggering tuned by a Bayesian autotuner; together they overlap generation and communication while flexible verification cuts rollbacks (the triggering logic is sketched below).
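One way to picture the dual-threshold idea is as a gating function over accumulated draft tokens. This is a minimal sketch under assumed semantics: a lower threshold defers verification to avoid premature checks, an upper threshold caps how much work a rollback could discard, and the confidence signal in between is a hypothetical stand-in, not the paper's definition.

```python
# Minimal sketch of a dual-threshold NAV trigger (assumed semantics, not PipeSD's code).
# `min_tokens` and `max_tokens` are the two thresholds an autotuner would adjust;
# `confidence_floor` is a hypothetical draft-confidence gate.

def should_trigger_nav(drafted, confidences,
                       min_tokens=4, max_tokens=16, confidence_floor=0.5):
    """Decide whether the edge should ship its drafted tokens for cloud verification.

    drafted     -- number of unverified draft tokens accumulated so far
    confidences -- per-token draft probabilities for those tokens
    """
    if drafted < min_tokens:
        return False    # too early: avoid premature verification
    if drafted >= max_tokens:
        return True     # too late: cap the potential rollback
    # Between the thresholds, trigger only when the draft starts to look shaky.
    return min(confidences) < confidence_floor

# Example: six drafted tokens, one of them low-confidence -> verify now.
print(should_trigger_nav(6, [0.9, 0.8, 0.95, 0.4, 0.9, 0.85]))  # True
```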
If this is right
- Token generation and communication can be overlapped to raise utilization in distributed LLM inference.
- Flexible non-autoregressive verification reduces premature checks and costly rollbacks.
- Energy consumption falls by 14.3-25.3 percent while generation speed rises across the tested model pairs and scenarios.
- Cloud workload offloading remains compatible with offline robustness and privacy guarantees.
- The same mechanisms apply to multiple draft-target model pairs without per-deployment retuning.
Where Pith is reading between the lines
- The pipelining idea could extend to other distributed AI workloads such as vision or sensor models that also mix local and remote computation.
- Real-time adaptation of the Bayesian thresholds could let the system respond to changing network conditions without manual intervention.
- If the autotuner proves lightweight enough, similar self-tuning could appear in pure edge deployments that occasionally borrow cloud capacity.
Load-bearing premise
The dynamic-programming batch scheduler and Bayesian autotuner will keep delivering stable gains across unseen model pairs, network conditions, and workloads without hidden overhead or extensive per-deployment retuning.
What would settle it
Measuring a speedup below 1.1x, or no energy reduction, when the same implementation runs on a new model pair under different network latency would disprove the claim of consistent outperformance.
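Stated as arithmetic, the disconfirmation test is easy to apply to any rerun. The latency and energy figures below are hypothetical placeholders for such a rerun, not results from the paper.

```python
# Hypothetical rerun on a new model pair / network; all values are placeholders.
baseline_latency_s, pipesd_latency_s = 12.0, 10.2  # end-to-end generation time
baseline_energy_j, pipesd_energy_j = 950.0, 870.0  # measured energy per run

speedup = baseline_latency_s / pipesd_latency_s
energy_reduction = 1.0 - pipesd_energy_j / baseline_energy_j

# The consistency claim would be disconfirmed by speedup < ~1.1x or no energy saving.
print(f"speedup {speedup:.2f}x, energy reduction {energy_reduction:.1%}")
print("disconfirming" if speedup < 1.1 or energy_reduction <= 0 else "consistent with the claim")
```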
Original abstract
Speculative decoding can significantly accelerate LLM inference, especially given that its cloud-edge collaborative deployment offers cloud workload offloading, offline robustness, and privacy enhancement. However, existing collaborative inference frameworks with speculative decoding are constrained by (i) sequential token generation and communication with low resource utilization, and (ii) inflexible cloud non-autoregressive verification (NAV) triggering that induces premature verification or costly rollbacks. In this paper, we propose PipeSD, an efficient cloud-edge collaborative pipeline inference framework with speculative decoding. PipeSD overlaps token generation and communication by a token-batch pipeline scheduling mechanism optimized by dynamic programming, and improves verification flexibility through a dual-threshold NAV triggering mechanism with a lightweight Bayesian optimization autotuner. We implement PipeSD using llama-cpp-python, PyTorch, and FastAPI, and evaluate it on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios. Results show that PipeSD consistently outperforms state-of-the-art baselines, achieving 1.16x-2.16x speedup and reducing energy consumption by 14.3%-25.3%.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PipeSD, a cloud-edge collaborative pipeline inference framework for large language models using speculative decoding. It proposes a token-batch pipeline scheduling mechanism optimized via dynamic programming to overlap generation and communication, along with a dual-threshold non-autoregressive verification (NAV) triggering mechanism enhanced by a lightweight Bayesian optimization autotuner. The framework is implemented using llama-cpp-python, PyTorch, and FastAPI, and evaluated on a real-world cloud-edge testbed with two draft-target model pairs across four scenarios, claiming consistent outperformance of state-of-the-art baselines with speedups of 1.16x-2.16x and energy reductions of 14.3%-25.3%.
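One way to read the "lightweight Bayesian optimization autotuner" is as a small black-box search over the two NAV thresholds. The sketch below uses scikit-optimize's gp_minimize purely as a stand-in; the paper does not specify its optimizer, and the objective, search ranges, and library choice here are assumptions, not PipeSD's implementation.

```python
# Sketch of a dual-threshold autotuner using scikit-optimize as a stand-in
# for the paper's lightweight Bayesian optimizer (an assumption, not PipeSD's code).
from skopt import gp_minimize

def measured_latency(thresholds):
    """Placeholder objective: a real deployment would run a short profiling
    window with these (min_tokens, max_tokens) thresholds and return the
    observed per-token latency. A synthetic bowl-shaped surface stands in here."""
    min_tokens, max_tokens = thresholds
    return 1.0 + 0.01 * (min_tokens - 5) ** 2 + 0.005 * (max_tokens - 14) ** 2

result = gp_minimize(
    measured_latency,
    dimensions=[(2, 10), (8, 32)],  # integer search ranges for the two thresholds
    n_calls=20,
    random_state=0,
)
print("tuned thresholds:", result.x, "estimated latency:", round(result.fun, 3))
```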
Significance. If the empirical results hold under broader conditions, PipeSD could meaningfully advance efficient distributed inference for LLMs by improving pipeline utilization and verification flexibility in cloud-edge setups. The use of dynamic programming for scheduling and Bayesian tuning for triggering offers a principled approach to optimization that may generalize if validated more extensively.
Major comments (1)
- [Evaluation] The experiments cover only two draft-target model pairs on one testbed across four scenarios. This limited scope leaves the generalization of the dynamic-programming batch scheduler and Bayesian autotuner unproven, as the mechanisms may incur hidden overhead or require per-deployment retuning under varying model scales, network conditions, or workloads, undermining the claim of consistent speedups.
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We appreciate the positive assessment of the paper's potential impact and address the major comment on evaluation below.
Point-by-point responses
Referee: [Evaluation] The experiments cover only two draft-target model pairs on one testbed across four scenarios. This limited scope leaves the generalization of the dynamic-programming batch scheduler and Bayesian autotuner unproven, as the mechanisms may incur hidden overhead or require per-deployment retuning under varying model scales, network conditions, or workloads, undermining the claim of consistent speedups.
Authors: We thank the referee for pointing out the limited scope of our experiments. While the evaluation is indeed restricted to two model pairs and one testbed, these were selected to cover a range of practical cloud-edge conditions through the four scenarios, which vary in terms of communication latency and bandwidth. The dynamic-programming-based scheduler is designed to be general, as it takes as input the profiled computation and communication times for any given model pair and network, solving for the optimal pipeline schedule without assuming specific model scales. Similarly, the Bayesian autotuner optimizes the dual thresholds based on empirical performance data from the deployment, allowing adaptation to different workloads. We have measured and reported the overhead of these mechanisms in Section 5, showing they are negligible. To better address generalization, in the revised version we will expand the 'Discussion' section to include an analysis of how the proposed mechanisms can be applied to other model sizes and network conditions, along with potential limitations.
Revision: partial
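To make the rebuttal's generality argument concrete, here is one plausible dynamic-programming formulation over draft-token batch boundaries. The recurrence, the communication cost model (fixed link latency plus per-token transfer time), and all numbers are illustrative assumptions, not the paper's scheduler; the only inputs are profiled timings, which is the sense in which such a scheduler stays model-agnostic.

```python
# Sketch of a dynamic program over draft-token batch boundaries.
# Inputs are profiled per-token generation time and a simple communication
# cost model; the formulation is illustrative, not the paper's recurrence.

def schedule_batches(n_tokens, t_gen, link_latency, t_comm, max_batch):
    """Return (makespan, batch_sizes) minimizing pipelined completion time.

    dp[i] is the earliest time the first i drafted tokens can finish uploading,
    assuming transmission of a batch overlaps generation of later tokens.
    """
    INF = float("inf")
    dp = [0.0] + [INF] * n_tokens
    choice = [0] * (n_tokens + 1)
    for i in range(1, n_tokens + 1):
        gen_ready = i * t_gen                      # token i is drafted at this time
        for b in range(1, min(max_batch, i) + 1):  # size of the last batch
            finish = max(gen_ready, dp[i - b]) + link_latency + b * t_comm
            if finish < dp[i]:
                dp[i], choice[i] = finish, b
    sizes, i = [], n_tokens                        # recover the chosen batch sizes
    while i > 0:
        sizes.append(choice[i])
        i -= choice[i]
    return dp[n_tokens], sizes[::-1]

if __name__ == "__main__":
    makespan, sizes = schedule_batches(
        n_tokens=16, t_gen=0.03, link_latency=0.05, t_comm=0.002, max_batch=8)
    print(f"makespan {makespan:.3f}s with batches {sizes}")
```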
Circularity Check
No circularity: claims rest on direct empirical measurements
Full rationale
The paper describes a pipeline scheduling mechanism using dynamic programming and a dual-threshold NAV trigger with Bayesian autotuner, then reports measured speedups (1.16x-2.16x) and energy reductions from implementation on a specific cloud-edge testbed with two model pairs. No equations, predictions, or uniqueness theorems are presented that reduce by construction to fitted inputs, self-citations, or renamed ansatzes; the results are direct testbed outputs rather than derived quantities forced by the method itself.