Recognition: no theorem link
CATS: Cascaded Adaptive Tree Speculation for Memory-Limited LLM Inference Acceleration
Pith reviewed 2026-05-13 03:58 UTC · model grok-4.3
The pith
CATS achieves up to a 5.08x wall-clock speedup for LLM inference on memory-limited edge devices through cascaded adaptive tree speculation, without increasing peak memory usage or degrading generation quality.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CATS is a self-speculative decoding framework for memory-limited devices that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. Evaluations show a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.
What carries the argument
Cascaded adaptive tree speculation: a staged verification process adapted to memory constraints and offloading patterns that enables self-speculation by maximizing accepted draft tokens per target model invocation.
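The accept-or-correct mechanics this relies on can be sketched in miniature (a toy illustration of draft-then-verify, not CATS's actual cascaded tree algorithm; the two "models" here are made-up deterministic stand-ins):

```python
# Toy sketch of the draft-then-verify loop that self-speculative decoding
# builds on. Real systems verify a *tree* of drafts in one batched target
# pass; a single draft chain keeps the accept/correct logic visible.

def greedy_target(prefix):
    # Stand-in for the target model: a deterministic next-token rule.
    return (prefix[-1] + 1) % 10

def greedy_draft(prefix):
    # Stand-in for a cheaper drafter that is right most of the time.
    return (prefix[-1] + 1) % 10 if prefix[-1] != 3 else 0  # wrong after a 3

def speculate_step(prefix, k=4):
    """Draft k tokens, verify them against the target in order, and return
    the accepted drafts plus one target-produced correction/bonus token."""
    drafts, ctx = [], list(prefix)
    for _ in range(k):
        t = greedy_draft(ctx)
        drafts.append(t)
        ctx.append(t)
    accepted, ctx = [], list(prefix)
    for t in drafts:
        if greedy_target(ctx) == t:      # draft token verified: accept it
            accepted.append(t)
            ctx.append(t)
        else:                            # first mismatch: correct and stop
            accepted.append(greedy_target(ctx))
            return accepted
    accepted.append(greedy_target(ctx))  # bonus token after full acceptance
    return accepted

print(speculate_step([1], k=4))  # → [2, 3, 4]: two drafts accepted, one corrected
```

Every emitted token still comes from the target rule, which is why acceptance rate affects speed but not output quality: the more drafts survive verification, the more output tokens each target invocation yields.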
If this is right
- Fewer invocations of the full target model are needed per output token due to higher acceptance rates.
- Inference runs faster in real time on edge devices without hardware upgrades.
- Output quality stays the same as standard decoding since incorrect drafts are corrected.
- The framework works across multiple LLM architectures and evaluation benchmarks.
Where Pith is reading between the lines
- Similar cascaded approaches might improve efficiency in other generative models facing memory limits, such as diffusion models.
- Integrating with dynamic offloading could further optimize for fluctuating memory availability during long generations.
- This opens possibilities for deploying larger LLMs on consumer hardware by reducing the effective memory requirement for acceleration techniques.
Load-bearing premise
That conducting cascaded verification and correction based on the memory budget and parameter offloading patterns maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone.
What would settle it
Deploying CATS on a specific edge device with a known memory limit, running inference on a benchmark, and verifying if the measured wall-clock time reduction approaches 5x, memory usage does not exceed the target model's, and quality metrics like perplexity remain equivalent to the baseline model.
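A settling experiment would take roughly this shape (a minimal harness sketch with placeholder decoders; `baseline_decode` and `cats_decode` are hypothetical names, and on a real edge device one would read accelerator/DRAM peaks rather than Python-heap memory):

```python
import time
import tracemalloc

def measure(decode_fn, prompts):
    """Return outputs, wall-clock time, and peak Python-heap bytes for one
    decoder. (tracemalloc only sees the Python heap; a real evaluation
    would sample device memory instead.)"""
    tracemalloc.start()
    t0 = time.perf_counter()
    outputs = [decode_fn(p) for p in prompts]
    elapsed = time.perf_counter() - t0
    _, peak = tracemalloc.get_traced_memory()
    tracemalloc.stop()
    return outputs, elapsed, peak

# Placeholders for the two systems under test.
baseline_decode = lambda p: p.upper()
cats_decode = lambda p: p.upper()

prompts = ["a prompt", "another prompt"]
base_out, base_t, base_mem = measure(baseline_decode, prompts)
cats_out, cats_t, cats_mem = measure(cats_decode, prompts)

assert cats_out == base_out                  # losslessness: identical outputs
speedup = base_t / cats_t                    # claim: approaches 5x
mem_ratio = cats_mem / max(base_mem, 1)      # claim: <= 1.0 (no extra footprint)
```

The three assertions mirror the three claims: identical outputs (or equal perplexity), wall-clock speedup, and a memory ratio no greater than one.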
Original abstract
Auto-regressive decoding in Large Language Models (LLMs) is inherently memory-bound: every generation step requires loading the model weights and intermediate results from memory (e.g., High-Bandwidth Memory (HBM) for GPU servers), making throughput bottlenecked by memory bandwidth rather than compute. Speculative decoding addresses this by enabling parallel verification of multiple draft tokens, effectively amortizing the cost of each target-model call. However, existing speculative decoding methods are designed under the assumption that HBM is sufficiently large to hold both the target model and an auxiliary draft model simultaneously -- an assumption that breaks down on memory-constrained devices such as edge platforms with limited DRAM. We analyze the inference bottleneck in this memory-limited regime and propose CATS, a self-speculative decoding framework that conducts cascaded verification and correction based on the memory budget and parameter offloading patterns on memory-limited devices. This design maximizes token acceptance rate and end-to-end speedup while keeping the peak memory footprint on the device equal to that of the target model alone. We evaluate CATS on different models across five benchmarks on real edge devices. CATS can achieve a wall-clock speedup of up to 5.08x with no degradation in generation quality, outperforming the SOTA method by up to 1.45x under edge memory constraints.
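As a back-of-envelope check on the amortization argument (illustrative numbers of our own, not figures from the paper): if a memory-bound target pass costs roughly one unit, drafting is cheap, and α tokens are produced per target pass on average, the speedup is about α divided by the per-round overhead factor.

```python
def expected_speedup(alpha, c_target, c_draft_per_token, k):
    """Rough speculative-decoding speedup estimate: tokens produced per unit
    time relative to plain decoding (one target pass per token).
    alpha = average tokens emitted per round (accepted drafts + 1)."""
    time_per_round = c_target + k * c_draft_per_token
    return alpha * c_target / time_per_round

# Illustrative: target pass ~1.0, drafts ~0.02 each, 6 tokens per round
# from an 8-token speculation budget.
print(round(expected_speedup(6, 1.0, 0.02, 8), 2))  # → 5.17
```

With these made-up but plausible constants the estimate lands near the paper's reported 5.08x, which shows why acceptance rate (α) is the quantity the method tries to maximize.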
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CATS, a self-speculative decoding framework for memory-limited LLM inference that performs cascaded verification and correction guided by memory budget and parameter offloading patterns. It claims this maximizes token acceptance rate while ensuring the device's peak memory footprint equals that of the target model alone, yielding up to 5.08x wall-clock speedup with no quality degradation and 1.45x improvement over SOTA methods on edge devices across five benchmarks.
Significance. If the memory-equality claim and empirical speedups hold under rigorous verification, the work would be significant for practical LLM deployment on DRAM-constrained edge platforms, where standard speculative decoding fails due to simultaneous model storage requirements. The empirical focus on real devices and parameter-free adaptation to offloading patterns strengthens its potential impact.
Major comments (3)
- [Abstract, §4] Abstract and §4 (method): The central claim that 'peak memory footprint on the device equal to that of the target model alone' is load-bearing for all reported speedups, yet no memory breakdown, profiling results, or ablation isolating overheads from the adaptive tree, draft-token storage, verification states, or correction logic is provided. This leaves the skeptic's concern about auxiliary buffers unaddressed.
- [§5] §5 (experiments): The reported 5.08x wall-clock speedup and 1.45x over SOTA lack details on experimental setup, exact baselines, acceptance-rate calculation, statistical measures (e.g., variance across runs), or how generation quality was assessed (e.g., perplexity, human eval). This makes the 'no degradation' claim difficult to verify.
- [§3] §3 (analysis): The assumption that cascaded verification based on offloading patterns always maximizes acceptance rate without increasing peak memory is not supported by a concrete memory model or counter-example analysis; a modest temporary allocation for tree nodes could force weight eviction in the target regime.
Minor comments (2)
- [§4] Notation for the cascaded tree structure and offloading schedule could be clarified with a small diagram or pseudocode in §4.
- [Abstract] The abstract mentions 'five benchmarks' but does not name them; listing them explicitly would aid reproducibility.
Simulated Author's Rebuttal
We appreciate the referee's thorough review and constructive suggestions. We address each of the major comments below and plan to revise the manuscript to incorporate additional details and clarifications as outlined.
Point-by-point responses
Referee: [Abstract, §4] Abstract and §4 (method): The central claim that 'peak memory footprint on the device equal to that of the target model alone' is load-bearing for all reported speedups, yet no memory breakdown, profiling results, or ablation isolating overheads from the adaptive tree, draft-token storage, verification states, or correction logic is provided. This leaves the skeptic's concern about auxiliary buffers unaddressed.
Authors: We agree with the referee that a memory breakdown is necessary to support the central claim. Although the CATS framework is designed such that all operations occur within the memory footprint of the target model by adapting to offloading patterns and using cascaded structures that reuse memory, we did not include explicit profiling in the original submission. In the revised version, we will add a memory usage analysis, including breakdowns of peak memory during different phases and ablations on auxiliary overheads, to confirm no additional memory is required beyond the target model. revision: yes
Referee: [§5] §5 (experiments): The reported 5.08x wall-clock speedup and 1.45x over SOTA lack details on experimental setup, exact baselines, acceptance-rate calculation, statistical measures (e.g., variance across runs), or how generation quality was assessed (e.g., perplexity, human eval). This makes the 'no degradation' claim difficult to verify.
Authors: The experimental details were summarized due to space constraints, but we acknowledge the need for more transparency. We will revise §5 to provide full details on the experimental setup (including device specs, baseline code references if available, acceptance rate computation formula), report variance across runs, and specify quality metrics used (perplexity and sample outputs). This will substantiate the speedup claims and no-degradation assertion. revision: yes
Referee: [§3] §3 (analysis): The assumption that cascaded verification based on offloading patterns always maximizes acceptance rate without increasing peak memory is not supported by a concrete memory model or counter-example analysis; a modest temporary allocation for tree nodes could force weight eviction in the target regime.
Authors: We will strengthen §3 by introducing a formal memory model that accounts for temporary allocations during tree speculation and verification. This model will demonstrate that under the offloading patterns, the peak memory does not exceed the target model's requirement, as temporary buffers are allocated in memory freed by offloaded weights. We will also discuss potential counter-examples and why they do not apply in our cascaded adaptive approach. revision: yes
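The invariant the authors promise to formalize can be phrased simply (our paraphrase with made-up byte counts, not the paper's actual memory model): in every decoding phase, resident weights plus temporary speculation buffers must fit within the full target-model footprint, because buffers live in the space freed by offloaded weights.

```python
def peak_within_budget(phases, target_model_bytes):
    """Check the claimed invariant: for every phase,
    resident weight bytes + temporary buffer bytes <= full target footprint."""
    return all(weights + buffers <= target_model_bytes
               for weights, buffers in phases)

# (resident_weight_bytes, temp_buffer_bytes) per phase; illustrative values.
phases = [
    (10_000, 0),      # full verification pass, no speculation buffers live
    (7_000, 2_500),   # early layers resident, tree buffers in freed space
    (8_500, 1_000),   # correction pass
]
print(peak_within_budget(phases, 10_000))  # → True
```

The referee's counter-example is exactly the case where some phase violates this inequality, forcing weight eviction; a formal version of the check above, instantiated with profiled numbers, would settle the point.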
Circularity Check
No circularity detected; empirical framework with independent evaluation
Full rationale
The paper introduces CATS as a practical algorithmic adaptation of speculative decoding for memory-limited edge devices, relying on cascaded verification/correction tuned to offloading patterns. All reported speedups (up to 5.08x) and memory claims are grounded in direct wall-clock measurements on real hardware across benchmarks, not in any closed-loop fitting, self-referential definitions, or load-bearing self-citations. The design goal of matching the target model's peak memory footprint is stated as an engineering constraint that the method satisfies experimentally, without equations or derivations that reduce to their own inputs by construction. No self-citation chains or ansatzes are invoked to justify core results.