A Survey on Efficient Inference for Large Language Models
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-15 02:36 UTC · model grok-4.3
The pith
A survey organizes methods for efficient large language model inference into data-level, model-level, and system-level categories and benchmarks representative techniques.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that existing efficiency techniques can be systematically classified into data-level (e.g., prompt compression and output organization), model-level (e.g., quantization, pruning, distillation, sparse attention, and efficient architectures), and system-level (e.g., kernel fusion, memory management, and serving frameworks) optimizations, with comparative experiments revealing consistent patterns in latency and memory reduction across these categories.
What carries the argument
The three-tier taxonomy of data-level, model-level, and system-level optimization, which structures the surveyed methods and supports direct quantitative comparison of their efficiency gains.
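To make the taxonomy concrete, the following minimal Python sketch encodes the three levels as an enumeration and files a few well-known techniques under them. The placements follow the survey's organization (prompt compression under data-level, compression and attention changes under model-level, kernel and serving optimizations under system-level), but the method list itself is illustrative rather than the survey's full coverage.

    from enum import Enum

    class Level(Enum):
        DATA = "data-level"      # shortens or restructures inputs and outputs
        MODEL = "model-level"    # changes the architecture or compresses weights
        SYSTEM = "system-level"  # changes the runtime and serving stack

    # Illustrative, non-exhaustive placement of well-known techniques.
    TAXONOMY = {
        "prompt compression (e.g., LLMLingua)": Level.DATA,
        "retrieval-augmented generation": Level.DATA,
        "skeleton-of-thought output organization": Level.DATA,
        "post-training quantization": Level.MODEL,
        "structured pruning": Level.MODEL,
        "knowledge distillation": Level.MODEL,
        "sparse or grouped-query attention": Level.MODEL,
        "fused attention kernels": Level.SYSTEM,
        "paged KV-cache memory management": Level.SYSTEM,
        "continuous batching in serving frameworks": Level.SYSTEM,
    }

    def methods_at(level: Level) -> list:
        """Return the example techniques filed under one level of the taxonomy."""
        return [name for name, lvl in TAXONOMY.items() if lvl is level]

    if __name__ == "__main__":
        for level in Level:
            print(level.value, "->", methods_at(level))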
If this is right
- Data-level methods such as prompt compression shorten inputs while preserving most accuracy.
- Model-level changes like sparse attention cut the quadratic cost of self-attention (a rough FLOP comparison follows this list).
- System-level improvements raise throughput in multi-user serving without altering the model.
- Hybrid combinations across levels produce larger gains than isolated techniques.
- The taxonomy supplies a basis for future automated selection of optimization stacks.
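As a rough illustration of the attention bullet above, the sketch below compares the floating-point cost of dense self-attention, which grows quadratically with sequence length, with a fixed-window sparse variant that grows linearly. The hidden size and window are assumed values chosen for illustration, not figures from the survey's experiments.

    def full_attention_flops(seq_len: int, d_model: int) -> float:
        """Approximate per-layer cost of dense self-attention:
        QK^T scores (~2*n^2*d) plus the attention-weighted values (~2*n^2*d)."""
        return 4.0 * seq_len ** 2 * d_model

    def windowed_attention_flops(seq_len: int, d_model: int, window: int) -> float:
        """Same cost model, but each token attends to at most `window` positions,
        so the quadratic n*n term becomes n*window."""
        return 4.0 * seq_len * min(window, seq_len) * d_model

    if __name__ == "__main__":
        d_model, window = 4096, 512  # assumed, illustrative values
        for n in (1_024, 8_192, 65_536):
            dense = full_attention_flops(n, d_model)
            sparse = windowed_attention_flops(n, d_model, window)
            print(f"n={n:>6}: dense {dense:.2e} FLOPs, "
                  f"windowed {sparse:.2e} FLOPs, ratio {dense / sparse:.0f}x")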
Where Pith is reading between the lines
- The quantitative comparisons could inform hardware-aware selection rules for edge versus cloud deployment (a toy decision rule is sketched after this list).
- The same three-level structure may extend to efficient training or multimodal inference pipelines.
- Researchers could test whether the taxonomy remains stable when applied to mixture-of-experts or state-space models.
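As a purely illustrative companion to the first point above, here is a toy decision rule that picks one technique per taxonomy level from a memory budget and a deployment mode. The thresholds and choices are hypothetical and are not derived from the survey's measurements.

    def pick_stack(memory_budget_gib: float, multi_user: bool) -> list:
        """Toy hardware-aware rule: choose one technique per taxonomy level.
        Thresholds and selections are hypothetical, for illustration only."""
        stack = ["prompt compression"]                        # data-level default
        if memory_budget_gib < 8:                             # edge-class device
            stack.append("4-bit weight quantization")         # model-level
        elif memory_budget_gib < 24:
            stack.append("8-bit weight quantization")
        else:                                                 # cloud-class GPU
            stack.append("sparse attention for long contexts")
        if multi_user:
            stack.append("paged KV cache with continuous batching")  # system-level
        return stack

    if __name__ == "__main__":
        print("edge :", pick_stack(6, multi_user=False))
        print("cloud:", pick_stack(80, multi_user=True))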
Load-bearing premise
The chosen representative methods and experimental setups fairly capture performance differences across the broader literature without significant selection bias.
What would settle it
Showing that a widely used technique cannot be placed in any of the three categories, or that the reported speedups vanish on models larger than those tested, would undermine the taxonomy's completeness and the generalizability of the comparative results.
Original abstract
Large Language Models (LLMs) have attracted extensive attention due to their remarkable performance across various tasks. However, the substantial computational and memory requirements of LLM inference pose challenges for deployment in resource-constrained scenarios. Efforts within the field have been directed towards developing techniques aimed at enhancing the efficiency of LLM inference. This paper presents a comprehensive survey of the existing literature on efficient LLM inference. We start by analyzing the primary causes of the inefficient LLM inference, i.e., the large model size, the quadratic-complexity attention operation, and the auto-regressive decoding approach. Then, we introduce a comprehensive taxonomy that organizes the current literature into data-level, model-level, and system-level optimization. Moreover, the paper includes comparative experiments on representative methods within critical sub-fields to provide quantitative insights. Last but not least, we provide some knowledge summary and discuss future research directions.
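The abstract's three causes of inefficiency invite a back-of-envelope estimate. The sketch below uses assumed dimensions roughly in the range of a 7B-parameter decoder, not figures reported in the paper, to show how weight memory scales with numeric precision and how the KV cache accumulated by auto-regressive decoding grows with context length.

    def weight_memory_gib(n_params: float, bits_per_weight: int) -> float:
        """Memory occupied by the model weights alone, in GiB."""
        return n_params * bits_per_weight / 8 / 2 ** 30

    def kv_cache_gib(seq_len: int, n_layers: int, n_kv_heads: int,
                     head_dim: int, bits: int = 16) -> float:
        """KV cache for a single sequence: keys and values stored for every
        layer, head, and prompt-or-generated position."""
        bytes_per_elem = bits / 8
        return 2 * n_layers * n_kv_heads * head_dim * seq_len * bytes_per_elem / 2 ** 30

    if __name__ == "__main__":
        # Assumed, roughly Llama-7B-like configuration (illustrative only).
        n_params, n_layers, n_kv_heads, head_dim = 7e9, 32, 32, 128
        for bits in (16, 8, 4):
            print(f"{bits:>2}-bit weights: {weight_memory_gib(n_params, bits):5.1f} GiB")
        for seq_len in (2_048, 32_768):
            print(f"KV cache at {seq_len:>6} tokens: "
                  f"{kv_cache_gib(seq_len, n_layers, n_kv_heads, head_dim):5.2f} GiB")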
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper surveys techniques for efficient inference in Large Language Models. It identifies primary causes of inefficiency (large model size, quadratic attention complexity, and autoregressive decoding), organizes the literature via a taxonomy into data-level, model-level, and system-level optimizations, presents comparative experiments on representative methods in key sub-fields to supply quantitative insights, and discusses future directions.
Significance. If the experimental comparisons hold, the survey offers a useful organizing framework for a fast-growing area and supplies concrete quantitative benchmarks that can inform deployment decisions. The taxonomy and experiments together provide more actionable guidance than a purely descriptive review.
Major comments (1)
- [Comparative Experiments] Section describing the comparative experiments: the manuscript states that experiments were run on 'representative methods' but supplies no explicit, reproducible selection protocol (citation thresholds, recency cutoffs, implementation availability, or hardware filters). Without such criteria the reported speed/accuracy trade-offs cannot be shown to be free of selection bias and therefore do not reliably generalize to the full literature covered by the taxonomy.
Minor comments (1)
- [Abstract] Abstract: the phrase 'comparative experiments on representative methods' would be clearer if it named the primary metrics (e.g., latency, throughput, memory) and the number of methods compared.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation for major revision. The feedback on the comparative experiments section is well-taken, and we have revised the manuscript to include an explicit, reproducible selection protocol. This strengthens the transparency and generalizability of the quantitative results.
Point-by-point responses
- Referee: Section describing the comparative experiments: the manuscript states that experiments were run on 'representative methods' but supplies no explicit, reproducible selection protocol (citation thresholds, recency cutoffs, implementation availability, or hardware filters). Without such criteria the reported speed/accuracy trade-offs cannot be shown to be free of selection bias and therefore do not reliably generalize to the full literature covered by the taxonomy.
  Authors: We agree that the original manuscript lacked a clear selection protocol, which limits reproducibility. In the revised version, we have added a dedicated subsection (now Section 4.1) that explicitly defines the criteria used: (1) methods with publicly available open-source implementations at the time of writing, (2) publications in top-tier venues (NeurIPS, ICML, ICLR, ACL, EMNLP) from 2022 onward, (3) coverage of at least one representative technique per major sub-category in the taxonomy, and (4) evaluation on consistent hardware (A100 GPUs) and model backbones (Llama-7B/13B). We also include a new table (Table 1) listing all selected methods with their original citations and implementation links. These additions directly address potential selection bias and allow readers to replicate or extend the comparisons.
  Revision: yes
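To illustrate how such a protocol could be made mechanically checkable, the sketch below encodes the per-method criteria from the simulated rebuttal as an executable filter. The record fields and the example candidates are hypothetical; criterion (3), covering every major sub-category, is a property of the selected set rather than of any single method, so it is only noted in a comment.

    from dataclasses import dataclass

    ACCEPTED_VENUES = {"NeurIPS", "ICML", "ICLR", "ACL", "EMNLP"}
    REQUIRED_BACKBONES = {"Llama-7B", "Llama-13B"}

    @dataclass
    class Candidate:
        name: str
        venue: str
        year: int
        open_source: bool
        hardware: str        # hardware used in the reported evaluation
        backbones: tuple     # model backbones evaluated

    def satisfies_protocol(c: Candidate) -> bool:
        """Criteria (1), (2), and (4) from the revised Section 4.1; criterion (3),
        one representative per sub-category, must be checked on the selected set."""
        return (
            c.open_source
            and c.venue in ACCEPTED_VENUES
            and c.year >= 2022
            and c.hardware == "A100"
            and any(b in REQUIRED_BACKBONES for b in c.backbones)
        )

    if __name__ == "__main__":
        # Hypothetical candidates, for illustration only.
        pool = [
            Candidate("MethodA", "NeurIPS", 2023, True, "A100", ("Llama-7B",)),
            Candidate("MethodB", "arXiv", 2024, True, "A100", ("Llama-13B",)),
            Candidate("MethodC", "ICML", 2021, True, "A100", ("Llama-7B",)),
        ]
        print([c.name for c in pool if satisfies_protocol(c)])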
Circularity Check
No circularity: survey organizes external literature without self-referential derivations
Full rationale
This is a survey paper that summarizes and taxonomizes existing work on efficient LLM inference into data-level, model-level, and system-level categories, with comparative experiments on representative methods drawn from the broader literature. No original derivations, equations, fitted parameters, or predictions are presented that could reduce to self-defined inputs by construction. All claims reference external citations, and the taxonomy serves as an organizational framework rather than a derived result. The selection of representative methods for experiments does not constitute circularity under the defined patterns, as it involves no self-definition, fitted-input renaming, or load-bearing self-citation chains.
Axiom & Free-Parameter Ledger
Forward citations
Cited by 22 Pith papers
- KVServe: Service-Aware KV Cache Compression for Communication-Efficient Disaggregated LLM Serving
  KVServe delivers up to 9.13x job completion time speedup and 32.8x time-to-first-token reduction by making KV cache compression service-aware and adaptive in disaggregated LLM serving.
- Breaking the Autoregressive Chain: Hyper-Parallel Decoding for Efficient LLM-Based Attribute Value Extraction
  Hyper-Parallel Decoding enables parallel generation of independent sequences in LLMs via position ID manipulation, delivering up to 13.8X speedup for attribute value extraction.
- Choose, Don't Label: Multiple-Choice Query Synthesis for Program Disambiguation
  Multiple-choice queries synthesized from Hoare triples enable more reliable identification of intended programs than labeled-example supervision in active learning for program disambiguation.
- TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
  TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
- ENEC: A Lossless AI Model Compression Method Enabling Fast Inference on Ascend NPUs
  ENEC delivers 3.43X higher throughput than DietGPU and 1.12X better compression ratio than nvCOMP for lossless model weight compression on Ascend NPUs, yielding up to 6.3X end-to-end inference speedup.
- RailVQA: A Benchmark and Framework for Efficient Interpretable Visual Cognition in Automatic Train Operation
  RailVQA-bench supplies 21,168 QA pairs for ATO visual cognition while RailVQA-CoM combines large-model reasoning with small-model efficiency via transparent modules and temporal sampling.
- OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
  OSAQ uses the low-rank structure of the Hessian to construct a closed-form additive weight transformation that suppresses outliers without changing task loss, enabling better low-bit LLM quantization.
- OSAQ: Outlier Self-Absorption for Accurate Low-bit LLM Quantization
  OSAQ suppresses weight outliers in LLMs via a closed-form additive transformation from the Hessian's stable null space, improving 2-bit quantization perplexity by over 40% versus vanilla GPTQ with no inference overhead.
- Compress Then Adapt? No, Do It Together via Task-aware Union of Subspaces
  JACTUS unifies low-rank compression and task adaptation via a task-aware union of subspaces and global rank allocation by marginal gain, outperforming 100% PEFT methods like DoRA on ViT-Base (89.2% avg) and Llama2-7B ...
- Joint Architecture-Token-Bitwidth Multi-Axis Optimization of Vision Transformers for Semiconductor IC Packaging
  A joint architecture-token-bitwidth optimization of Vision Transformers delivers over 10x gains in throughput, parameters, FLOPs and energy on a semiconductor defect classification task while preserving required accuracy.
- DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
  DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
- LBLLM: Lightweight Binarization of Large Language Models via Three-Stage Distillation
  LBLLM achieves better accuracy than prior binarization methods for LLMs by decoupling weight and activation quantization through initialization, layer-wise distillation, and learnable activation scaling.
- Strix: Re-thinking NPU Reliability from a System Perspective
  Strix delivers sub-microsecond fault localisation, detection, and correction on NPUs with 1.04x slowdown and minimal hardware cost by system-level re-partitioning and targeted safeguards.
- MP-ISMoE: Mixed-Precision Interactive Side Mixture-of-Experts for Efficient Transfer Learning
  MP-ISMoE uses Gaussian noise perturbed iterative quantization and interactive side mixture-of-experts to deliver higher accuracy than prior memory-efficient transfer learning methods while keeping similar parameter an...
- Paper Espresso: From Paper Overload to Research Insight
  Paper Espresso deploys LLMs to summarize and analyze trends across 13,300+ arXiv papers over 35 months, releasing metadata that shows non-saturating topic growth and higher engagement for novel topics.
- Prompt Compression in the Wild: Measuring Latency, Rate Adherence, and Quality for Faster LLM Inference
  LLMLingua prompt compression yields up to 18% end-to-end LLM speedups with unchanged quality when prompt length, ratio, and hardware align, plus an open profiler to predict the break-even point.
- FASTER: Rethinking Real-Time Flow VLAs
  FASTER uses a horizon-aware flow sampling schedule to compress immediate-action denoising to one step, slashing effective reaction latency in real-robot VLA deployments.
- Don't Waste Bits! Adaptive KV-Cache Quantization for Lightweight On-Device LLMs
  A data-driven adaptive policy for KV-cache bit-width selection based on token importance features reduces decoding latency by ~18% and improves accuracy over static quantization while staying near FP16 levels on SmolL...
- Transparent Screening for LLM Inference and Training Impacts
  The paper proposes a transparent proxy framework for estimating LLM inference and training environmental impacts from natural-language application descriptions.
- Gemma 4, Phi-4, and Qwen3: Accuracy-Efficiency Tradeoffs in Dense and MoE Reasoning Language Models
  Gemma-4-E4B with few-shot chain-of-thought reaches the highest weighted accuracy of 0.675 at 14.9 GB VRAM, while the larger Gemma-4-26B-A4B MoE model scores 0.663 but uses 48.1 GB.
- Efficient Inference for Large Vision-Language Models: Bottlenecks, Techniques, and Prospects
  A survey that taxonomizes efficiency methods for LVLMs across the full inference pipeline, decouples the problem into information density, long-context attention, and memory limits, and outlines four future research f...
- Cloud-native and Distributed Systems for Efficient and Scalable Large Language Models -- A Research Agenda
  This research agenda argues that cloud-native architectures, microservices, autoscaling, and emerging trends like serverless inference and federated learning are required to make large language models efficient and scalable.