Recognition: 2 theorem links · Lean Theorem
SnapKV: LLM Knows What You are Looking for Before Generation
Pith reviewed 2026-05-13 12:52 UTC · model grok-4.3
The pith
Each attention head in an LLM shows a consistent focus on specific prompt features, and that focus can be captured from a short observation window at the end of the input, letting SnapKV compress the KV cache by selecting clustered important positions.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head, drawing on the robust attention pattern that appears inside an observation window at the end of the prompt. It delivers a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency on 16K-token inputs while keeping performance comparable to the baseline across 16 long-sequence datasets.
What carries the argument
Observation window at the end of the prompt that reveals consistent per-head attention patterns, from which clustered important KV positions are selected for compression
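The selection mechanism can be sketched in a few lines. The snippet below is an illustrative reconstruction, not the authors' released implementation: per-head attention scores from the observation-window queries are summed over the window, smoothed with 1-D max pooling so selections stay clustered, and the top-scoring prompt positions (plus the window itself) are kept. The names and the `kernel` pooling width are hypothetical stand-ins for the method's free parameters.

```python
import numpy as np

def snapkv_select(attn, budget, kernel=7):
    """Illustrative SnapKV-style KV selection (a sketch, not the paper's code).

    attn:   [heads, window, prompt_len] attention weights from the last
            `window` prompt queries (the observation window) to all prompt keys
    budget: number of prompt KV positions to keep per head
    kernel: width of the 1-D max pooling that keeps selections clustered
    """
    heads, window, prompt_len = attn.shape
    # Aggregate each head's window attention into one score per key position.
    scores = attn.sum(axis=1)                          # [heads, prompt_len]
    # 1-D max pooling: neighbours of a hot position also score highly,
    # so the retained positions form contiguous clusters.
    pad = kernel // 2
    padded = np.pad(scores, ((0, 0), (pad, pad)), mode="edge")
    pooled = np.stack(
        [padded[:, i:i + prompt_len] for i in range(kernel)]
    ).max(axis=0)                                      # [heads, prompt_len]
    # The observation window itself is always retained.
    pooled[:, prompt_len - window:] = np.inf
    keep = np.argsort(-pooled, axis=1)[:, :budget]     # [heads, budget]
    return np.sort(keep, axis=1)
```

Generation then attends only to the retained entries per head, which is where the memory and speed gains come from; the real implementation performs this selection once on the cached K/V tensors before decoding begins.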
If this is right
- 3.6x faster generation with consistent decoding speed on 16K-token inputs
- 8.2x better memory efficiency than the baseline for the same inputs
- Comparable performance retained across 16 long-sequence datasets
- Ability to process up to 380K context tokens on one A100-80GB GPU with only negligible accuracy drop in needle-in-haystack tests
Where Pith is reading between the lines
- The same end-window observation could be tested as a lightweight way to prune other transformer components beyond the KV cache
- If the per-head consistency holds across model families, SnapKV-style selection might combine with quantization for even larger context gains
- Real-time applications on edge devices could use the reduced memory footprint to support longer inputs without hardware upgrades
- The approach hints that early attention signals in generation may be predictable enough to guide other runtime optimizations
Load-bearing premise
The attention pattern visible in a short window at the very end of the prompt is enough to identify which KV entries will matter for the full generation that follows.
What would settle it
Running the method on a task where the model later attends to prompt sections outside the patterns captured by the end window, producing a clear drop in accuracy after compression
Original abstract
Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an 'observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to the baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SnapKV, a fine-tuning-free method to compress the KV cache in LLMs. It observes that each attention head exhibits consistent focus on specific prompt features, which can be captured from a short 'observation' window at the end of the prompt. SnapKV then selects clustered important KV positions per head from the full prompt cache. This yields a claimed 3.6x increase in generation speed and 8.2x memory efficiency for 16K-token inputs while maintaining comparable performance to baselines across 16 long-sequence datasets; it also enables processing up to 380K contexts on a single A100-80GB GPU with only negligible accuracy drop on Needle-in-a-Haystack.
Significance. If the core empirical observation holds and generalizes, the work offers a practical, training-free route to longer-context inference on commodity hardware. The reported speed and memory gains at 16K tokens, combined with the ability to reach 380K contexts, would be directly useful for deployment. The per-head clustering insight is a concrete contribution that could inspire further cache-management techniques.
major comments (2)
- [§3 and §4.2] §3 (Method) and §4.2 (Experiments): the central claim that the one-time selection from the observation window suffices for the entire autoregressive generation rests on the untested assumption that attention patterns do not shift materially once generation begins. No ablation is reported that inserts query shifts (e.g., multi-hop reasoning or dispersed-fact retrieval) to measure whether critical KV entries are dropped; the 3.6×/8.2× figures therefore cannot be taken as robust without such evidence.
- [Table 2 and §4.1] Table 2 and §4.1: performance is reported as 'comparable' across 16 datasets, yet no error bars, standard deviations, or number of runs are supplied, nor are exact baseline KV-cache implementations (FlashAttention version, precision, etc.) detailed. This makes it impossible to judge whether the observed differences are statistically meaningful or merely within run-to-run variance.
minor comments (2)
- [§3.2] §3.2: the clustering procedure (threshold or k value) is described only at a high level; an explicit algorithm box or pseudocode would clarify reproducibility.
- [Figure 3] Figure 3: the attention heatmaps are useful but lack axis labels indicating token indices and head numbers; adding these would improve readability.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to strengthen the claims and improve clarity.
Point-by-point responses
-
Referee: [§3 and §4.2] §3 (Method) and §4.2 (Experiments): the central claim that the one-time selection from the observation window suffices for the entire autoregressive generation rests on the untested assumption that attention patterns do not shift materially once generation begins. No ablation is reported that inserts query shifts (e.g., multi-hop reasoning or dispersed-fact retrieval) to measure whether critical KV entries are dropped; the 3.6×/8.2× figures therefore cannot be taken as robust without such evidence.
Authors: We appreciate this observation. Our method is based on the consistent focus patterns observed in the attention heads, which we found to hold across the generation process in our evaluations. To directly address the concern, we will add an ablation study in the revised version involving tasks with significant query shifts, such as multi-hop reasoning and long-context retrieval with dispersed facts. This will confirm whether any critical KV entries are missed and support the reported efficiency gains. revision: yes
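The requested ablation reduces to one measurable quantity: at each decoding step, how much of the model's attention mass over the prompt actually lands on the positions the observation window selected. A hedged diagnostic sketch (the helper name and shapes are hypothetical, not from the paper):

```python
import numpy as np

def kv_hit_rate(gen_attn, kept):
    """Fraction of generation-time attention mass over the prompt that
    falls on the retained KV positions, per head (a diagnostic sketch).

    gen_attn: [steps, heads, prompt_len] attention of generated tokens
              back onto the original prompt
    kept:     [heads, budget] prompt positions retained by the compressor
    """
    steps, heads, prompt_len = gen_attn.shape
    mask = np.zeros((heads, prompt_len), dtype=bool)
    for h in range(heads):
        mask[h, kept[h]] = True
    # mask broadcasts over the steps axis.
    covered = (gen_attn * mask).sum(axis=(0, 2))
    total = gen_attn.sum(axis=(0, 2))
    return covered / total
```

A hit rate that stays near 1.0 on multi-hop or dispersed-fact tasks would support the load-bearing premise; a drop would localize exactly where the end-window heuristic fails.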
-
Referee: [Table 2 and §4.1] Table 2 and §4.1: performance is reported as 'comparable' across 16 datasets, yet no error bars, standard deviations, or number of runs are supplied, nor are exact baseline KV-cache implementations (FlashAttention version, precision, etc.) detailed. This makes it impossible to judge whether the observed differences are statistically meaningful or merely within run-to-run variance.
Authors: We concur that additional statistical information and implementation details are necessary. In the updated manuscript, we will provide error bars and standard deviations for the results in Table 2, based on multiple independent runs. We will also detail the exact baseline configurations, including the FlashAttention version used, precision settings, and other relevant parameters in Section 4.1. revision: yes
Circularity Check
No circularity: empirical observation applied directly as heuristic
Full rationale
The paper's chain begins with an empirical discovery ('each attention head in the model consistently focuses on specific prompt attention features during generation' and 'this robust pattern can be obtained from an observation window located at the end of the prompts') and proceeds to a direct rule ('SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head'). No equations, fitted parameters, or self-citations are invoked to derive the selection rule; the method is a one-time heuristic extracted from the observation window and applied statically. Performance numbers (3.6x speed, 8.2x memory) are reported from experiments on 16 datasets rather than forced by construction. The central claim therefore remains independent of its inputs and does not reduce to self-definition, renaming, or load-bearing self-citation.
Axiom & Free-Parameter Ledger
free parameters (2)
- observation window length
- clustering threshold or k
axioms (1)
- domain assumption: Each attention head focuses on a consistent set of prompt features throughout generation.
Lean theorems connected to this paper
-
IndisputableMonolith.Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · unclear · "SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency"
Forward citations
Cited by 22 Pith papers
-
VORT: Adaptive Power-Law Memory for NLP Transformers
VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.
-
MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference
MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.
-
Long Context Pre-Training with Lighthouse Attention
Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...
-
OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving
OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...
-
Neural Garbage Collection: Learning to Forget while Learning to Reason
Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.
-
How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...
-
Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
-
Transactional Attention: Semantic Sponsorship for KV-Cache Retention
Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
-
Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit
Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.
-
KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
-
KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
-
Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
-
UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification
UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.
-
Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving
DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.
-
SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models
SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.
-
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.
-
How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment
Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.
-
Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference
A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.
-
HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
-
Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference
Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...
-
BFLA: Block-Filtered Long-Context Attention Mechanism
BFLA is a two-stage block-filtered sparse prefill attention mechanism that constructs an input-dependent block mask and applies tile-level rescues to skip unimportant KV tiles while preserving exact attention inside r...
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023
-
[2]
Command r: Retrieval-augmented generation at production scale, March 2024
Cohere. Command r: Retrieval-augmented generation at production scale, March 2024. URL https://txt.cohere.com/command-r
-
[3]
The claude 3 model family: Opus, sonnet, haiku, March 2024
Anthropic. The claude 3 model family: Opus, sonnet, haiku, March 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
-
[4]
Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context
Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024
-
[5]
Efficient Streaming Language Models with Attention Sinks
Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023
-
[6]
H2o: Heavy-hitter oracle for efficient generative inference of large language models
Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024
-
[7]
Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024
-
[8]
Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023
-
[9]
Lifelong and continual learning dialogue systems: learning during conversation
Bing Liu and Sahisnu Mazumder. Lifelong and continual learning dialogue systems: learning during conversation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 15058–15063, 2021
-
[10]
Codeplan: Repository-level coding using llms and planning
Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok, Shashank Shet, et al. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499, 2023
-
[11]
Enhancing chat language models by scaling high-quality instructional conversations
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023
-
[12]
Qmsum: A new benchmark for query-based multi-domain meeting summarization
Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938, 2021
-
[13]
L-Eval: Instituting standardized evaluation for long context language models
Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023
-
[14]
Extractive opinion summarization in quantized transformer spaces
Stefanos Angelidis, Reinald Kim Amplayo, Yoshihiko Suhara, Xiaolan Wang, and Mirella Lapata. Extractive opinion summarization in quantized transformer spaces. Transactions of the Association for Computational Linguistics, 9:277–293, 2021
-
[15]
In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022
-
[16]
World model on million-length video and language with ringattention
Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024
-
[17]
LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding
Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023
-
[18]
Needle in a haystack–pressure testing llms, 2023
G Kamradt. Needle in a haystack–pressure testing llms, 2023
-
[19]
Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023
-
[20]
Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. Llm maybe longlm: Self-extend llm context window without tuning. arXiv preprint arXiv:2401.01325, 2024
-
[21]
Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Salvador Lima López, Eulália Farré-Maduell, Luis Gasco, Martin Krallinger, and Georgios Paliouras. Overview of bioasq 2023: The eleventh bioasq challenge on large-scale biomedical semantic indexing and question answering, 2023
-
[22]
HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering
Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018
-
[23]
Long context prompting for claude 2.1, December 2023
Anthropic. Long context prompting for claude 2.1, December 2023. URL https://www.anthropic.com/news/claude-2-1-prompting
-
[24]
Lost in the middle: How language models use long contexts
Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024
-
[25]
Retrieval augmented generation (rag), 2023
Cohere. Retrieval augmented generation (rag), 2023. URL https://docs.cohere.com/docs/retrieval-augmented-generation-rag
-
[26]
Cohere. Cohere embed, 2023. URL https://docs.cohere.com/reference/embed
-
[27]
Cohere. Cohere rerank, 2023. URL https://docs.cohere.com/docs/rerank-guide
-
[28]
Blockwise parallel decoding for deep autoregressive models
Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018
-
[29]
Fast inference from transformers via speculative decoding
Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023
-
[30]
Accelerating Large Language Model Decoding with Speculative Sampling
Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023
-
[31]
Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023
-
[32]
Recurrent drafter for fast speculative decoding in large language models
Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, and Yunfei Cheng. Recurrent drafter for fast speculative decoding in large language models. arXiv preprint arXiv:2403.09919, 2024
-
[33]
Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022
-
[34]
FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning
Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023
-
[35]
Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads
Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024
-
[36]
Flash-decoding for long-context inference, 2023
Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-decoding for long-context inference, 2023
-
[37]
Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021