pith. machine review for the scientific record.

arxiv: 2404.14469 · v2 · submitted 2024-04-22 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

SnapKV: LLM Knows What You are Looking for Before Generation

Acyr Locatelli, Bharat Venkitesh, Bowen Yang, Deming Chen, Hanchen Ye, Patrick Lewis, Tianle Cai, Yingbing Huang, Yuhong Li

Pith reviewed 2026-05-13 12:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords KV cache compression · LLM efficiency · attention patterns · long context · memory optimization · generation speed · SnapKV · context window

The pith

Each attention head in an LLM shows a consistent focus on specific prompt features that can be captured from a short observation window at the end of the input, letting SnapKV compress the KV cache by selecting clustered important positions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SnapKV is a fine-tuning-free method that shrinks the Key-Value cache in large language models during text generation. The core finding is that every attention head maintains a stable pattern of which parts of the prompt it attends to, and this pattern shows up reliably inside a brief window placed at the very end of the prompt. Using that window, the method picks out the most relevant KV entries in clusters for each head and discards the rest. The result is faster decoding and much lower memory use on long inputs, with output quality staying close to the full-cache version. This lets a single GPU handle contexts up to 380,000 tokens with only minor accuracy loss on needle-in-haystack checks.
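A minimal sketch of that selection step, to make the mechanism concrete; this is not the authors' released code. It assumes plain scaled dot-product attention, treats the last `window` prompt tokens as the observation window, and scores each earlier position by the attention mass it receives from those window queries. Function and parameter names (`select_kv_positions`, `window`, `budget`) are hypothetical, and the clustering of neighbouring positions is deferred to a second sketch further down, next to the referee's minor comments.

```python
import torch


def select_kv_positions(q, k, window=32, budget=1024):
    """Return a [heads, seq_len] boolean mask of KV positions to keep for one layer.

    q, k: [heads, seq_len, head_dim] prompt-time queries and keys.
    """
    heads, seq_len, dim = k.shape
    if seq_len <= budget:                         # short prompt: keep everything
        return torch.ones(heads, seq_len, dtype=torch.bool)

    prefix_len = seq_len - window
    q_obs = q[:, -window:, :]                     # observation-window queries
    attn = torch.softmax(                         # [heads, window, prefix_len]
        q_obs @ k[:, :prefix_len, :].transpose(1, 2) / dim ** 0.5, dim=-1
    )
    importance = attn.sum(dim=1)                  # attention mass per prefix position

    keep = torch.zeros(heads, seq_len, dtype=torch.bool)
    top = importance.topk(budget - window, dim=-1).indices
    keep[torch.arange(heads).unsqueeze(1), top] = True   # per-head important positions
    keep[:, -window:] = True                      # the observation window is always kept
    return keep
```

Everything outside the returned mask would be dropped from that layer's KV cache before decoding begins; the model itself is untouched, which is what makes the approach fine-tuning-free.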

Core claim

SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head, drawing on the robust attention pattern that appears inside an observation window located at the end of the prompt. On 16K-token inputs this delivers a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency while keeping performance comparable to the baseline across 16 long-sequence datasets.

What carries the argument

Observation window at the end of the prompt that reveals consistent per-head attention patterns, from which clustered important KV positions are selected for compression

If this is right

  • 3.6x faster generation with a consistent decoding speed on 16K-token inputs
  • 8.2x better memory efficiency than the baseline for the same inputs
  • Comparable performance retained across 16 long-sequence datasets
  • Ability to process up to 380K context tokens on one A100-80GB GPU with only a negligible accuracy drop in needle-in-haystack tests (a back-of-envelope memory estimate follows this list)
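
To see why the 380K-token figure requires compression at all, a rough cache estimate under an assumed Llama-2-7B-like layout (32 layers, 32 KV heads, head dimension 128, fp16); these numbers are illustrative, not drawn from the paper.

```python
# KV bytes per token = 2 (K and V) x layers x kv_heads x head_dim x bytes/value
layers, kv_heads, head_dim, bytes_per_val = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
tokens = 380_000
print(f"{per_token} bytes/token -> {per_token * tokens / 2**30:.0f} GiB at 380K tokens")
# ~185 GiB: far beyond a single A100-80GB even before model weights, so a roughly
# order-of-magnitude cache reduction is what makes the claim plausible. Grouped-query
# attention (fewer KV heads) shrinks the estimate proportionally.
```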

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same end-window observation could be tested as a lightweight way to prune other transformer components beyond the KV cache
  • If the per-head consistency holds across model families, SnapKV-style selection might combine with quantization for even larger context gains
  • Real-time applications on edge devices could use the reduced memory footprint to support longer inputs without hardware upgrades
  • The approach hints that early attention signals in generation may be predictable enough to guide other runtime optimizations

Load-bearing premise

The attention pattern visible in a short window at the very end of the prompt is enough to identify which KV entries will matter for the full generation that follows.

What would settle it

Running the method on a task where the model later attends to prompt sections outside the patterns captured by the end window, producing a clear drop in accuracy after compression
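
One hedged way to set up that test: decode with the full cache, record where each generated token actually attends, and measure how much of that attention mass falls on positions the end-window selection would have kept. The helper below is a sketch with random tensors standing in for a real model trace; the ~90% threshold mentioned in the comment is an assumption, not a number from the paper.

```python
import torch


def coverage_during_generation(keep_mask, gen_attn):
    """keep_mask: [heads, seq_len] bool, prompt-time selection.
    gen_attn:  [heads, gen_steps, seq_len] attention over prompt positions,
               recorded while decoding with the full (uncompressed) cache.
    Returns the fraction of generation-time attention mass on kept positions."""
    mass_on_kept = (gen_attn * keep_mask[:, None, :].float()).sum()
    return (mass_on_kept / gen_attn.sum()).item()


# Toy usage; a real run would take keep_mask from the selection step and
# gen_attn from instrumented decoding on a query-shift task (e.g. multi-hop QA).
heads, seq_len, gen_steps = 8, 512, 64
keep_mask = torch.rand(heads, seq_len) > 0.7
gen_attn = torch.softmax(torch.randn(heads, gen_steps, seq_len), dim=-1)
print(f"coverage: {coverage_during_generation(keep_mask, gen_attn):.2%}")
# Coverage falling well below ~90% on such tasks, together with an accuracy drop
# after compression, would be the clear failure signal described above.
```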

original abstract

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an 'observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to the baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SnapKV, a fine-tuning-free method to compress the KV cache in LLMs. It observes that each attention head exhibits a consistent focus on specific prompt features, which can be captured from a short 'observation' window at the end of the prompt. SnapKV then selects clustered important KV positions per head from the full prompt cache. This yields a claimed 3.6x increase in generation speed and an 8.2x gain in memory efficiency for 16K-token inputs while maintaining comparable performance to baselines across 16 long-sequence datasets; it also enables processing contexts of up to 380K tokens on a single A100-80GB GPU with only a negligible accuracy drop on Needle-in-a-Haystack.

Significance. If the core empirical observation holds and generalizes, the work offers a practical, training-free route to longer-context inference on commodity hardware. The reported speed and memory gains at 16K tokens, combined with the ability to reach 380K contexts, would be directly useful for deployment. The per-head clustering insight is a concrete contribution that could inspire further cache-management techniques.

major comments (2)
  1. [§3, §4.2] The central claim that the one-time selection from the observation window suffices for the entire autoregressive generation rests on the untested assumption that attention patterns do not shift materially once generation begins. No ablation is reported that inserts query shifts (e.g., multi-hop reasoning or dispersed-fact retrieval) to measure whether critical KV entries are dropped; the 3.6x/8.2x figures therefore cannot be taken as robust without such evidence.
  2. [Table 2, §4.1] Performance is reported as 'comparable' across 16 datasets, yet no error bars, standard deviations, or number of runs are supplied, nor are the exact baseline KV-cache implementations (FlashAttention version, precision, etc.) detailed. This makes it impossible to judge whether the observed differences are statistically meaningful or merely within run-to-run variance.
minor comments (2)
  1. [§3.2] The clustering procedure (threshold or k value) is described only at a high level; an explicit algorithm box or pseudocode would clarify reproducibility (a hedged sketch of one possible form follows this list).
  2. [Figure 3] The attention heatmaps are useful but lack axis labels indicating token indices and head numbers; adding these would improve readability.
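
On the first minor comment, a hedged sketch of one plausible form of the clustering step: smooth the per-position importance scores with 1-D pooling so that neighbouring positions are kept or dropped together, then take the top-k of the smoothed scores. The kernel size and the use of max pooling are illustrative choices, not settings confirmed from the paper.

```python
import torch
import torch.nn.functional as F


def clustered_topk(importance, k, kernel=7):
    """importance: [heads, prefix_len] per-position scores from the observation window.
    Returns the indices of the k positions per head after neighbourhood pooling
    (an odd kernel with padding kernel // 2 keeps the sequence length unchanged)."""
    pooled = F.max_pool1d(
        importance.unsqueeze(1), kernel_size=kernel, stride=1, padding=kernel // 2
    ).squeeze(1)
    return pooled.topk(k, dim=-1).indices


# Toy check: a single spike at index 2. After pooling, its neighbours 1 and 3 also
# score high, so the spike is retained together with its neighbourhood (a "cluster").
scores = torch.tensor([[0.0, 0.0, 5.0, 0.0, 0.0, 2.0, 2.5, 2.0, 0.0, 0.0]])
print(clustered_topk(scores, k=4, kernel=3))
```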

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to strengthen the claims and improve clarity.

point-by-point responses
  1. Referee: [§3, §4.2] The central claim that the one-time selection from the observation window suffices for the entire autoregressive generation rests on the untested assumption that attention patterns do not shift materially once generation begins. No ablation is reported that inserts query shifts (e.g., multi-hop reasoning or dispersed-fact retrieval) to measure whether critical KV entries are dropped; the 3.6x/8.2x figures therefore cannot be taken as robust without such evidence.

    Authors: We appreciate this observation. Our method is based on the consistent focus patterns observed in the attention heads, which we found to hold across the generation process in our evaluations. To directly address the concern, we will add an ablation study in the revised version involving tasks with significant query shifts, such as multi-hop reasoning and long-context retrieval with dispersed facts. This will confirm if any critical KV entries are missed and support the reported efficiency gains. revision: yes

  2. Referee: [Table 2, §4.1] Performance is reported as 'comparable' across 16 datasets, yet no error bars, standard deviations, or number of runs are supplied, nor are the exact baseline KV-cache implementations (FlashAttention version, precision, etc.) detailed. This makes it impossible to judge whether the observed differences are statistically meaningful or merely within run-to-run variance.

    Authors: We concur that additional statistical information and implementation details are necessary. In the updated manuscript, we will provide error bars and standard deviations for the results in Table 2, based on multiple independent runs. We will also detail the exact baseline configurations, including the FlashAttention version used, precision settings, and other relevant parameters in Section 4.1. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation applied directly as heuristic

full rationale

The paper's chain begins with an empirical discovery ('each attention head in the model consistently focuses on specific prompt attention features during generation' and 'this robust pattern can be obtained from an observation window located at the end of the prompts') and proceeds to a direct rule ('SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head'). No equations, fitted parameters, or self-citations are invoked to derive the selection rule; the method is a one-time heuristic extracted from the observation window and applied statically. Performance numbers (3.6x speed, 8.2x memory) are reported from experiments on 16 datasets rather than forced by construction. The central claim therefore remains independent of its inputs and does not reduce to self-definition, renaming, or load-bearing self-citation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical discovery that attention heads maintain stable focus patterns visible in a short end-of-prompt window; no free parameters are explicitly named in the abstract, but window size and clustering thresholds are implicitly chosen.

free parameters (2)
  • observation window length
    Length of the suffix used to observe attention patterns; must be chosen to capture the stable focus without including too much generation noise.
  • clustering threshold or k
    Parameter controlling how many or which clustered positions are retained per head.
axioms (1)
  • domain assumption: Each attention head focuses on a consistent set of prompt features throughout generation.
    Stated directly in the abstract as the key discovery enabling the compression rule.
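
For concreteness, the ledger's knobs collected into one illustrative configuration; the values are placeholders rather than the paper's reported settings, and the pooling kernel is a further knob implied by the clustering step rather than named in the ledger.

```python
from dataclasses import dataclass


@dataclass
class SnapKVConfig:
    observation_window: int = 32   # suffix length used to read attention patterns
    kv_budget: int = 1024          # clustered positions kept per head (the "k")
    pool_kernel: int = 7           # neighbourhood size used when clustering

    def retained_fraction(self, prompt_len: int) -> float:
        """Fraction of the prompt KV cache kept for one head."""
        return min(1.0, self.kv_budget / prompt_len)


cfg = SnapKVConfig()
print(f"16K prompt -> {cfg.retained_fraction(16_000):.1%} of the cache kept per head")
```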

pith-pipeline@v0.9.0 · 5591 in / 1391 out tokens · 43677 ms · 2026-05-13T12:52:55.319443+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VORT: Adaptive Power-Law Memory for NLP Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  2. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  3. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  4. OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 7.0

    OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...

  5. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  6. How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...

  7. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  8. Transactional Attention: Semantic Sponsorship for KV-Cache Retention

    cs.CL 2026-04 unverdicted novelty 7.0

    Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

  9. Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

  10. KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

    cs.LG 2026-05 conditional novelty 6.0

    KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.

  11. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

    cs.AR 2026-05 unverdicted novelty 6.0

    KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

  12. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  13. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  14. UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

    cs.CL 2026-05 unverdicted novelty 6.0

    UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.

  15. Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.

  16. SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.

  17. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    cs.CL 2024-06 conditional novelty 6.0

    PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.

  18. How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

    cs.LG 2026-05 unverdicted novelty 5.0

    Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.

  19. Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

    cs.AR 2026-04 unverdicted novelty 5.0

    A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.

  20. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  21. Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...

  22. BFLA: Block-Filtered Long-Context Attention Mechanism

    eess.SP 2026-05 unverdicted novelty 4.0

    BFLA is a two-stage block-filtered sparse prefill attention mechanism that constructs an input-dependent block mask and applies tile-level rescues to skip unimportant KV tiles while preserving exact attention inside r...

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 21 Pith papers · 9 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Command r: Retrieval-augmented generation at production scale, March 2024

    Cohere. Command r: Retrieval-augmented generation at production scale, March 2024. URL https://txt.cohere.com/command-r

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku, March 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, March 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  4. [4]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  5. [5]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  6. [6]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024

  7. [7]

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024

  8. [8]

    Model tells you what to discard: Adaptive kv cache compression for llms

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023

  9. [9]

    Lifelong and continual learning dialogue systems: learning during conversation

    Bing Liu and Sahisnu Mazumder. Lifelong and continual learning dialogue systems: learning during conversation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 15058–15063, 2021

  10. [10]

    Codeplan: Repository-level coding using llms and planning

    Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok, Shashank Shet, et al. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499, 2023

  11. [11]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

  12. [12]

    Qmsum: A new benchmark for query-based multi-domain meeting summarization

    Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938, 2021

  13. [13]

    L-Eval: Instituting standardized evaluation for long context language models

    Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023

  14. [14]

    Extractive opinion summarization in quantized transformer spaces

    Stefanos Angelidis, Reinald Kim Amplayo, Yoshihiko Suhara, Xiaolan Wang, and Mirella Lapata. Extractive opinion summarization in quantized transformer spaces. Transactions of the Association for Computational Linguistics, 9:277–293, 2021

  15. [15]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  16. [16]

    World model on million-length video and language with ringattention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024

  17. [17]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

  18. [18]

    Needle in a haystack–pressure testing llms, 2023

    G Kamradt. Needle in a haystack–pressure testing llms, 2023

  19. [19]

    How long can context length of open-source llms truly promise?

    Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

  20. [20]

    Llm maybe longlm: Self-extend llm context window without tuning

    Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. Llm maybe longlm: Self-extend llm context window without tuning. arXiv preprint arXiv:2401.01325, 2024

  21. [21]

    Overview of bioasq 2023: The eleventh bioasq challenge on large-scale biomedical semantic indexing and question answering, 2023

    Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Salvador Lima López, Eulália Farré-Maduell, Luis Gasco, Martin Krallinger, and Georgios Paliouras. Overview of bioasq 2023: The eleventh bioasq challenge on large-scale biomedical semantic indexing and question answering, 2023

  22. [22]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  23. [23]

    Long context prompting for claude 2.1, December 2023

    Anthropic. Long context prompting for claude 2.1, December 2023. URL https://www.anthropic.com/news/claude-2-1-prompting

  24. [24]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  25. [25]

    Retrieval augmented generation (rag), 2023

    Cohere. Retrieval augmented generation (rag), 2023. URL https://docs.cohere.com/docs/retrieval-augmented-generation-rag

  26. [26]

    Cohere embed, 2023

    Cohere. Cohere embed, 2023. URL https://docs.cohere.com/reference/embed

  27. [27]

    Cohere rerank, 2023

    Cohere. Cohere rerank, 2023. URL https://docs.cohere.com/docs/rerank-guide

  28. [28]

    Blockwise parallel decoding for deep autoregressive models

    Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018

  29. [29]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

  30. [30]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023

  31. [31]

    Specinfer: Accelerating generative llm serving with speculative inference and token tree verification

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023

  32. [32]

    Recurrent drafter for fast speculative decoding in large language models

    Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, and Yunfei Cheng. Recurrent drafter for fast speculative decoding in large language models. arXiv preprint arXiv:2403.09919, 2024

  33. [33]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  34. [34]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  35. [35]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

  36. [36]

    Flash-decoding for long-context inference, 2023

    Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-decoding for long-context inference, 2023

  37. [37]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021