pith. machine review for the scientific record.

arxiv: 2404.14469 · v2 · submitted 2024-04-22 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links · Lean Theorem

SnapKV: LLM Knows What You are Looking for Before Generation

Acyr Locatelli, Bharat Venkitesh, Bowen Yang, Deming Chen, Hanchen Ye, Patrick Lewis, Tianle Cai, Yingbing Huang, Yuhong Li

Pith reviewed 2026-05-13 12:52 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords KV cache compression · LLM efficiency · attention patterns · long context · memory optimization · generation speed · SnapKV · context window

The pith

Each attention head in an LLM shows a consistent focus on specific prompt features that can be captured from a short observation window at the end of the input, letting SnapKV compress the KV cache by selecting clustered important positions

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

SnapKV is a fine-tuning-free method that shrinks the Key-Value cache in large language models during text generation. The core finding is that every attention head maintains a stable pattern of which parts of the prompt it attends to, and this pattern shows up reliably inside a brief window placed at the very end of the prompt. Using that window, the method picks out the most relevant KV entries in clusters for each head and discards the rest. The result is faster decoding and much lower memory use on long inputs, with output quality staying close to the full-cache version. This lets a single GPU handle contexts up to 380,000 tokens with only minor accuracy loss on needle-in-haystack checks.
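A minimal sketch of that selection step, to make the mechanism concrete; this is not the authors' released code. It assumes plain scaled dot-product attention, treats the last `window` prompt tokens as the observation window, and scores each earlier position by the attention mass it receives from those window queries. Function and parameter names (`select_kv_positions`, `window`, `budget`) are hypothetical, and the clustering of neighbouring positions is deferred to a second sketch further down, next to the referee's minor comments.

```python
import torch


def select_kv_positions(q, k, window=32, budget=1024):
    """Return a [heads, seq_len] boolean mask of KV positions to keep for one layer.

    q, k: [heads, seq_len, head_dim] prompt-time queries and keys.
    """
    heads, seq_len, dim = k.shape
    if seq_len <= budget:                         # short prompt: keep everything
        return torch.ones(heads, seq_len, dtype=torch.bool)

    prefix_len = seq_len - window
    q_obs = q[:, -window:, :]                     # observation-window queries
    attn = torch.softmax(                         # [heads, window, prefix_len]
        q_obs @ k[:, :prefix_len, :].transpose(1, 2) / dim ** 0.5, dim=-1
    )
    importance = attn.sum(dim=1)                  # attention mass per prefix position

    keep = torch.zeros(heads, seq_len, dtype=torch.bool)
    top = importance.topk(budget - window, dim=-1).indices
    keep[torch.arange(heads).unsqueeze(1), top] = True   # per-head important positions
    keep[:, -window:] = True                      # the observation window is always kept
    return keep
```

Everything outside the returned mask would be dropped from that layer's KV cache before decoding begins; the model itself is untouched, which is what makes the approach fine-tuning-free.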

Core claim

SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head, drawing on the robust attention pattern that appears inside an observation window located at the end of the prompt. On 16K-token inputs this delivers a 3.6x increase in generation speed and an 8.2x improvement in memory efficiency while keeping performance comparable to the baseline across 16 long-sequence datasets.

What carries the argument

Observation window at the end of the prompt that reveals consistent per-head attention patterns, from which clustered important KV positions are selected for compression

If this is right

  • 3.6x faster generation with a consistent decoding speed on 16K-token inputs
  • 8.2x better memory efficiency than the baseline for the same inputs
  • Comparable performance retained across 16 long-sequence datasets
  • Ability to process up to 380K context tokens on one A100-80GB GPU with only a negligible accuracy drop in needle-in-haystack tests (a back-of-envelope memory estimate follows this list)
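
To see why the 380K-token figure requires compression at all, a rough cache estimate under an assumed Llama-2-7B-like layout (32 layers, 32 KV heads, head dimension 128, fp16); these numbers are illustrative, not drawn from the paper.

```python
# KV bytes per token = 2 (K and V) x layers x kv_heads x head_dim x bytes/value
layers, kv_heads, head_dim, bytes_per_val = 32, 32, 128, 2
per_token = 2 * layers * kv_heads * head_dim * bytes_per_val
tokens = 380_000
print(f"{per_token} bytes/token -> {per_token * tokens / 2**30:.0f} GiB at 380K tokens")
# ~185 GiB: far beyond a single A100-80GB even before model weights, so a roughly
# order-of-magnitude cache reduction is what makes the claim plausible. Grouped-query
# attention (fewer KV heads) shrinks the estimate proportionally.
```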

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same end-window observation could be tested as a lightweight way to prune other transformer components beyond the KV cache
  • If the per-head consistency holds across model families, SnapKV-style selection might combine with quantization for even larger context gains
  • Real-time applications on edge devices could use the reduced memory footprint to support longer inputs without hardware upgrades
  • The approach hints that early attention signals in generation may be predictable enough to guide other runtime optimizations

Load-bearing premise

The attention pattern visible in a short window at the very end of the prompt is enough to identify which KV entries will matter for the full generation that follows.

What would settle it

Running the method on a task where the model later attends to prompt sections outside the patterns captured by the end window, producing a clear drop in accuracy after compression
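
One hedged way to set up that test: decode with the full cache, record where each generated token actually attends, and measure how much of that attention mass falls on positions the end-window selection would have kept. The helper below is a sketch with random tensors standing in for a real model trace; the ~90% threshold mentioned in the comment is an assumption, not a number from the paper.

```python
import torch


def coverage_during_generation(keep_mask, gen_attn):
    """keep_mask: [heads, seq_len] bool, prompt-time selection.
    gen_attn:  [heads, gen_steps, seq_len] attention over prompt positions,
               recorded while decoding with the full (uncompressed) cache.
    Returns the fraction of generation-time attention mass on kept positions."""
    mass_on_kept = (gen_attn * keep_mask[:, None, :].float()).sum()
    return (mass_on_kept / gen_attn.sum()).item()


# Toy usage; a real run would take keep_mask from the selection step and
# gen_attn from instrumented decoding on a query-shift task (e.g. multi-hop QA).
heads, seq_len, gen_steps = 8, 512, 64
keep_mask = torch.rand(heads, seq_len) > 0.7
gen_attn = torch.softmax(torch.randn(heads, gen_steps, seq_len), dim=-1)
print(f"coverage: {coverage_during_generation(keep_mask, gen_attn):.2%}")
# Coverage falling well below ~90% on such tasks, together with an accuracy drop
# after compression, would be the clear failure signal described above.
```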

original abstract

Large Language Models (LLMs) have made remarkable progress in processing extensive contexts, with the Key-Value (KV) cache playing a vital role in enhancing their performance. However, the growth of the KV cache in response to increasing input length poses challenges to memory and time efficiency. To address this problem, this paper introduces SnapKV, an innovative and fine-tuning-free approach that efficiently minimizes KV cache size while still delivering comparable performance in real-world applications. We discover that each attention head in the model consistently focuses on specific prompt attention features during generation. Meanwhile, this robust pattern can be obtained from an 'observation' window located at the end of the prompts. Drawing on this insight, SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head. Our approach significantly reduces the growing computational overhead and memory footprint when processing long input sequences. Specifically, SnapKV achieves a consistent decoding speed with a 3.6x increase in generation speed and an 8.2x enhancement in memory efficiency compared to the baseline when processing inputs of 16K tokens. At the same time, it maintains comparable performance to the baseline models across 16 long sequence datasets. Moreover, SnapKV can process up to 380K context tokens on a single A100-80GB GPU using HuggingFace implementation with minor changes, exhibiting only a negligible accuracy drop in the Needle-in-a-Haystack test. Further comprehensive studies suggest SnapKV's potential for practical applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SnapKV, a fine-tuning-free method to compress the KV cache in LLMs. It observes that each attention head exhibits a consistent focus on specific prompt features, which can be captured from a short 'observation' window at the end of the prompt. SnapKV then selects clustered important KV positions per head from the full prompt cache. This yields a claimed 3.6x increase in generation speed and an 8.2x gain in memory efficiency for 16K-token inputs while maintaining comparable performance to baselines across 16 long-sequence datasets; it also enables processing contexts of up to 380K tokens on a single A100-80GB GPU with only a negligible accuracy drop on Needle-in-a-Haystack.

Significance. If the core empirical observation holds and generalizes, the work offers a practical, training-free route to longer-context inference on commodity hardware. The reported speed and memory gains at 16K tokens, combined with the ability to reach 380K contexts, would be directly useful for deployment. The per-head clustering insight is a concrete contribution that could inspire further cache-management techniques.

major comments (2)
  1. [§3, §4.2] The central claim that the one-time selection from the observation window suffices for the entire autoregressive generation rests on the untested assumption that attention patterns do not shift materially once generation begins. No ablation is reported that inserts query shifts (e.g., multi-hop reasoning or dispersed-fact retrieval) to measure whether critical KV entries are dropped; the 3.6x/8.2x figures therefore cannot be taken as robust without such evidence.
  2. [Table 2, §4.1] Performance is reported as 'comparable' across 16 datasets, yet no error bars, standard deviations, or number of runs are supplied, nor are the exact baseline KV-cache implementations (FlashAttention version, precision, etc.) detailed. This makes it impossible to judge whether the observed differences are statistically meaningful or merely within run-to-run variance.
minor comments (2)
  1. [§3.2] The clustering procedure (threshold or k value) is described only at a high level; an explicit algorithm box or pseudocode would clarify reproducibility (a hedged sketch of one possible form follows this list).
  2. [Figure 3] The attention heatmaps are useful but lack axis labels indicating token indices and head numbers; adding these would improve readability.
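
On the first minor comment, a hedged sketch of one plausible form of the clustering step: smooth the per-position importance scores with 1-D pooling so that neighbouring positions are kept or dropped together, then take the top-k of the smoothed scores. The kernel size and the use of max pooling are illustrative choices, not settings confirmed from the paper.

```python
import torch
import torch.nn.functional as F


def clustered_topk(importance, k, kernel=7):
    """importance: [heads, prefix_len] per-position scores from the observation window.
    Returns the indices of the k positions per head after neighbourhood pooling
    (an odd kernel with padding kernel // 2 keeps the sequence length unchanged)."""
    pooled = F.max_pool1d(
        importance.unsqueeze(1), kernel_size=kernel, stride=1, padding=kernel // 2
    ).squeeze(1)
    return pooled.topk(k, dim=-1).indices


# Toy check: a single spike at index 2. After pooling, its neighbours 1 and 3 also
# score high, so the spike is retained together with its neighbourhood (a "cluster").
scores = torch.tensor([[0.0, 0.0, 5.0, 0.0, 0.0, 2.0, 2.5, 2.0, 0.0, 0.0]])
print(clustered_topk(scores, k=4, kernel=3))
```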

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the major comments below and will revise the manuscript accordingly to strengthen the claims and improve clarity.

point-by-point responses
  1. Referee: [§3, §4.2] The central claim that the one-time selection from the observation window suffices for the entire autoregressive generation rests on the untested assumption that attention patterns do not shift materially once generation begins. No ablation is reported that inserts query shifts (e.g., multi-hop reasoning or dispersed-fact retrieval) to measure whether critical KV entries are dropped; the 3.6x/8.2x figures therefore cannot be taken as robust without such evidence.

    Authors: We appreciate this observation. Our method is based on the consistent focus patterns observed in the attention heads, which we found to hold across the generation process in our evaluations. To directly address the concern, we will add an ablation study in the revised version involving tasks with significant query shifts, such as multi-hop reasoning and long-context retrieval with dispersed facts. This will confirm if any critical KV entries are missed and support the reported efficiency gains. revision: yes

  2. Referee: [Table 2, §4.1] Performance is reported as 'comparable' across 16 datasets, yet no error bars, standard deviations, or number of runs are supplied, nor are the exact baseline KV-cache implementations (FlashAttention version, precision, etc.) detailed. This makes it impossible to judge whether the observed differences are statistically meaningful or merely within run-to-run variance.

    Authors: We concur that additional statistical information and implementation details are necessary. In the updated manuscript, we will provide error bars and standard deviations for the results in Table 2, based on multiple independent runs. We will also detail the exact baseline configurations, including the FlashAttention version used, precision settings, and other relevant parameters in Section 4.1. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical observation applied directly as heuristic

full rationale

The paper's chain begins with an empirical discovery ('each attention head in the model consistently focuses on specific prompt attention features during generation' and 'this robust pattern can be obtained from an observation window located at the end of the prompts') and proceeds to a direct rule ('SnapKV automatically compresses KV caches by selecting clustered important KV positions for each attention head'). No equations, fitted parameters, or self-citations are invoked to derive the selection rule; the method is a one-time heuristic extracted from the observation window and applied statically. Performance numbers (3.6x speed, 8.2x memory) are reported from experiments on 16 datasets rather than forced by construction. The central claim therefore remains independent of its inputs and does not reduce to self-definition, renaming, or load-bearing self-citation.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the empirical discovery that attention heads maintain stable focus patterns visible in a short end-of-prompt window; no free parameters are explicitly named in the abstract, but window size and clustering thresholds are implicitly chosen.

free parameters (2)
  • observation window length
    Length of the suffix used to observe attention patterns; must be chosen to capture the stable focus without including too much generation noise.
  • clustering threshold or k
    Parameter controlling how many or which clustered positions are retained per head.
axioms (1)
  • domain assumption: Each attention head focuses on a consistent set of prompt features throughout generation.
    Stated directly in the abstract as the key discovery enabling the compression rule.
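
For concreteness, the ledger's knobs collected into one illustrative configuration; the values are placeholders rather than the paper's reported settings, and the pooling kernel is a further knob implied by the clustering step rather than named in the ledger.

```python
from dataclasses import dataclass


@dataclass
class SnapKVConfig:
    observation_window: int = 32   # suffix length used to read attention patterns
    kv_budget: int = 1024          # clustered positions kept per head (the "k")
    pool_kernel: int = 7           # neighbourhood size used when clustering

    def retained_fraction(self, prompt_len: int) -> float:
        """Fraction of the prompt KV cache kept for one head."""
        return min(1.0, self.kv_budget / prompt_len)


cfg = SnapKVConfig()
print(f"16K prompt -> {cfg.retained_fraction(16_000):.1%} of the cache kept per head")
```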

pith-pipeline@v0.9.0 · 5591 in / 1391 out tokens · 43677 ms · 2026-05-13T12:52:55.319443+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. VORT: Adaptive Power-Law Memory for NLP Transformers

    cs.LG 2026-05 unverdicted novelty 7.0

    VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.

  2. MISA: Mixture of Indexer Sparse Attention for Long-Context LLM Inference

    cs.LG 2026-05 conditional novelty 7.0

    MISA routes to a small subset of indexer heads via block statistics, matching full DSA performance on LongBench with 4-8x fewer heads and 3.82x speedup while recovering over 92% of selected tokens.

  3. Long Context Pre-Training with Lighthouse Attention

    cs.CL 2026-05 conditional novelty 7.0

    Lighthouse Attention enables faster long-context pre-training via gradient-free symmetrical hierarchical compression of QKV while preserving causality, followed by a short full-attention recovery that yields lower los...

  4. OptiVerse: A Comprehensive Benchmark towards Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 7.0

    OptiVerse is a new benchmark spanning neglected optimization domains that shows LLMs suffer sharp accuracy drops on hard problems due to modeling and logic errors, with a Dual-View Auditor Agent proposed to improve pe...

  5. Neural Garbage Collection: Learning to Forget while Learning to Reason

    cs.LG 2026-04 conditional novelty 7.0

    Language models learn to evict KV cache entries end-to-end via reinforcement learning from outcome reward alone, achieving 2-3x cache compression while maintaining accuracy on Countdown, AMC, and AIME tasks.

  6. How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...

  7. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  8. Transactional Attention: Semantic Sponsorship for KV-Cache Retention

    cs.CL 2026-04 unverdicted novelty 7.0

    Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

  9. Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit

    cs.LG 2026-04 unverdicted novelty 7.0

    Sequential KV compression via probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.

  10. KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

    cs.LG 2026-05 conditional novelty 6.0

    KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.

  11. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

    cs.AR 2026-05 unverdicted novelty 6.0

    KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

  12. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  13. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  14. UniPrefill: Universal Long-Context Prefill Acceleration via Block-wise Dynamic Sparsification

    cs.CL 2026-05 unverdicted novelty 6.0

    UniPrefill accelerates LLM prefill via block-wise dynamic sparsification, achieving up to 2.1x TTFT speedup while supporting hybrid architectures and native vLLM continuous batching.

  15. Dual-Cluster Memory Agent: Resolving Multi-Paradigm Ambiguity in Optimization Problem Solving

    cs.CL 2026-04 unverdicted novelty 6.0

    DCM-Agent improves LLM performance on multi-paradigm optimization problems by 11-21% via dual-cluster memory construction and dynamic inference guidance.

  16. SinkRouter: Sink-Aware Routing for Efficient Long-Context Decoding in Large Language and Multimodal Models

    cs.LG 2026-04 unverdicted novelty 6.0

    SinkRouter identifies attention sinks as training-derived fixed points and routes around them to skip redundant KV-cache loads, delivering up to 2.03x decoding speedup on long-context benchmarks.

  17. PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

    cs.CL 2024-06 conditional novelty 6.0

    PyramidKV dynamically compresses KV cache across layers following pyramidal information funneling, matching full performance at 12% retention and outperforming alternatives at 0.7% retention with up to 20.5 accuracy gains.

  18. How to Compress KV Cache in RL Post-Training? Shadow Mask Distillation for Memory-Efficient Alignment

    cs.LG 2026-05 unverdicted novelty 5.0

    Shadow Mask Distillation enables KV cache compression in RL post-training of LLMs by mitigating amplified off-policy bias that defeats standard importance reweighting.

  19. Predictive Multi-Tier Memory Management for KV Cache in Large-Scale GPU Inference

    cs.AR 2026-04 unverdicted novelty 5.0

    A unified KV cache system with architecture-specific sizing, six-tier memory from GPU to filesystems, and Bayesian prediction delivers 7.4x higher batch sizes, 70-84% hit rates, and projected 1.7-2.9x throughput gains.

  20. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  21. Flux Attention: Context-Aware Hybrid Attention for Efficient LLMs Inference

    cs.LG 2026-04 unverdicted novelty 5.0

    Flux Attention uses a context-aware Layer Router to dynamically assign full or sparse attention to each LLM layer, achieving up to 2.8x prefill and 2.0x decode speedups with competitive performance on long-context and...

  22. BFLA: Block-Filtered Long-Context Attention Mechanism

    eess.SP 2026-05 unverdicted novelty 4.0

    BFLA is a two-stage block-filtered sparse prefill attention mechanism that constructs an input-dependent block mask and applies tile-level rescues to skip unimportant KV tiles while preserving exact attention inside r...

Reference graph

Works this paper leans on

37 extracted references · 37 canonical work pages · cited by 21 Pith papers · 9 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  2. [2]

    Command r: Retrieval-augmented generation at production scale, March 2024

    Cohere. Command r: Retrieval-augmented generation at production scale, March 2024. URL https://txt.cohere.com/command-r

  3. [3]

    The claude 3 model family: Opus, sonnet, haiku, March 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, March 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  4. [4]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-Baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  5. [5]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

  6. [6]

    H2o: Heavy-hitter oracle for efficient generative inference of large language models

    Zhenyu Zhang, Ying Sheng, Tianyi Zhou, Tianlong Chen, Lianmin Zheng, Ruisi Cai, Zhao Song, Yuandong Tian, Christopher Ré, Clark Barrett, et al. H2o: Heavy-hitter oracle for efficient generative inference of large language models. Advances in Neural Information Processing Systems, 36, 2024

  7. [7]

    Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time

    Zichang Liu, Aditya Desai, Fangshuo Liao, Weitao Wang, Victor Xie, Zhaozhuo Xu, Anastasios Kyrillidis, and Anshumali Shrivastava. Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time. Advances in Neural Information Processing Systems, 36, 2024

  8. [8]

    Model tells you what to discard: Adaptive kv cache compression for llms

    Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801, 2023

  9. [9]

    Lifelong and continual learning dialogue systems: learning during conversation

    Bing Liu and Sahisnu Mazumder. Lifelong and continual learning dialogue systems: learning during conversation. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 35, pages 15058–15063, 2021

  10. [10]

    Codeplan: Repository-level coding using llms and planning

    Ramakrishna Bairi, Atharv Sonwane, Aditya Kanade, Arun Iyer, Suresh Parthasarathy, Sriram Rajamani, B Ashok, Shashank Shet, et al. Codeplan: Repository-level coding using llms and planning. arXiv preprint arXiv:2309.12499, 2023

  11. [11]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

  12. [12]

    Qmsum: A new benchmark for query-based multi-domain meeting summarization

    Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan Awadallah, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query-based multi-domain meeting summarization. arXiv preprint arXiv:2104.05938, 2021

  13. [13]

    L-Eval: Instituting standardized evaluation for long context language models

    Chenxin An, Shansan Gong, Ming Zhong, Mukai Li, Jun Zhang, Lingpeng Kong, and Xipeng Qiu. L-eval: Instituting standardized evaluation for long context language models. arXiv preprint arXiv:2307.11088, 2023

  14. [14]

    Extractive opinion summarization in quantized transformer spaces

    Stefanos Angelidis, Reinald Kim Amplayo, Yoshihiko Suhara, Xiaolan Wang, and Mirella Lapata. Extractive opinion summarization in quantized transformer spaces. Transactions of the Association for Computational Linguistics, 9:277–293, 2021

  15. [15]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  16. [16]

    World model on million-length video and language with ringattention

    Hao Liu, Wilson Yan, Matei Zaharia, and Pieter Abbeel. World model on million-length video and language with ringattention. arXiv preprint arXiv:2402.08268, 2024

  17. [17]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508, 2023

  18. [18]

    Needle in a haystack–pressure testing llms, 2023

    G Kamradt. Needle in a haystack–pressure testing llms, 2023

  19. [19]

    How long can context length of open-source llms truly promise?

    Dacheng Li, Rulin Shao, Anze Xie, Ying Sheng, Lianmin Zheng, Joseph Gonzalez, Ion Stoica, Xuezhe Ma, and Hao Zhang. How long can context length of open-source llms truly promise? In NeurIPS 2023 Workshop on Instruction Tuning and Instruction Following, 2023

  20. [20]

    Llm maybe longlm: Self-extend llm context window without tuning

    Hongye Jin, Xiaotian Han, Jingfeng Yang, Zhimeng Jiang, Zirui Liu, Chia-Yuan Chang, Huiyuan Chen, and Xia Hu. Llm maybe longlm: Self-extend llm context window without tuning. arXiv preprint arXiv:2401.01325, 2024

  21. [21]

    Overview of bioasq 2023: The eleventh bioasq challenge on large-scale biomedical semantic indexing and question answering, 2023

    Anastasios Nentidis, Georgios Katsimpras, Anastasia Krithara, Salvador Lima López, Eulália Farré-Maduell, Luis Gasco, Martin Krallinger, and Georgios Paliouras. Overview of bioasq 2023: The eleventh bioasq challenge on large-scale biomedical semantic indexing and question answering, 2023

  22. [22]

    HotpotQA: A Dataset for Diverse, Explainable Multi-hop Question Answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William W Cohen, Ruslan Salakhutdinov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. arXiv preprint arXiv:1809.09600, 2018

  23. [23]

    Long context prompting for claude 2.1, December 2023

    Anthropic. Long context prompting for claude 2.1, December 2023. URL https://www.anthropic.com/news/claude-2-1-prompting

  24. [24]

    Lost in the middle: How language models use long contexts

    Nelson F Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12:157–173, 2024

  25. [25]

    Retrieval augmented generation (rag), 2023

    Cohere. Retrieval augmented generation (rag), 2023. URL https://docs.cohere.com/docs/retrieval-augmented-generation-rag

  26. [26]

    Cohere embed, 2023

    Cohere. Cohere embed, 2023. URL https://docs.cohere.com/reference/embed

  27. [27]

    Cohere rerank, 2023

    Cohere. Cohere rerank, 2023. URL https://docs.cohere.com/docs/rerank-guide

  28. [28]

    Blockwise parallel decoding for deep autoregressive models

    Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018

  29. [29]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023

  30. [30]

    Accelerating Large Language Model Decoding with Speculative Sampling

    Charlie Chen, Sebastian Borgeaud, Geoffrey Irving, Jean-Baptiste Lespiau, Laurent Sifre, and John Jumper. Accelerating large language model decoding with speculative sampling. arXiv preprint arXiv:2302.01318, 2023

  31. [31]

    Specinfer: Accelerating generative llm serving with speculative inference and token tree verification

    Xupeng Miao, Gabriele Oliaro, Zhihao Zhang, Xinhao Cheng, Zeyu Wang, Rae Ying Yee Wong, Zhuoming Chen, Daiyaan Arfeen, Reyna Abhyankar, and Zhihao Jia. Specinfer: Accelerating generative llm serving with speculative inference and token tree verification. arXiv preprint arXiv:2305.09781, 2023

  32. [32]

    Recurrent drafter for fast speculative decoding in large language models

    Aonan Zhang, Chong Wang, Yi Wang, Xuanyu Zhang, and Yunfei Cheng. Recurrent drafter for fast speculative decoding in large language models. arXiv preprint arXiv:2403.09919, 2024

  33. [33]

    Flashattention: Fast and memory-efficient exact attention with io-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. Flashattention: Fast and memory-efficient exact attention with io-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022

  34. [34]

    FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning

    Tri Dao. Flashattention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691, 2023

  35. [35]

    Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads

    Tianle Cai, Yuhong Li, Zhengyang Geng, Hongwu Peng, Jason D Lee, Deming Chen, and Tri Dao. Medusa: Simple llm inference acceleration framework with multiple decoding heads. arXiv preprint arXiv:2401.10774, 2024

  36. [36]

    Flash-decoding for long-context inference, 2023

    Tri Dao, Daniel Haziza, Francisco Massa, and Grigory Sizov. Flash-decoding for long-context inference, 2023

  37. [37]

    A dataset of information-seeking questions and answers anchored in research papers

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. arXiv preprint arXiv:2105.03011, 2021