Pith · machine review for the scientific record

arXiv: 2406.02069 · v4 · submitted 2024-06-04 · 💻 cs.CL · cs.AI

Recognition: 2 theorem links

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 09:49 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords KV cache compression · long context · attention patterns · Pyramidal Information Funneling · LLM efficiency · dynamic allocation · attention sink

The pith

LLMs funnel attention from wide scattering in lower layers to focused critical tokens in higher layers, enabling PyramidKV to compress the KV cache to 12% of its original size while matching full-cache performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models process long contexts through a pyramidal funneling of attention: information scatters broadly in early layers before consolidating on key tokens later. This paper introduces PyramidKV, which allocates a larger KV cache to lower layers and a smaller one to upper layers to match this pattern. Experiments on LongBench show it matches full-cache accuracy while retaining just 12% of the cache. At an extreme compression ratio of 0.7%, it improves on other methods by up to 20.5 points on TREC, and it reaches 100% accuracy on Needle-in-a-Haystack with only 128 cache entries for Llama-3-70B.

Core claim

Attention-based information flow in LLMs follows Pyramidal Information Funneling: attention scatters widely in lower layers, progressively consolidates, and focuses on critical tokens in higher layers. PyramidKV exploits this by dynamically adjusting KV cache sizes across layers, allocating more in lower layers and less in higher ones, rather than using uniform sizes.
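The layer-wise allocation the claim describes can be sketched concretely. The minimal illustration below assumes a linear (arithmetic) decrease in budget from the lowest to the highest layer; the function name, the `ratio` parameter, and the exact schedule are illustrative assumptions, not the paper's published formula.

```python
# Hypothetical pyramid-style per-layer KV budget split: the lowest layer
# gets `ratio` times the budget of the highest layer, with a linear
# decrease in between, and the per-layer budgets sum to the total.
def pyramid_budgets(total_budget: int, num_layers: int, ratio: float = 4.0) -> list[int]:
    top = 2 * total_budget / (num_layers * (1 + ratio))  # budget of the highest layer
    step = top * (ratio - 1) / max(num_layers - 1, 1)    # per-layer decrement
    budgets = [round(top + step * (num_layers - 1 - layer))
               for layer in range(num_layers)]
    budgets[0] += total_budget - sum(budgets)            # absorb rounding drift
    return budgets

budgets = pyramid_budgets(total_budget=4096, num_layers=32, ratio=4.0)
assert sum(budgets) == 4096
assert budgets[0] > budgets[-1]  # lower layers retain more entries
```

Uniform baselines correspond to `ratio=1.0`; the pyramidal claim is precisely that `ratio > 1` schedules should win at matched total budget.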

What carries the argument

Pyramidal Information Funneling: the pattern of wide attention in lower layers consolidating to critical tokens (attention sinks) in higher layers, which justifies non-uniform KV cache allocation.

Load-bearing premise

The pyramidal information funneling pattern is consistent across models, tasks, and context lengths, allowing fixed layer-wise retention ratios to work effectively without per-task retuning.
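This premise is directly testable from attention maps. A minimal diagnostic sketch, assuming access to per-layer attention weights (e.g. via `output_attentions=True` in common inference stacks): compute mean attention entropy per layer and check whether it decreases with depth. The function names are illustrative; a flat entropy profile on a new model would be evidence against the premise.

```python
import numpy as np

def attention_entropy(attn: np.ndarray) -> float:
    """Mean Shannon entropy (nats) of attention rows.
    `attn` has shape [heads, queries, keys]; each row sums to 1."""
    p = np.clip(attn, 1e-12, 1.0)
    return float(-(p * np.log(p)).sum(axis=-1).mean())

def funneling_profile(per_layer_attn: list[np.ndarray]) -> list[float]:
    """Entropy per layer; a broadly decreasing profile is consistent
    with pyramidal funneling."""
    return [attention_entropy(a) for a in per_layer_attn]

# Toy sanity check: scattered attention (early layers, per the premise)
# has higher entropy than sink-like attention (late layers).
scattered = np.full((2, 4, 8), 1 / 8)            # uniform over 8 keys
sink = np.zeros((2, 4, 8))
sink[..., 0] = 1.0                               # all mass on one key
assert attention_entropy(sink) < attention_entropy(scattered)
```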

What would settle it

If attention patterns on a new model or task show no increasing focus on critical tokens in higher layers, or if uniform KV cache sizes outperform the pyramidal layer-wise allocation on LongBench or needle retrieval.

Original abstract

In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve 100.0 Acc. performance.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper observes a 'pyramidal information funneling' pattern in LLM attention (wide scattering in lower layers, progressive consolidation, and focus on critical tokens/sinks in higher layers). Motivated by this, it introduces PyramidKV, a dynamic KV-cache compression scheme that allocates more cache entries to lower layers and fewer to higher layers. On LongBench it matches full-cache performance at 12% retention and outperforms prior compression methods by up to 20.5 points on TREC at 0.7% retention; on Needle-in-a-Haystack, 128 entries yield 100% accuracy for Llama-3-70B.

Significance. If the funneling pattern proves consistent and the layer-wise ratios generalize, PyramidKV would provide a simple, observation-driven route to substantial memory savings for long-context inference. The empirical gains on standard benchmarks are concrete and the method avoids heavy per-task tuning, which is a practical strength in the KV-compression literature.

major comments (3)
  1. [§4.1] §4.1 (LongBench results): absolute accuracy deltas (e.g., +20.5 on TREC at 0.7% cache) are reported without error bars, multiple random seeds, or statistical significance tests, so it is impossible to judge whether the claimed superiority over baselines is robust.
  2. [§3.2] §3.2 (Method): the layer-wise retention ratios are asserted to follow from the pyramidal observation, yet the manuscript supplies neither the exact selection procedure (attention-score thresholds, entropy heuristics, or manual tuning) nor any quantitative verification that these fixed ratios remain effective across context lengths or tasks outside the reported benchmarks.
  3. [§4.3] §4.3 (Needle-in-a-Haystack): the 100.0 accuracy claim with 128 entries for Llama-3-70B is presented without specifying needle-position distribution, number of trials, or a direct full-cache baseline under identical prompting, weakening the reproducibility of the extreme-compression result.
minor comments (3)
  1. [Abstract] Abstract and §2: the phrase 'massive activation or attention sink' should cite the original attention-sink literature for clarity.
  2. [Figures] Figure captions: attention heat-map figures lack explicit layer indexing and token-position labels, making the funneling pattern harder to inspect.
  3. [§4] §4: exact per-layer retention percentages or the formula used to derive them are not tabulated, impeding direct reproduction.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below, along with plans for revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [§4.1] §4.1 (LongBench results): absolute accuracy deltas (e.g., +20.5 on TREC at 0.7% cache) are reported without error bars, multiple random seeds, or statistical significance tests, so it is impossible to judge whether the claimed superiority over baselines is robust.

    Authors: We agree that the absence of error bars and multi-seed results makes it difficult to assess robustness. In the revised manuscript, we will include a discussion of this limitation and provide results from at least three random seeds for the key experiments, along with standard deviations. revision: yes

  2. Referee: [§3.2] §3.2 (Method): the layer-wise retention ratios are asserted to follow from the pyramidal observation, yet the manuscript supplies neither the exact selection procedure (attention-score thresholds, entropy heuristics, or manual tuning) nor any quantitative verification that these fixed ratios remain effective across context lengths or tasks outside the reported benchmarks.

    Authors: The ratios are chosen to reflect the observed pyramidal funneling, with higher retention in lower layers where attention is more scattered. We will expand Section 3.2 to describe the exact procedure used to select these ratios and include additional experiments or analysis demonstrating their effectiveness on a broader range of context lengths and tasks. revision: yes

  3. Referee: [§4.3] §4.3 (Needle-in-a-Haystack): the 100.0 accuracy claim with 128 entries for Llama-3-70B is presented without specifying needle-position distribution, number of trials, or a direct full-cache baseline under identical prompting, weakening the reproducibility of the extreme-compression result.

    Authors: We will revise the description in Section 4.3 to detail the needle insertion positions (randomly distributed), the number of evaluation trials, and explicitly state the full KV cache performance under the same conditions for direct comparison. revision: yes
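The selection procedure at issue in the second major comment can be made concrete. Below is a hedged sketch of per-layer token retention in the style of SnapKV-like scoring, which this line of work builds on: score each prefix key by the attention it receives from a recent observation window of queries, then keep the top entries under that layer's budget. The function name, sum-pooling choice, and always-keep-the-window rule are illustrative assumptions, not the paper's exact procedure.

```python
import numpy as np

def select_kv_indices(attn: np.ndarray, budget: int, window: int = 8) -> np.ndarray:
    """attn: [heads, queries, keys] attention weights for one layer.
    Returns sorted indices of the `budget` keys to retain."""
    obs = attn[:, -window:, :]          # queries in the observation window
    scores = obs.sum(axis=(0, 1))       # pooled attention received per key
    scores[-window:] = np.inf           # always retain the recent window itself
    keep = np.argsort(-scores)[:budget] # highest-scoring keys first
    return np.sort(keep)

rng = np.random.default_rng(1)
attn = rng.random((4, 16, 64))
attn /= attn.sum(axis=-1, keepdims=True)   # normalize rows to valid attention
idx = select_kv_indices(attn, budget=16)
assert len(idx) == 16
```

Combined with a per-layer budget schedule, this is the kind of fully specified procedure the referee asks the authors to tabulate.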

Circularity Check

0 steps flagged

No significant circularity; empirical method grounded in observations and benchmarks

Full rationale

The paper begins with direct observations of attention patterns (pyramidal funneling) across layers in LLMs, uses these to motivate a layer-wise dynamic KV cache allocation heuristic, and validates the resulting PyramidKV method through external benchmarks (LongBench, Needle-in-a-Haystack, TREC). No equations, fitted parameters, or self-citations are presented as 'predictions' that reduce by construction to the inputs; the performance claims rest on empirical results rather than any definitional equivalence or load-bearing self-reference. The derivation chain is therefore self-contained and non-circular.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Central claim rests on the empirical observation of layer-wise attention consolidation and on the effectiveness of manually or heuristically chosen per-layer retention ratios.

free parameters (1)
  • layer-wise KV retention ratios
    Specific fractions of cache kept per layer are chosen to follow the pyramidal pattern and are not derived from first principles.
axioms (1)
  • domain assumption LLMs exhibit consistent pyramidal information funneling across layers
    The compression policy is motivated by and depends on this observed pattern holding generally.

pith-pipeline@v0.9.0 · 5583 in / 1273 out tokens · 71799 ms · 2026-05-12T09:49:56.960440+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • HierarchyEmergence hierarchy_emergence_forces_phi (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers.

  • HierarchyRealization realized_hierarchy_forces_phi (echoes)

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    PyramidKV allocates more KV cache to the lower layers where information is more dispersed and each KV state contains less information, while reducing the KV cache in higher layers where information becomes concentrated in fewer key tokens.

What do these tags mean?

  • matches: the paper's claim is directly supported by a theorem in the formal canon.
  • supports: the theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: the paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: the paper appears to rely on the theorem as machinery.
  • contradicts: the paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 32 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression

    cs.AI 2026-05 unverdicted novelty 7.0

    FibQuant is a universal fixed-rate vector quantizer for KV-cache compression that uses a radial-angular codebook matched to the spherical-Beta source after Haar rotation and strictly outperforms scalar quantization at...

  2. How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers

    cs.LG 2026-04 unverdicted novelty 7.0

    Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...

  3. Sparse Prefix Caching for Hybrid and Recurrent LLM Serving

    cs.LG 2026-04 unverdicted novelty 7.0

    Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.

  4. Transactional Attention: Semantic Sponsorship for KV-Cache Retention

    cs.CL 2026-04 unverdicted novelty 7.0

    Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.

  5. Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation

    cs.LG 2026-04 unverdicted novelty 7.0

    The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.

  6. TriAttention: Efficient Long Reasoning with Trigonometric KV Compression

    cs.CL 2026-04 unverdicted novelty 7.0

    TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.

  7. Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility

    cs.LG 2026-05 unverdicted novelty 6.0

    SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.

  8. Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation

    cs.CV 2026-05 unverdicted novelty 6.0

    Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.

  9. KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference

    cs.LG 2026-05 conditional novelty 6.0

    KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.

  10. KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving

    cs.AR 2026-05 unverdicted novelty 6.0

    KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.

  11. Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction

    cs.LG 2026-05 unverdicted novelty 6.0

    A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.

  12. Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning

    cs.CL 2026-05 unverdicted novelty 6.0

    A semantics-aware KV cache hierarchy offloads tokens to slower memory with zero approximation error, demonstrating that LLM reasoning accuracy depends only on the permanent eviction ratio and not on HBM residency.

  13. ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing

    cs.CL 2026-05 conditional novelty 6.0

    ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...

  14. Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents

    cs.MA 2026-05 unverdicted novelty 6.0

    Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.

  15. RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...

  16. Reformulating KV Cache Eviction Problem for Long-Context LLM Inference

    cs.CL 2026-05 unverdicted novelty 6.0

    LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.

  17. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.

  18. Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache

    cs.LG 2026-05 unverdicted novelty 6.0

    Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...

  19. Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility

    cs.AI 2026-05 unverdicted novelty 6.0

    SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...

  20. Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding

    cs.AR 2026-04 unverdicted novelty 6.0

    Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.

  21. DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing

    cs.CL 2026-04 unverdicted novelty 6.0

    DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.

  22. Graph-Guided Adaptive Channel Elimination for KV Cache Compression

    eess.SP 2026-04 unverdicted novelty 6.0

    GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.

  23. RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction

    cs.LG 2026-04 unverdicted novelty 6.0

    RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...

  24. CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference

    cs.DC 2026-04 unverdicted novelty 6.0

    CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...

  25. eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization

    cs.LG 2026-04 unverdicted novelty 6.0

    eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.

  26. LightThinker++: From Reasoning Compression to Memory Management

    cs.CL 2026-04 unverdicted novelty 6.0

    LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.

  27. Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing

    cs.LG 2026-04 unverdicted novelty 6.0

    Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.

  28. An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference

    cs.LG 2026-05 unverdicted novelty 5.0

    Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.

  29. HieraSparse: Hierarchical Semi-Structured Sparse KV Attention

    cs.DC 2026-04 unverdicted novelty 5.0

    HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...

  30. From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI

    cs.AI 2026-04 unverdicted novelty 5.0

    LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.

  31. AudioKV: KV Cache Eviction in Efficient Large Audio Language Models

    cs.SD 2026-04 unverdicted novelty 5.0

    AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.

  32. RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference

    cs.LG 2025-05

Reference graph

Works this paper leans on

40 extracted references · 40 canonical work pages · cited by 31 Pith papers · 8 internal anchors

  1. [1]

    GPT-4 Technical Report

    • •Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. arXiv preprint arXiv:2308.14508,

  3. [3]

    An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models

    Liang Chen, Haozhe Zhao, Tianyu Liu, Shuai Bai, Junyang Lin, Chang Zhou, and Baobao Chang. An image is worth 1/2 tokens after layer 2: Plug-and-play inference acceleration for large vision-language models. arXiv preprint arXiv:2403.06764, 2024a. Yukang Chen, Shengju Qian, Haotian Tang, Xin Lai, Zhijian Liu, Song Han, and Jiaya Jia. Longlora: Efficient fin...

  4. [4]

    Mag- icpig: Lsh sampling for efficient llm generation.arXiv preprint arXiv:2410.16179,

    Zhuoming Chen, Ranajoy Sadhukhan, Zihao Ye, Yang Zhou, Jianyu Zhang, Niklas Nolte, Yuandong Tian, Matthijs Douze, Leon Bottou, Zhihao Jia, and Beidi Chen. Magicpig: Lsh sampling for efficient llm generation, 2024b. URL https://arxiv.org/abs/2410.16179. Wei-Lin Chiang, Zhuohan Li, Zi Lin, Ying Sheng, Zhanghao Wu, Hao Zhang, Lianmin Zheng, Siyuan Zhuang, Yo...

  5. [5]

    Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner

    URL https://lmsys.org/blog/2023-03-30-vicuna/ . Pradeep Dasigi, Kyle Lo, Iz Beltagy, Arman Cohan, Noah A Smith, and Matt Gardner. A dataset of information-seeking questions and answers anchored in research papers. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologi...

  6. [6]

    LongRoPE: Extending LLM Context Window Beyond 2 Million Tokens

    Yiran Ding, Li Lyna Zhang, Chengruidong Zhang, Yuanyuan Xu, Ning Shang, Jiahang Xu, Fan Yang, and Mao Yang. Longrope: Extending llm context window beyond 2 million tokens. arXiv preprint arXiv:2402.13753,

  7. [7]

    Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference

    Harry Dong, Xinyu Yang, Zhenyu Zhang, Zhangyang Wang, Yuejie Chi, and Beidi Chen. Get more with less: Synthesizing recurrence with kv cache compression for efficient llm inference. arXiv preprint arXiv:2402.09398,

  8. [8]

    Multi-News: A Large-Scale Multi-Document Summarization Dataset and Abstractive Hierarchical Model , url =

    Alexander Fabbri, Irene Li, Tianwei She, Suyi Li, and Dragomir Radev. Multi-news: A large-scale multi-document summarization dataset and abstractive hierarchical model. In Anna Korhonen, David Traum, and Lluís Màrquez, editors, Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 1074–1084, Florence, Italy, July 2...

  9. [9]

    Model tells you what to discard: Adaptive kv cache compression for llms.arXiv preprint arXiv:2310.01801,

    10 Preprint. Under review. Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms. arXiv preprint arXiv:2310.01801,

  10. [10]

    Samsum corpus: A human-annotated dialogue dataset for abstractive summarization

    Bogdan Gliwa, Iwona Mochol, Maciej Biesek, and Aleksander Wawer. Samsum corpus: A human-annotated dialogue dataset for abstractive summarization. EMNLP-IJCNLP 2019, page 70,

  11. [11]

    Longcoder: A long-range pre-trained language model for code completion

    Daya Guo, Canwen Xu, Nan Duan, Jian Yin, and Julian McAuley. Longcoder: A long-range pre-trained language model for code completion. arXiv preprint arXiv:2306.14893,

  12. [12]

    Lm-infinite: Simple on-the-fly length generalization for large language models

    Chi Han, Qifan Wang, Wenhan Xiong, Yu Chen, Heng Ji, and Sinong Wang. Lm-infinite: Simple on-the-fly length generalization for large language models. arXiv preprint arXiv:2308.16137,

  13. [13]

    Efficient attentions for long document summarization

    Luyang Huang, Shuyang Cao, Nikolaus Parulian, Heng Ji, and Lu Wang. Efficient attentions for long document summarization. In Proceedings of the 2021 Conference of the North Ameri- can Chapter of the Association for Computational Linguistics: Human Language Technologies, pages 1419–1436,

  14. [14]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825,

  15. [15]

    Greg Kamradt

    Huiqiang Jiang, Yucheng Li, Chengruidong Zhang, Qianhui Wu, Xufang Luo, Surin Ahn, Zhenhua Han, Amir H Abdi, Dongsheng Li, Chin-Yew Lin, et al. Minference 1.0: Ac- celerating pre-filling for long-context llms via dynamic sparse attention. arXiv preprint arXiv:2407.02490,

  16. [16]

    Xin Li and Dan Roth

    URL https://arxiv.org/abs/2406.19707. Xin Li and Dan Roth. Learning question classifiers. In COLING 2002: The 19th International Conference on Computational Linguistics,

  17. [17]

    SnapKV: LLM Knows What You are Looking for Before Generation

    Yuhong Li, Yingbing Huang, Bowen Yang, Bharat Venkitesh, Acyr Locatelli, Hanchen Ye, Tianle Cai, Patrick Lewis, and Deming Chen. Snapkv: Llm knows what you are looking for before generation. arXiv preprint arXiv:2404.14469,

  18. [18]

    Code Llama: Open Foundation Models for Code

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts, 2023a. Tianyang Liu, Canwen Xu, and Julian McAuley. Repobench: Benchmarking repository-level code auto-completion systems, 2023b. Baptiste Roziere, Jonas Gehring, Fabian Gloeckle, Sten Soot...

  19. [19]

    Z., and Liu, Z

    11 Preprint. Under review. Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762,

  20. [20]

    LLaMA: Open and Efficient Foundation Language Models

    Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. Llama: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay ...

  21. [21]

    Label words are anchors: An information flow perspective for understanding in-context learning

    Lean Wang, Lei Li, Damai Dai, Deli Chen, Hao Zhou, Fandong Meng, Jie Zhou, and Xu Sun. Label words are anchors: An information flow perspective for understanding in-context learning. arXiv preprint arXiv:2305.14160,

  22. [22]

    Rating: [[...]] Analysis:

    Wenhao Wu, Yizhong Wang, Guangxuan Xiao, Hao Peng, and Yao Fu. Retrieval head mechanistically explains long-context factuality. arXiv preprint arXiv:2404.15574,

  23. [23]

    Efficient Streaming Language Models with Attention Sinks

    Guangxuan Xiao, Yuandong Tian, Beidi Chen, Song Han, and Mike Lewis. Efficient stream- ing language models with attention sinks. arXiv preprint arXiv:2309.17453,

  24. [24]

    Pyramid- infer: Pyramid kv cache compression for high-throughput llm inference

    Dongjie Yang, XiaoDong Han, Yan Gao, Yao Hu, Shilin Zhang, and Hai Zhao. Pyramid- infer: Pyramid kv cache compression for high-throughput llm inference. arXiv preprint arXiv:2405.12532,

  25. [25]

    Hotpotqa: A dataset for diverse, explainable multi-hop question answering

    Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdi- nov, and Christopher D Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pages 2369–2380,

  26. [26]

    Qmsum: A new benchmark for query- based multi-domain meeting summarization

    Ming Zhong, Da Yin, Tao Yu, Ahmad Zaidi, Mutethia Mutuma, Rahul Jha, Ahmed Hassan, Asli Celikyilmaz, Yang Liu, Xipeng Qiu, et al. Qmsum: A new benchmark for query- based multi-domain meeting summarization. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, page...

  27. [27]

    Pose: Efficient context window extension of llms via positional skip-wise training.arXiv preprint arXiv:2309.10400, 2023

    Dawei Zhu, Nan Yang, Liang Wang, Yifan Song, Wenhao Wu, Furu Wei, and Sujian Li. PoSE: Efficient context window extension of LLMs via positional skip-wise training. arXiv preprint arXiv:2309.10400, 2023.

  28. [28]

    attention sink

    [Figure: attention-weights heatmaps for layers 24 and 30, annotated with Localized Attention, Attention Sink, and Massive Attention.] Figure 6: Attention patterns of retrieval-augmented generation across layers in the Mixtral-8x7B-Instruct Mixture-of-Experts model. C Related Work. Interpretation of LLMs: Prior research has shown that attention matr...

  29. [29]

    Figure 5 and Figure 6 demonstrate that the Pyramidal Information Funneling phenomenon is also evident in both the Mistral model and Mixtral model

    for the Mistral-7B-Instruct model and the Mixtral-8x7B-Instruct Mixture-of-Experts model. Figure 5 and Figure 6 demonstrate that the Pyramidal Information Funneling phenomenon is also evident in both the Mistral and Mixtral models. The results reveal that, akin to Llama-like models, Mistral exhibits a progressively narrowing attention focus across layers. ...

  30. [30]

    While Lee et al

    and a single upper layer (layer 18). While Lee et al. (2024) noted that attention becomes more skewed in upper layers, they did not provide a fine-grained observation of attention patterns across all layers. In contrast, our study reveals several novel findings: • Localized Attention: We observe that attention progressively narrows its focus, targeting spec...

  31. [31]

    We run all the experiments on NVIDIA A100.

    | Dataset | Source | Avg len | Metric | Language | #data |
    |---|---|---|---|---|---|
    | NarrativeQA (Single-Document QA) | Literature, Film | 18,409 | F1 | English | 200 |
    | Qasper (Single-Document QA) | Science | 3,619 | F1 | English | 200 |
    | MultiFieldQA-en (Single-Document QA) | Multi-field | 4,559 | F1 | English | 150 |
    | HotpotQA (Multi-Document QA) | Wikipedia | 9,151 | F1 | English | 200 |
    | 2WikiMultihopQA (Multi-Document QA) | Wikipedia | 4,887 | F1 | English | 200 |
    | MuSiQue (Multi-Document QA) | W... | | | | |

  32. [32]

    α Single-Document QA Multi-Document QA Summarization Few-shot Learning Synthetic Code Avg

    A smaller α value leads to better performance than a larger α value (i.e., 24, 32, 40, 48).

    | α | NrtvQA | Qasper | MF-en | HotpotQA | 2WikiMQA | Musique | GovReport | QMSum | MultiNews | TREC | TriviaQA | SAMSum | PCount | PRe | Lcc | RB-P |
    |---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
    | 8 | 21.40 | 16.92 | 31.62 | 38.45 | 28.72 | 18.59 | 19.96 | 22.49 | 20.96 | 66.50 | 89.35 | 38.43 | 5.92 | 69.00 | 57.86 | ... |

  33. [33]

    β Single-Document QA Multi-Document QA Summarization Few-shot Learning Synthetic Code Avg

    The results in Table 6 show that using a relatively small value of β yields better outcomes, and that PyramidKV is generally robust to the selection of β.

    | β | NrtvQA | Qasper | MF-en | ... |
    |---|---|---|---|---|
    | 20 | 21.40 | 16.92 | 33.7... | |
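The layer-wise budgets that α and β tune can be illustrated with a linear (pyramidal) schedule. This is a hedged sketch, not the paper's exact formula: `pyramidal_budgets` and its use of `beta` as a steepness knob are assumptions for illustration; the only property preserved is that per-layer cache sizes decrease from bottom to top while summing to the total budget.

```python
# Hypothetical sketch of pyramidal KV cache budget allocation.
# Assumption: budgets fall linearly from the bottom layer to the top;
# `beta` (illustrative, not necessarily the paper's beta) sets steepness.
def pyramidal_budgets(total_budget: int, num_layers: int, beta: float = 20.0) -> list[int]:
    avg = total_budget / num_layers
    span = 2 * avg * (1 - 1 / beta)   # bottom-minus-top budget difference
    bottom = avg + span / 2           # budget of layer 0 (largest)
    step = span / max(num_layers - 1, 1)
    budgets = [round(bottom - step * layer) for layer in range(num_layers)]
    budgets[0] += total_budget - sum(budgets)  # absorb rounding drift
    return budgets

sizes = pyramidal_budgets(total_budget=2048, num_layers=32)
assert sum(sizes) == 2048    # total budget exactly preserved
assert sizes[0] > sizes[-1]  # lower layers get more cache than upper ones
```

In this sketch `beta = 1` degenerates to a uniform allocation (span of zero), while larger values steepen the pyramid; only the monotone-decreasing shape is taken from the paper.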

  34. [34]

    The results demonstrated the superior performance of PyramidKV. Furthermore, we demonstrate that MInference and PyramidKV can be seamlessly integrated to achieve highly efficient inference while maintaining performance comparable to full attention. The results of MInference combined with PyramidKV, evaluated on LongBench with a KV cache size of 128, as ...

  35. [35]

    [Prompt length, Generation length]

    Each row shows the setting of using a specific “[Prompt length, Generation length]” combination. We show the inference speed comparison between total inference time, time for the allocation strategy, and time for score-based selection on Llama-3-8B-Instruct. Each cell is the latency measured in seconds. Furthermore, our budget allocation can be calculated befo...
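The score-based selection stage timed above can be sketched as top-k picking over pooled attention scores, in the style of SnapKV's observation window. The function name, window size, and plain sum-pooling are assumptions for illustration, not the paper's exact procedure:

```python
import numpy as np

# Illustrative score-based KV selection for one attention head:
# rank key positions by the attention mass they receive from the
# last `window` queries, then keep the top `budget` of them.
def select_kv_indices(attn: np.ndarray, budget: int, window: int = 8) -> np.ndarray:
    """attn: [num_queries, num_keys] attention weights for one head."""
    scores = attn[-window:].sum(axis=0)   # pool over the observation window
    keep = np.argsort(scores)[-budget:]   # indices of the top-`budget` keys
    return np.sort(keep)                  # restore original key order

rng = np.random.default_rng(0)
attn = rng.random((32, 256))
attn /= attn.sum(axis=1, keepdims=True)   # row-normalize like softmax output
idx = select_kv_indices(attn, budget=64)
assert idx.shape == (64,)
```

In a per-layer scheme, `budget` would be that layer's pyramidal allocation, so lower layers retain more keys than upper ones; this cheap top-k is consistent with the excerpt's point that selection adds little latency.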

  36. [36]

    [Prompt length, Generation length]

    Each row shows the setting of using a specific “[Prompt length, Generation length]” combination. Each cell is the latency measured in seconds. PyramidKV does not sacrifice speed: it provides performance improvements and memory savings while running at a speed comparable to the baselines (i.e., SnapKV (Li et al., 2024), StreamingLLM (Xiao et al.,

  37. [37]

    That’s because the allocation strategy requires very limited additional complexity in the inference/generation phase compared with the computation required for generation, as shown in Appendix L

    and H2O (Zhang et al., 2024)). That’s because the allocation strategy requires very limited additional complexity in the inference/generation phase compared with the computation required for generation, as shown in Appendix L. N PyramidKV Excels in All KV Cache Size Limitations The evaluation results from LongBench (Bai et al.,

  38. [38]

    Attention Recall Rate Experiment

    for different KV cache sizes. Overall, PyramidKV consistently surpasses other methods across a range of KV cache sizes and different backbone models, with its performance advantages becoming particularly pronounced in memory-constrained environments. Upon examining specific tasks, PyramidKV demonstrates notably superior performance on the TREC task, a ...

  39. [39]

    However, with a larger budget (i.e., 2k KV Cache Size), the improvement decreases

    The results show that with a small budget, PyramidKV improves the attention recall rate (the percentage of attention mass computed using the keys retrieved by the method and the query, relative to the attention computed using all keys and the query). However, with a larger budget (i.e., a 2k KV cache size), the improvement decreases. For 64, 128, 256, 512, 1024...
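The attention recall rate defined in parentheses above can be computed directly as the share of full-attention mass that lands on the retained keys. A minimal sketch (the function name and array layout are assumed, not taken from the paper):

```python
import numpy as np

# Attention recall rate: mass the queries place on the retained keys,
# divided by the mass they place on all keys (1.0 means no loss).
def attention_recall(attn: np.ndarray, kept: np.ndarray) -> float:
    """attn: [num_queries, num_keys] full attention; kept: retained key indices."""
    return float(attn[:, kept].sum() / attn.sum())

attn = np.array([[0.7, 0.2, 0.1],
                 [0.5, 0.4, 0.1]])
kept = np.array([0, 1])  # drop key 2
assert np.isclose(attention_recall(attn, kept), 0.9)
```

A method that keeps the keys carrying most attention mass scores near 1.0 even at small budgets, which is exactly the small-budget regime where the excerpt reports PyramidKV's largest gains.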

  40. [40]

    massive attention

    Our findings indicate the absence of "massive attention" in any individual head. Figure 18: Attention patterns of retrieval-augmented generation across heads in the bottom layer in Llama. R PyramidKV Implementation in vLLM To compare the vLLM implementation with the vanilla dense attention backend in terms of throughput, we perform the experiment. We...