Recognition: 2 theorem links
PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling
Pith reviewed 2026-05-12 09:49 UTC · model grok-4.3
The pith
LLMs funnel attention from wide scattering in lower layers to focused critical tokens in higher layers, enabling PyramidKV to compress the KV cache to 12% of its full size while matching full-cache performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Attention-based information flow in LLMs follows Pyramidal Information Funneling: attention scatters widely in lower layers, progressively consolidates, and focuses on critical tokens in higher layers. PyramidKV exploits this by dynamically adjusting KV cache sizes across layers, allocating more in lower layers and less in higher ones, rather than using uniform sizes.
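The exact per-layer budget formula is not reproduced in this summary, so the allocation idea is sketched below under stated assumptions: a linear (arithmetic) decrease in per-layer budget from the bottom layer to the top that preserves the same total as a uniform allocation. The function name, the `beta` steepness parameter, and the linear schedule are illustrative choices, not the paper's definitions.

```python
# Illustrative sketch, not the paper's exact allocation: per-layer KV budgets
# that decrease linearly with depth while keeping the total equal to what a
# uniform per-layer allocation would use.
def pyramidal_budgets(num_layers: int, total_budget: int, beta: float = 0.2) -> list[int]:
    """Return one KV-cache budget (in tokens) per layer, summing to ~total_budget.

    beta sets the steepness of the pyramid: the top layer keeps roughly beta
    times the average per-layer budget, the bottom layer correspondingly more.
    """
    avg = total_budget / num_layers
    top = beta * avg                       # smallest budget (highest layer)
    bottom = 2 * avg - top                 # largest budget (lowest layer)
    step = (bottom - top) / max(num_layers - 1, 1)
    return [round(bottom - i * step) for i in range(num_layers)]

if __name__ == "__main__":
    budgets = pyramidal_budgets(num_layers=32, total_budget=32 * 128)
    print(budgets[0], budgets[-1], sum(budgets))  # more cache low, less high
```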
What carries the argument
Pyramidal Information Funneling: the pattern of wide attention in lower layers consolidating to critical tokens (attention sinks) in higher layers, which justifies non-uniform KV cache allocation.
Load-bearing premise
The pyramidal information funneling pattern is consistent across models, tasks, and context lengths, allowing fixed layer-wise retention ratios to work effectively without per-task retuning.
What would settle it
If attention patterns on a new model or task show no increasing focus on critical tokens in higher layers, or if uniform KV cache sizes outperform the pyramidal layer-wise allocation on LongBench or needle retrieval.
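A quick way to probe that premise on a new model is to measure how concentrated attention is at each layer. Below is a minimal diagnostic sketch, assuming a Hugging Face causal LM with attentions exposed; the choice of gpt2 as a stand-in model and per-layer attention entropy of the final query position as a proxy for "focus" are my assumptions, not the paper's protocol.

```python
# Minimal diagnostic sketch: per-layer attention entropy of the last token.
# If pyramidal funneling holds, entropy should broadly fall toward higher layers.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "gpt2"  # placeholder; any causal LM that returns attentions works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name).eval()

text = "The quick brown fox jumps over the lazy dog. " * 20
inputs = tok(text, return_tensors="pt", truncation=True, max_length=512)

with torch.no_grad():
    out = model(**inputs, output_attentions=True)

for layer_idx, attn in enumerate(out.attentions):   # each: (batch, heads, q, k)
    last_q = attn[0, :, -1, :]                       # final token's attention row
    entropy = -(last_q * (last_q + 1e-12).log()).sum(-1).mean()
    print(f"layer {layer_idx:2d}  mean attention entropy {entropy:.3f}")
```

If the entropy does not decrease toward the upper layers for a given model or task, the fixed pyramidal allocation loses its justification there, which is exactly the failure mode named above.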
Original abstract
In this study, we investigate whether attention-based information flow inside large language models (LLMs) is aggregated through noticeable patterns for long context processing. Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers. Motivated by these insights, we developed PyramidKV, a novel and effective KV cache compression method. This approach dynamically adjusts the KV cache size across different layers, allocating more cache in lower layers and less in higher ones, diverging from traditional methods that maintain a uniform KV cache size. Our experimental evaluations, utilizing the LongBench benchmark, show that PyramidKV matches the performance of models with a full KV cache while retaining only 12% of the KV cache, thus significantly reducing memory usage. In scenarios emphasizing memory efficiency, where only 0.7% of the KV cache is maintained, PyramidKV surpasses other KV cache compression techniques, achieving up to a 20.5 absolute accuracy improvement on TREC dataset. In the Needle-in-a-Haystack experiment, PyramidKV outperforms competing methods in maintaining long-context comprehension in LLMs; notably, retaining just 128 KV cache entries enables the LLAMA-3-70B model to achieve 100.0 Acc. performance.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper observes a 'pyramidal information funneling' pattern in LLM attention (wide scattering in lower layers, progressive consolidation, and focus on critical tokens/sinks in higher layers). Motivated by this, it introduces PyramidKV, a dynamic KV-cache compression scheme that allocates more cache entries to lower layers and fewer to higher layers. On LongBench it matches full-cache performance at 12% retention and outperforms prior compression methods by up to 20.5 points on TREC at 0.7% retention; on Needle-in-a-Haystack, 128 entries yield 100% accuracy for Llama-3-70B.
Significance. If the funneling pattern proves consistent and the layer-wise ratios generalize, PyramidKV would provide a simple, observation-driven route to substantial memory savings for long-context inference. The empirical gains on standard benchmarks are concrete and the method avoids heavy per-task tuning, which is a practical strength in the KV-compression literature.
Major comments (3)
- §4.1 (LongBench results): absolute accuracy deltas (e.g., +20.5 on TREC at 0.7% cache) are reported without error bars, multiple random seeds, or statistical significance tests, so it is impossible to judge whether the claimed superiority over baselines is robust.
- §3.2 (Method): the layer-wise retention ratios are asserted to follow from the pyramidal observation, yet the manuscript supplies neither the exact selection procedure (attention-score thresholds, entropy heuristics, or manual tuning) nor any quantitative verification that these fixed ratios remain effective across context lengths or tasks outside the reported benchmarks.
- §4.3 (Needle-in-a-Haystack): the 100.0 accuracy claim with 128 entries for Llama-3-70B is presented without specifying needle-position distribution, number of trials, or a direct full-cache baseline under identical prompting, weakening the reproducibility of the extreme-compression result.
Minor comments (3)
- Abstract and §2: the phrase 'massive activation or attention sink' should cite the original attention-sink literature for clarity.
- Figure captions: attention heat-map figures lack explicit layer indexing and token-position labels, making the funneling pattern harder to inspect.
- §4: exact per-layer retention percentages or the formula used to derive them are not tabulated, impeding direct reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We have carefully considered each comment and provide point-by-point responses below, along with plans for revisions to address the concerns raised.
Point-by-point responses
- Referee: §4.1 (LongBench results): absolute accuracy deltas (e.g., +20.5 on TREC at 0.7% cache) are reported without error bars, multiple random seeds, or statistical significance tests, so it is impossible to judge whether the claimed superiority over baselines is robust.
  Authors: We agree that the absence of error bars and multi-seed results makes it difficult to assess robustness. In the revised manuscript, we will include a discussion of this limitation and provide results from at least three random seeds for the key experiments, along with standard deviations. (Revision: yes)
- Referee: §3.2 (Method): the layer-wise retention ratios are asserted to follow from the pyramidal observation, yet the manuscript supplies neither the exact selection procedure (attention-score thresholds, entropy heuristics, or manual tuning) nor any quantitative verification that these fixed ratios remain effective across context lengths or tasks outside the reported benchmarks.
  Authors: The ratios are chosen to reflect the observed pyramidal funneling, with higher retention in lower layers where attention is more scattered. We will expand Section 3.2 to describe the exact procedure used to select these ratios and include additional experiments or analysis demonstrating their effectiveness on a broader range of context lengths and tasks. (Revision: yes)
- Referee: §4.3 (Needle-in-a-Haystack): the 100.0 accuracy claim with 128 entries for Llama-3-70B is presented without specifying needle-position distribution, number of trials, or a direct full-cache baseline under identical prompting, weakening the reproducibility of the extreme-compression result.
  Authors: We will revise the description in Section 4.3 to detail the needle insertion positions (randomly distributed), the number of evaluation trials, and explicitly state the full KV cache performance under the same conditions for direct comparison. (Revision: yes)
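The referee's second point asks how each layer's budget is actually filled. Methods in this family typically use a SnapKV-style selection: always keep a recent window of tokens and, for the rest of the budget, keep the prefix tokens that received the most attention from queries in that window. The sketch below is my own illustration of that heuristic; the window size, head averaging, and function name are assumptions, not PyramidKV's exact procedure.

```python
# Illustrative SnapKV-style selection within one layer's budget (assumptions
# noted above): keep the last `window` tokens plus the most-attended prefix
# tokens, scored by attention from the recent-window queries.
import torch

def select_kv_indices(attn: torch.Tensor, budget: int, window: int = 8) -> torch.Tensor:
    """attn: (num_heads, seq_len, seq_len) attention weights for one layer.
    Assumes budget > window. Returns sorted indices of positions to keep."""
    seq_len = attn.shape[-1]
    if budget >= seq_len:
        return torch.arange(seq_len)
    recent = torch.arange(seq_len - window, seq_len)           # always kept
    # Attention mass each prefix position receives from the recent queries,
    # averaged over heads and summed over those queries.
    scores = attn[:, -window:, : seq_len - window].mean(0).sum(0)
    keep_prefix = scores.topk(budget - window).indices
    return torch.sort(torch.cat([keep_prefix, recent])).values

# Toy usage with random row-stochastic "attention", just to show shapes.
toy_attn = torch.softmax(torch.randn(8, 128, 128), dim=-1)
print(select_kv_indices(toy_attn, budget=32, window=8))
```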
Circularity Check
No significant circularity; empirical method grounded in observations and benchmarks
Full rationale
The paper begins with direct observations of attention patterns (pyramidal funneling) across layers in LLMs, uses these to motivate a layer-wise dynamic KV cache allocation heuristic, and validates the resulting PyramidKV method through external benchmarks (LongBench, Needle-in-a-Haystack, TREC). No equations, fitted parameters, or self-citations are presented as 'predictions' that reduce by construction to the inputs; the performance claims rest on empirical results rather than any definitional equivalence or load-bearing self-reference. The derivation chain is therefore self-contained and non-circular.
Axiom & Free-Parameter Ledger
Free parameters (1)
- layer-wise KV retention ratios
Axioms (1)
- Domain assumption: LLMs exhibit consistent pyramidal information funneling across layers
Lean theorems connected to this paper
- HierarchyEmergence: hierarchy_emergence_forces_phi (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "Our observations reveal that LLMs aggregate information through Pyramidal Information Funneling where attention is scattering widely in lower layers, progressively consolidating within specific contexts, and ultimately focusing on critical tokens (a.k.a massive activation or attention sink) in higher layers."
- HierarchyRealization: realized_hierarchy_forces_phi (echoes)
  ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.
  "PyramidKV allocates more KV cache to the lower layers where information is more dispersed and each KV state contains less information, while reducing the KV cache in higher layers where information becomes concentrated in fewer key tokens."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 32 Pith papers
- FibQuant: Universal Vector Quantization for Random-Access KV-Cache Compression
  FibQuant is a universal fixed-rate vector quantizer for KV-cache compression that uses a radial-angular codebook matched to the spherical-Beta source after Haar rotation and strictly outperforms scalar quantization at...
- How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
  Transformers need depth scaling as the product of ceil(k/s) and log n terms for k-hop pointer chasing under cache size s, with a conjectured lower bound, proved upper bound via windowed pointer doubling, and an adapti...
- Sparse Prefix Caching for Hybrid and Recurrent LLM Serving
  Sparse prefix caching via dynamic programming for optimal checkpoint placement under overlap distributions improves the Pareto frontier for recurrent and hybrid LLM serving on shared-prefix data.
- Transactional Attention: Semantic Sponsorship for KV-Cache Retention
  Transactional Attention uses semantic sponsorship from anchor patterns to retain dormant critical tokens in KV caches, achieving 100% credential retrieval at 16 tokens where all prior methods fail.
- Attention Sink in Transformers: A Survey on Utilization, Interpretation, and Mitigation
  The first survey on Attention Sink in Transformers structures the literature around fundamental utilization, mechanistic interpretation, and strategic mitigation.
- TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
  TriAttention compresses KV cache by exploiting stable pre-RoPE Q/K concentration and trigonometric distance preferences to match full-attention reasoning accuracy with far lower memory and higher speed.
- Self-Pruned Key-Value Attention: Learning When to Write by Predicting Future Utility
  SP-KV trains a utility predictor jointly with the LLM to dynamically prune low-utility KV cache entries, achieving 3-10x memory reduction during generation with negligible performance loss.
- Pyramid Forcing: Head-Aware Pyramid KV Cache Policy for High-Quality Long Video Generation
  Pyramid Forcing classifies attention heads into Anchor, Wave, and Veil types and applies type-specific KV cache policies to improve long-horizon autoregressive video generation quality.
- KV-Fold: One-Step KV-Cache Recurrence for Long-Context Inference
  KV-Fold turns frozen transformers into stable long-context models by folding the KV cache across sequence chunks in repeated forward passes.
- KV-RM: Regularizing KV-Cache Movement for Static-Graph LLM Serving
  KV-RM regularizes KV-cache movement in static-graph LLM serving via block paging and merge-staged transport to improve throughput, tail latency, and memory use for variable-length decoding.
- Make Each Token Count: Towards Improving Long-Context Performance with KV Cache Eviction
  A unified learnable KV eviction policy with cross-layer calibration reduces memory and matches or exceeds full-cache performance on long-context tasks by retaining useful tokens and limiting attention dilution.
- Not All Thoughts Need HBM: Semantics-Aware Memory Hierarchy for LLM Reasoning
  A semantics-aware KV cache hierarchy offloads tokens to slower memory with zero approximation error, demonstrating that LLM reasoning accuracy depends only on the permanent eviction ratio and not on HBM residency.
- ReST-KV: Robust KV Cache Eviction with Layer-wise Output Reconstruction and Spatial-Temporal Smoothing
  ReST-KV formulates KV eviction as layer-wise output reconstruction optimization with spatial-temporal smoothing, outperforming baselines by 2.58% on LongBench and 15.2% on RULER while cutting decoding latency by 10.61...
- Slipstream: Trajectory-Grounded Compaction Validation for Long-Horizon Agents
  Slipstream uses asynchronous compaction with trajectory-grounded judge validation to improve long-horizon agent accuracy by up to 8.8 percentage points and reduce latency by up to 39.7%.
- RDKV: Rate-Distortion Bit Allocation for Joint Eviction and Quantization of the KV Cache
  RDKV derives per-token and per-channel weights from attention distortion, then uses reverse water-filling to assign bit-widths from full precision to zero after prefilling, recovering 97.81% accuracy with 2.48% cache ...
- Reformulating KV Cache Eviction Problem for Long-Context LLM Inference
  LaProx reformulates KV cache eviction as an output-aware matrix approximation, enabling a unified global token selection strategy that preserves LLM performance at 5% cache size across long-context benchmarks.
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
  Louver is a new index structure that guarantees zero false negatives for sparse attention in LLM KV caches by casting the problem as halfspace range searching.
- Sparse Attention as a Range Searching Problem: Towards an Inference-Efficient Index for KV Cache
  Louver is a new index for LLM KV caches that guarantees zero false negatives for keys above a relevance threshold, runs faster than prior sparse and some dense attention methods, and integrates lightly into existing p...
- Shallow Prefill, Deep Decoding: Efficient Long-Context Inference via Layer-Asymmetric KV Visibility
  SPEED uses layer-asymmetric KV visibility to process non-anchor prompt tokens only in lower layers during prefill, achieving near-baseline quality on Llama-3.1-8B with 33% better TTFT and 25% lower active KV memory at...
- Salca: A Sparsity-Aware Hardware Accelerator for Efficient Long-Context Attention Decoding
  Salca is a new ASIC accelerator that achieves 3.82× speedup and 74.19× energy efficiency over A100 for long-context attention via dual-compression dynamic sparse attention and pipelined hardware.
- DASH-KV: Accelerating Long-Context LLM Inference via Asymmetric KV Cache Hashing
  DASH-KV accelerates long-context LLM inference to linear complexity via asymmetric KV cache hashing and mixed-precision retention, matching full attention performance on LongBench.
- Graph-Guided Adaptive Channel Elimination for KV Cache Compression
  GRACE reframes KV cache channel pruning as graph optimization to find a near-optimal subset, achieving 60% compression with negligible degradation and outperforming prior methods.
- RetentiveKV: State-Space Memory for Uncertainty-Aware Multimodal KV Cache Eviction
  RetentiveKV uses entropy to drive state-space model transitions that retain and reactivate low-attention visual tokens in a continuous memory instead of pruning them, delivering 5x KV cache compression and 1.5x faster...
- CodecSight: Leveraging Video Codec Signals for Efficient Streaming VLM Inference
  CodecSight reuses video codec signals for online patch pruning before the vision transformer and selective KV-cache refresh in the LLM, delivering up to 3x higher throughput and 87% lower GPU compute than prior baseli...
- eOptShrinkQ: Near-Lossless KV Cache Compression Through Optimal Spectral Denoising and Quantization
  eOptShrinkQ compresses KV caches to ~2.2 bits per entry via optimal spectral shrinkage and quantization, outperforming prior methods on LongBench while matching FP16 on multi-needle retrieval.
- LightThinker++: From Reasoning Compression to Memory Management
  LightThinker++ adds explicit adaptive memory management and a trajectory synthesis pipeline to LLM reasoning, cutting peak token use by ~70% while gaining accuracy in standard and long-horizon agent tasks.
- Stochastic KV Routing: Enabling Adaptive Depth-Wise Cache Sharing
  Stochastic training with random cross-layer KV attention enables depth-wise cache sharing in transformers, cutting memory footprint while preserving or improving performance.
- An Efficient Hybrid Sparse Attention with CPU-GPU Parallelism for Long-Context Inference
  Fluxion achieves 1.5x-3.7x speedup in long-context LLM inference with CPU KV caches while limiting accuracy degradation to at most 0.26 relative to full attention.
- HieraSparse: Hierarchical Semi-Structured Sparse KV Attention
  HieraSparse delivers a hierarchical semi-structured sparse KV attention system that achieves 1.2x KV compression and 4.57x decode attention speedup versus prior unstructured sparsity methods at equivalent sparsity, pl...
- From Business Events to Auditable Decisions: Ontology-Governed Graph Simulation for Enterprise AI
  LOM-action uses business events to drive ontology-governed graph simulations that generate auditable decisions, reporting 93.82% accuracy and 98.74% tool-chain F1 versus 24-36% F1 for frontier LLMs.
- AudioKV: KV Cache Eviction in Efficient Large Audio Language Models
  AudioKV prioritizes audio-critical attention heads identified via ASR analysis and applies spectral score smoothing to evict KV cache tokens, achieving high compression with minimal accuracy loss in LALMs.
- RetroInfer: A Vector Storage Engine for Scalable Long-Context LLM Inference