Keep the Cost Down: A Review on Methods to Optimize LLM's KV-Cache Consumption
9 Pith papers cite this work.
citing papers explorer
-
Post Reasoning: Improving the Performance of Non-Thinking Models at No Cost
Post-Reasoning boosts LLM accuracy by having the model state its answer first and generate its reasoning afterward, reversing the usual reason-then-answer order, and delivers mean relative gains of 17.37% across 117 model-benchmark pairs at no extra cost.
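To make the format reversal concrete, here is a minimal prompt-construction sketch in Python; the exact prompt wording is an assumption for illustration, not taken from the paper.

```python
# Minimal sketch of the answer-first ("post-reasoning") format versus the
# usual chain-of-thought order. The wording below is illustrative only;
# the paper's actual prompts may differ.

def standard_cot_prompt(question: str) -> str:
    # Usual order: reason step by step, then state the answer.
    return (
        f"Question: {question}\n"
        "Think step by step, then give your final answer."
    )

def post_reasoning_prompt(question: str) -> str:
    # Reversed order: commit to an answer first, justify it afterward.
    return (
        f"Question: {question}\n"
        "State your final answer immediately, then explain the reasoning "
        "that supports it."
    )

print(post_reasoning_prompt("What is 17 * 24?"))
```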
-
Rethinking KV Cache Eviction via a Unified Information-Theoretic Objective
KV cache eviction is unified under an information-capacity-maximization principle derived from a linear-Gaussian attention surrogate, with CapKV proposed as a leverage-score-based implementation that outperforms prior heuristics in experiments.
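The summary does not spell out CapKV's exact scoring rule, so the following is only a generic leverage-score eviction sketch: score each cached key by its statistical leverage (squared row norms of U in the thin SVD) and keep the top-scoring entries.

```python
import numpy as np

def leverage_scores(K: np.ndarray) -> np.ndarray:
    # Statistical leverage of each row of K (n_tokens x d_head):
    # squared row norms of U from the thin SVD K = U S V^T.
    U, _, _ = np.linalg.svd(K, full_matrices=False)
    return (U ** 2).sum(axis=1)

def evict_by_leverage(K: np.ndarray, V: np.ndarray, budget: int):
    # Keep the `budget` cache entries with the highest leverage scores.
    keep = np.argsort(leverage_scores(K))[-budget:]
    keep.sort()  # preserve the positional order of retained tokens
    return K[keep], V[keep]

rng = np.random.default_rng(0)
K, V = rng.normal(size=(128, 64)), rng.normal(size=(128, 64))
K_small, V_small = evict_by_leverage(K, V, budget=32)
print(K_small.shape, V_small.shape)  # (32, 64) (32, 64)
```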
-
How Much Cache Does Reasoning Need? Depth-Cache Tradeoffs in KV-Compressed Transformers
For k-hop pointer chasing under a KV-cache budget of s, transformer depth must scale as ceil(k/s) * log n: the lower bound is conjectured, the upper bound is proved via a windowed pointer-doubling construction, and an error separation between adaptive and oblivious compression is established.
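Plugging numbers into the stated bound shows how the tradeoff behaves; treating the logarithm as base 2 and dropping constant factors are both assumptions made here for illustration.

```python
import math

def depth_bound(k: int, s: int, n: int) -> int:
    # Depth ~ ceil(k/s) * log(n), up to constants (base-2 log assumed).
    return math.ceil(k / s) * math.ceil(math.log2(n))

# 16-hop pointer chasing over n = 4096 positions:
for s in (1, 4, 16):
    print(f"cache size {s:2d}: depth ~ {depth_bound(16, s, 4096)}")
# Shrinking the cache from 16 to 1 inflates the required depth 16x.
```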
-
TriAttention: Efficient Long Reasoning with Trigonometric KV Compression
TriAttention compresses the KV cache by exploiting the stable pre-RoPE concentration of queries and keys together with trigonometric distance preferences, matching full-attention reasoning accuracy with far lower memory and higher speed.
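TriAttention's actual selection rule is not reproduced here; the sketch below is a deliberately simplified stand-in that ranks keys by pre-RoPE query-key alignment. The scoring is an assumption, and the trigonometric distance term is omitted entirely.

```python
import numpy as np

def select_keys_pre_rope(Q: np.ndarray, K: np.ndarray, budget: int):
    # Hypothetical sketch: rank pre-RoPE keys by cosine similarity to the
    # mean pre-RoPE query direction and keep the top `budget`. TriAttention
    # additionally weights trigonometric (RoPE rotation) distance, which is
    # not modeled here.
    q_bar = Q.mean(axis=0)
    q_bar /= np.linalg.norm(q_bar)
    K_unit = K / np.linalg.norm(K, axis=1, keepdims=True)
    scores = K_unit @ q_bar
    return np.sort(np.argsort(scores)[-budget:])  # retained key indices

rng = np.random.default_rng(1)
Q = rng.normal(size=(8, 64))     # pre-RoPE queries
K = rng.normal(size=(256, 64))   # pre-RoPE keys
print(select_keys_pre_rope(Q, K, budget=64)[:8])
```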
-
On the (In-)Security of the Shuffling Defense in the Transformer Secure Inference
An attack aligns the differently shuffled intermediate activations exposed across secure Transformer inference queries, recovering model weights with low error from roughly one dollar's worth of queries.
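A toy version of the alignment step, assuming noise-free leakage of a single activation matrix (the real attack must cope with protocol noise and full-scale models): two queries expose the same activations under different secret row shuffles, and row matching recovers the relative permutation.

```python
import numpy as np

rng = np.random.default_rng(2)
H = rng.normal(size=(64, 32))           # true activations, unknown to attacker
p1, p2 = rng.permutation(64), rng.permutation(64)
A1, A2 = H[p1], H[p2]                   # what the two queries expose

# For each row of A1, find its nearest row in A2
# (exact here; the real setting is noisy).
dists = ((A1[:, None, :] - A2[None, :, :]) ** 2).sum(axis=-1)
match = dists.argmin(axis=1)            # A1[i] corresponds to A2[match[i]]

assert (p1 == p2[match]).all()          # relative permutation recovered
print(f"aligned all {len(match)} shuffled rows")
```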
-
Saliency-R1: Enforcing Interpretable and Faithful Vision-language Reasoning via Saliency-map Alignment Reward
Saliency-R1 combines a novel saliency-map technique with GRPO, using overlap against human-annotated bounding boxes as the reward, to improve the faithfulness and interpretability of VLM reasoning.
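A minimal sketch of the overlap reward, assuming it is plain intersection-over-union between the model's saliency-derived box and the human annotation (the paper's exact reward shaping may differ):

```python
def iou(box_a, box_b):
    # Intersection-over-union of two (x1, y1, x2, y2) boxes.
    x1, y1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    x2, y2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter) if inter else 0.0

# Hypothetical GRPO reward: overlap of the saliency-derived box with the
# human bounding box.
print(round(iou((10, 10, 50, 50), (30, 30, 70, 70)), 3))  # 0.143
```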
-
EchoKV: Efficient KV Cache Compression via Similarity-Based Reconstruction
EchoKV compresses LLM KV caches by reconstructing missing components from partial data via inter- and intra-layer attention similarities, outperforming prior methods on LongBench and RULER while supporting on-demand full-cache inference.
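EchoKV's inter-/intra-layer scheme is not detailed in the summary; the sketch below only illustrates the general similarity-based reconstruction idea, with all specifics (softmax weighting, the probe keys standing in for missing entries) assumed for illustration.

```python
import numpy as np

def reconstruct_values(K_kept, V_kept, K_probe, temp=8.0):
    # Hypothetical sketch: approximate each missing value as a
    # softmax-similarity-weighted mix of retained values, using key
    # similarity as the weighting signal.
    sims = (K_probe @ K_kept.T) / temp            # (m, n_kept)
    w = np.exp(sims - sims.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return w @ V_kept                             # (m, d_head)

rng = np.random.default_rng(3)
K_kept, V_kept = rng.normal(size=(96, 64)), rng.normal(size=(96, 64))
K_probe = rng.normal(size=(4, 64))                # stand-ins for missing keys
print(reconstruct_values(K_kept, V_kept, K_probe).shape)  # (4, 64)
```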
-
Stop Overthinking: A Survey on Efficient Reasoning for Large Language Models
A survey that organizes techniques for efficient reasoning in LLMs, centered on shortening chain-of-thought outputs.
-
Agents Should Replace Narrow Predictive AI as the Orchestrator in 6G AI-RAN
A position paper proposing that LLM agents replace fragmented, narrow predictive AI models as the cognitive orchestrator in the RAN Intelligent Controller, en route to Level 5 autonomous 6G networks.