{"total":18,"items":[{"citing_arxiv_id":"2606.09508","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"From Rigid to Dynamic: Entropy-Guided Adaptive Inference for Long-Context LLMs","primary_cat":"cs.AI","submitted_at":"2026-06-08T14:02:18+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EntropyInfer adaptively allocates inference compute using per-head attention entropy for rigid/dynamic classification during prefilling and compresses KV cache with generated tokens, achieving up to 2.39x speedup on long contexts.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.23200","ref_index":3,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Adaptive Mass-Segmented KV Compression for Long-Context Reasoning","primary_cat":"cs.LG","submitted_at":"2026-05-22T03:32:52+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"AMS KV compression adaptively partitions the cache by attention mass regions and assigns quotas to protect contiguous reasoning blocks during long-context LLM inference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.20868","ref_index":1,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Runtime-Certified Bounded-Error Quantized Attention","primary_cat":"cs.LG","submitted_at":"2026-05-20T08:04:40+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A tiered KV cache architecture computes per-head per-step error bounds on quantized attention and uses adaptive fallback to guarantee bounded or exact outputs relative to FP16 
reference.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18053","ref_index":16,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Protection Is (Nearly) All You Need: Structural Protection Dominates Scoring in Globally Capped KV Eviction","primary_cat":"cs.LG","submitted_at":"2026-05-18T08:41:34+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":4.0,"formal_verification":"none","one_line_summary":"Structural protection of boundary tokens in globally capped KV cache eviction recovers 69-90% of full-cache quality at 13% retention and dominates differences among scoring policies.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.18856","ref_index":11,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SPHERICAL KV: Angle-Domain Attention and Rate-Distortion Retention for Efficient Long-Context Inference","primary_cat":"cs.LG","submitted_at":"2026-05-13T18:48:48+00:00","verdict":null,"verdict_confidence":null,"novelty_score":null,"formal_verification":null,"one_line_summary":null,"context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.08966","ref_index":18,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"VORT: Adaptive Power-Law Memory for NLP Transformers","primary_cat":"cs.LG","submitted_at":"2026-05-09T14:20:17+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"VORT assigns learnable fractional orders to tokens and approximates their power-law retention kernels via sum-of-exponentials for efficient long-range dependency modeling in transformers.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"scores but do not change the underlying memory update rule, 
leaving the mismatch intact. Current long-context methods partition memory into discrete tiers: eviction caches (H2O [50], SnapKV [25]) make binary keep-or-evict decisions; state-space models (S4 [16], Mamba [15], RWKV [32]) use fixed or mildly input-dependent exponential recurrences; quantised caches [18] reduce precision uniformly; compressive memory [36] applies a fixed schedule. None represents a per-token, content-adaptive power-law decay. Fractional calculus offers the missing ingredient. The Grünwald-Letnikov (GL) weighted sum of order α ∈ (0, 1) assigns weight w_j^(α) ∼ j^(α−1)/Γ(α) to the value at lag j, matching the ARFIMA-optimal prediction weights. The obstacle is that this sum is non-Markovian:"},{"citing_arxiv_id":"2605.03562","ref_index":10,"ref_count":2,"confidence":0.9,"is_internal_anchor":false,"paper_title":"HeadQ: Model-Visible Distortion and Score-Space Correction for KV-Cache Quantization","primary_cat":"cs.LG","submitted_at":"2026-05-05T09:34:51+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":5.0,"formal_verification":"none","one_line_summary":"HeadQ applies score-space logit corrections for keys and attention-weighted surrogates for values to KV-cache quantization, removing 84-94% of excess perplexity in 2-bit key experiments across six models.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2605.02262","ref_index":17,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"WindowQuant: Mixed-Precision KV Cache Quantization based on Window-Level Similarity for VLMs Inference Optimization","primary_cat":"cs.CV","submitted_at":"2026-05-04T06:17:13+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"WindowQuant performs window-adaptive mixed-precision KV cache quantization guided by similarity to the text prompt, with reordering to enable efficient inference in 
VLMs.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[28] proposed KIVI, which applies per-channel quantization to the K cache while keeping the original per-token quantization for the V cache. However, these quantization methods that uniformly quantize tokens to the same bit-width cannot maintain model accuracy effectively. Other researchers have proposed mixed-precision quantization methods to process important tokens [11, 16, 17, 46]. For example, Kim et al. [17] proposed SqueezeLLM, which divides the weights into dense matrices without outliers and sparse matrices containing outliers, then applies low-bit quantization to the dense matrices while keeping the outliers in FP16 precision. Yang et al. [46] proposed MiKV, which identifies unimportant tokens based on attention scores but quantizes them instead of"},{"citing_arxiv_id":"2604.24971","ref_index":2,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"PolyKV: A Shared Asymmetrically-Compressed KV Cache Pool for Multi-Agent LLM Inference","primary_cat":"cs.LG","submitted_at":"2026-04-27T20:10:21+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"A single shared asymmetrically compressed KV cache pool enables up to 15 concurrent LLM agents with 2.91x compression, 97.7% memory reduction, and only +0.57% perplexity increase on Llama-3-8B.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2604.15356","ref_index":4,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"Sequential KV Cache Compression via Probabilistic Language Tries: Beyond the Per-Vector Shannon Limit","primary_cat":"cs.LG","submitted_at":"2026-04-10T22:48:19+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"Sequential KV compression via 
probabilistic language tries and predictive delta coding achieves 3.3-4.3 bits per token entropy, yielding up to 914x better ratios than TurboQuant even with large overhead.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2509.17396","ref_index":14,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"EpiCache: Episodic KV Cache Management for Long-Term Conversation on Resource-Constrained Environments","primary_cat":"cs.CL","submitted_at":"2025-09-22T06:56:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"EpiCache clusters long conversation history into coherent episodes for per-episode KV cache eviction, delivering up to 30% accuracy gains and 3.7x peak memory reduction on LongConvQA tasks under fixed budgets.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2504.19874","ref_index":30,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"TurboQuant: Online Vector Quantization with Near-optimal Distortion Rate","primary_cat":"cs.LG","submitted_at":"2025-04-28T15:05:35+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"TurboQuant achieves near-optimal vector quantization distortion for both MSE and inner products via random rotation and per-coordinate scalar quantization, with a formal proof that it matches lower bounds within a factor of approximately 2.7.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2407.08608","ref_index":26,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"FlashAttention-3: Fast and Accurate Attention with Asynchrony and 
Low-precision","primary_cat":"cs.LG","submitted_at":"2024-07-11T15:44:48+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":7.0,"formal_verification":"none","one_line_summary":"FlashAttention-3 achieves 1.5-2x speedup on H100 GPUs for attention, reaching 740 TFLOPs/s (75% utilization) in FP16 and near 1.2 PFLOPs/s in FP8 while cutting numerical error by 2.6x versus baseline FP8 attention.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2405.04434","ref_index":23,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model","primary_cat":"cs.CL","submitted_at":"2024-05-07T15:56:43+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":7.0,"formal_verification":"none","one_line_summary":"DeepSeek-V2 delivers top-tier open-source LLM performance using only 21B active parameters by compressing the KV cache 93.3% and cutting training costs 42.5% via MLA and DeepSeekMoE.","context_count":1,"top_context_role":"method","top_context_polarity":"use_method","context_text":"three kinds of auxiliary losses, for controlling expert-level load balance (L_ExpBal), device-level load balance (L_DevBal), and communication balance (L_CommBal), respectively. Expert-Level Balance Loss. We use an expert-level balance loss (Fedus et al., 2021; Lepikhin et al., 2021) to mitigate the risk of routing collapse: L_ExpBal = α_1 Σ_{i=1}^{N_r} f_i P_i, (23) f_i = (N_r / (K_r T)) Σ_{t=1}^{T} 1(Token t selects Expert i), (24) P_i = (1/T) Σ_{t=1}^{T} s_{i,t}, (25) where α_1 is a hyper-parameter called expert-level balance factor; 1(·) denotes the indicator function; and T denotes the number of tokens in a sequence. Device-Level Balance Loss. 
In addition to the expert-level balance loss, we additionally design a device-level balance loss to ensure balanced computation across different devices."},{"citing_arxiv_id":"2404.14294","ref_index":219,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"A Survey on Efficient Inference for Large Language Models","primary_cat":"cs.CL","submitted_at":"2024-04-22T15:53:08+00:00","verdict":"ACCEPT","verdict_confidence":"MODERATE","novelty_score":3.0,"formal_verification":"none","one_line_summary":"The paper surveys techniques to speed up and reduce the resource needs of LLM inference, organized by data-level, model-level, and system-level changes, with comparative experiments on representative methods.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"LUT-GEMM [193] ✓ Non-uniform Statistic-based; AWQ [194] ✓ Uniform Search-based ✓; SqueezeLLM [197] ✓ Non-uniform Statistic-based; LLM.int8() [204] ✓ ✓ Uniform Statistic-based; SmoothQuant [205] ✓ ✓ Uniform Statistic-based ✓; RPTQ [207] ✓ ✓ Uniform Statistic-based; OmniQuant [210] ✓ ✓ Uniform Search-based; FlexGen [203] ✓ ✓ Uniform Statistic-based; Atom [212] ✓ ✓ ✓ Uniform Statistic-based; KVQuant [219] ✓ Non-uniform Statistic-based; KIVI [220] ✓ Uniform Statistic-based. [Figure: (a) Weight-only Quantization; (b) Weight-Activation Quantization] Fig."},{"citing_arxiv_id":"2402.02750","ref_index":7,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache","primary_cat":"cs.CL","submitted_at":"2024-02-05T06:06:47+00:00","verdict":"CONDITIONAL","verdict_confidence":"MODERATE","novelty_score":6.0,"formal_verification":"none","one_line_summary":"KIVI applies asymmetric 2-bit quantization 
to KV cache with per-channel keys and per-token values, reducing memory 2.6x and boosting throughput up to 3.47x with near-identical quality on Llama, Falcon, and Mistral.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null},{"citing_arxiv_id":"2312.07104","ref_index":15,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"SGLang: Efficient Execution of Structured Language Model Programs","primary_cat":"cs.AI","submitted_at":"2023-12-12T09:34:27+00:00","verdict":"CONDITIONAL","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"SGLang is a new system that speeds up structured LLM programs by up to 6.4x using RadixAttention for KV cache reuse and compressed finite state machines for output decoding.","context_count":1,"top_context_role":"background","top_context_polarity":"background","context_text":"[13] Guidance AI. A guidance language for controlling large language models. https://github.com/guidance-ai/guidance. Accessed: 2023-11. [14] Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. In International Conference on Learning Representations, 2020. [15] Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun Sophia Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization. arXiv preprint arXiv:2401.18079, 2024. [16] Hugging Face. Text generation inference. https://github.com/huggingface/text-generation-inference. 
Accessed: 2023-11."},{"citing_arxiv_id":"2312.05821","ref_index":10,"ref_count":1,"confidence":0.9,"is_internal_anchor":false,"paper_title":"ASVD: Activation-aware Singular Value Decomposition for Compressing Large Language Models","primary_cat":"cs.CL","submitted_at":"2023-12-10T08:41:24+00:00","verdict":"UNVERDICTED","verdict_confidence":"LOW","novelty_score":6.0,"formal_verification":"none","one_line_summary":"ASVD compresses LLMs by 10-30% and KV caches by 50% via activation-aware SVD that absorbs outliers into transformed weights and calibrates per-layer sensitivity.","context_count":0,"top_context_role":null,"top_context_polarity":null,"context_text":null}],"limit":50,"offset":0}