OScaR: The Occam's Razor for Extreme KV Cache Quantization in LLMs and Beyond

Zunhai Su; Rui Yang; Chao Zhang; Yaxiu Liu; Yifan Zhang; Wei Wu; Jing Xiong; Dayou Du; Xialie Zhuang; Yulei Qian; Yuchen Xie; Yik-Chung Wu; Hongxia Yang; Ngai Wong

arxiv: 2605.19660 · v1 · pith:7KAAPLDQnew · submitted 2026-05-19 · 💻 cs.LG · cs.CL

Pith reviewed 2026-05-20 07:53 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords KV cache quantization · INT2 quantization · Token Norm Imbalance · Canalized Rotation · Omni-Token Scaling · LLM inference · Memory compression · Decoding speedup

The pith

OScaR fixes token norm imbalance through canalized rotation and omni-token scaling to reach near-lossless INT2 KV cache quantization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that standard per-channel quantization fails at extreme low-bit settings because tokens with very different norms force shared scale factors to compromise on accuracy. It identifies Token Norm Imbalance as the main remaining error source after channel-wise outliers are already handled. OScaR counters this with a lightweight sequence of canalized rotation followed by omni-token scaling, which equalizes norms without adding heavy per-token overhead or model-specific retuning. The result is a simple, universal method that works across text, multimodal, and omni-modal models while delivering measurable speed and memory gains at inference time. A sympathetic reader would care because KV cache size is now the dominant barrier to long-context and multimodal deployment, and a low-complexity fix that stays near lossless at INT2 would materially widen practical use of these models.
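To make the failure mode concrete, here is a minimal sketch in plain Python of a single channel quantized to INT2 with one shared range. The sample values, the helper name `int2_roundtrip`, and the min/max asymmetric quantizer are illustrative assumptions, not the paper's implementation.

```python
def int2_roundtrip(values):
    """Quantize one channel to 4 levels (INT2) using a shared min/max range,
    then dequantize. Returns (step, reconstructed values)."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / 3 or 1.0  # 2 bits -> 4 levels -> 3 intervals
    codes = [round((v - lo) / step) for v in values]
    return step, [lo + c * step for c in codes]


# One channel, four tokens with similar magnitudes (hypothetical values).
balanced = [-1.0, -0.4, 0.3, 0.9]
# The same channel after one high-norm token joins the quantization group.
imbalanced = balanced + [20.0]

step_b, rec_b = int2_roundtrip(balanced)
step_i, rec_i = int2_roundtrip(imbalanced)

# Worst reconstruction error on the four small tokens in each case.
worst_b = max(abs(v - r) for v, r in zip(balanced, rec_b))
worst_i = max(abs(v - r) for v, r in zip(balanced, rec_i[:4]))

print(f"balanced:   step={step_b:.3f}, worst error={worst_b:.3f}")
print(f"imbalanced: step={step_i:.3f}, worst error={worst_i:.3f}")
```

In the balanced case the four tokens land on four distinct levels. Once the high-norm token joins the group, the shared step grows to 7.0 and all four small tokens collapse onto the same level, which is the compromise the paragraph above describes.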

Core claim

By advancing the per-channel paradigm with Canalized Rotation followed by Omni-Token Scaling, OScaR removes the sequence-dimensional variance caused by Token Norm Imbalance, enabling near-lossless INT2 quantization of the KV cache across X-LLMs with lower complexity than prior pipelines.

What carries the argument

Canalized Rotation plus Omni-Token Scaling inside the OScaR framework, which equalizes token norms before quantization so that shared per-channel scales incur less error.

If this is right

Near-lossless performance is maintained at INT2 across text-only, multimodal, and omni-modal LLMs without per-model retuning.
Decoding speed reaches up to 3.0x, memory footprint drops by up to 5.3x, and throughput increases by up to 4.1x relative to BF16 FlashDecoding-v2.
The method defines a new low-complexity Pareto front that outperforms more intricate quantization pipelines.
The same two-step correction applies uniformly to long-context and multi-modal settings without added sequence-level distortion.
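For intuition on why an INT2 cache does not give the ideal 8x reduction over BF16, the following back-of-envelope sketch counts the per-group metadata (scale and zero-point) alongside the 2-bit codes. The group sizes and the 32 bits of metadata per group are assumptions chosen for illustration; the page does not state the paper's settings, and the reported up-to-5.3x figure is a measured footprint that need not decompose this way.

```python
def effective_bits(code_bits, group_size, metadata_bits):
    """Average storage per cached element: the integer code plus the
    per-group metadata amortized over the group."""
    return code_bits + metadata_bits / group_size


def compression_vs_bf16(code_bits, group_size, metadata_bits=32):
    """Ratio of BF16 storage (16 bits per element) to quantized storage.
    metadata_bits=32 assumes a 16-bit scale and a 16-bit zero-point."""
    return 16 / effective_bits(code_bits, group_size, metadata_bits)


for group_size in (32, 64, 128):
    ratio = compression_vs_bf16(2, group_size)
    print(f"group size {group_size:>3}: {ratio:.2f}x smaller than BF16")
```

Under these assumptions, smaller groups track local statistics more closely but pay more metadata per element, so the achievable ratio sits between roughly 5x and 7x rather than at 8x.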

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same norm-equalization steps could be tested on activation tensors or weight matrices where similar norm spreads appear at low precision.
If token-norm variance grows with context length, the speedup and memory gains would compound for very long sequences.
Because the correction is sequence-aware yet lightweight, it might integrate directly into existing CUDA kernels for other compression ratios beyond INT2.

Load-bearing premise

Token Norm Imbalance remains the dominant source of quantization error once channel-wise outliers are handled, and the rotation-plus-scaling steps correct it without creating new sequence-level distortions or needing per-model retuning.

What would settle it

Measure whether the residual quantization error after per-channel scaling still correlates strongly with the range of per-token norms inside each channel; if the correlation disappears or performance does not improve when that range is artificially reduced, the central claim would be falsified.
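A rough version of this check can be sketched on synthetic data. The code below builds token groups with increasing norm spread, applies per-channel INT2 quantization, and reports the Pearson correlation between the log norm ratio of a group and its mean relative error. The Gaussian data, group shape, and min/max quantizer are all assumptions; a real test would use Key tensors captured from a model.

```python
import math
import random


def per_channel_int2(tokens):
    """Min/max INT2 round trip with one range per channel, shared by all tokens."""
    restored = [list(t) for t in tokens]
    for c in range(len(tokens[0])):
        col = [t[c] for t in tokens]
        lo, hi = min(col), max(col)
        step = (hi - lo) / 3 or 1.0
        for i, x in enumerate(col):
            restored[i][c] = lo + round((x - lo) / step) * step
    return restored


def norm(t):
    return math.sqrt(sum(x * x for x in t))


def pearson(xs, ys):
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)


random.seed(0)
log_ratios, errors = [], []
for max_scale in (1, 2, 4, 8, 16, 32):
    for _ in range(5):
        # 32 tokens x 16 channels; token scales are log-uniform in [1, max_scale].
        group = []
        for _ in range(32):
            s = max_scale ** random.random()
            group.append([random.gauss(0, 1) * s for _ in range(16)])
        restored = per_channel_int2(group)
        norms = [norm(t) for t in group]
        rel = [norm([a - b for a, b in zip(t, r)]) / n
               for t, r, n in zip(group, restored, norms)]
        log_ratios.append(math.log(max(norms) / min(norms)))
        errors.append(sum(rel) / len(rel))

r = pearson(log_ratios, errors)
print(f"correlation between norm spread and relative error: {r:.2f}")
```

A strong positive correlation on real activations would be consistent with the premise; a correlation near zero would count against it.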

Figures

Figures reproduced from arXiv: 2605.19660 by Chao Zhang, Dayou Du, Hongxia Yang, Jing Xiong, Ngai Wong, Rui Yang, Wei Wu, Xialie Zhuang, Yaxiu Liu, Yifan Zhang, Yik-Chung Wu, Yuchen Xie, Yulei Qian, Zunhai Su.

**Figure 2.** Visualization of Key and Value magnitude patterns and the KIVI quantization scheme [ …

**Figure 3.** L2 norm distributions (top row) and heatmaps (bottom row) of Query, Key, and Value states.

**Figure 4.** Conceptual overview of OScaR. The detailed algorithm is presented in Algorithm 1.

**Figure 5.** Key magnitude (top row) and L2 norm distribution (bottom row) across different processing …

**Figure 6.** Efficiency analysis of OScaR against BF16 FlashDecoding-v2. Annotations highlight OScaR’s performance at 128K context length (latency) and batch size 48 (throughput and memory).

**Figure 7.** L2 norm distributions (row 1), value heatmaps (row 2), and attention maps (row 3) of …

**Figure 8.** Example image used as visual input. As discussed in Section 4.1, we visualize the L2 norm distributions of Query, Key, and Value states in multi-modal LLMs. The input is formatted using the model’s chat template with add_generation_prompt=True.

**Figure 9.** Pareto front analysis of KV cache quantization methods on Qwen3-8B. The x-axis represents …

**Figure 10.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 11.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 12.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 13.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 14.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 15.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 16.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 17.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 18.** L2 norm distributions (top row) and value heatmaps (bottom row) of Query, Key, and …

**Figure 19.** L2 norm distributions of Query, Key, and Value states in Layer 24 of Qwen-3-VL-8B, …

**Figure 20.** L2 norm distributions of Query, Key, and Value states in Layer 0 of Qwen-3-VL-8B, …

**Figure 21.** L2 norm distributions of Query, Key, and Value states in Layer 15 of Qwen-3-VL-8B, …

**Figure 22.** Key magnitude (top row) and L2 norm distribution (bottom row) across different processing …

**Figure 23.** Key magnitude (top row) and L2 norm distribution (bottom row) across different processing …

**Figure 24.** Key magnitude (top row) and L2 norm distribution (bottom row) across different processing …

**Figure 25.** Token norm distribution on Llama-3.1-8B before and after applying OScaR.

**Figure 26.** Token norm distribution on Qwen3-VL-8B before and after applying OScaR.

**Figure 27.** Token norm distribution on Qwen3-8B before and after applying OScaR.

**Figure 28.** Token norm distribution on Qwen2.5-VL-7B before and after applying OScaR.

**Figure 29.** NIAH evaluation results. All competing methods except TurboQuant+ are configured with …

Original abstract

The rapid advancement toward long-context reasoning and multi-modal intelligence has made the memory footprint of the Key-Value (KV) cache a dominant memory bottleneck for efficient deployment. While the established per-channel quantization effectively accommodates intrinsic channel-wise outliers in Key tensors, its efficacy diminishes under extreme compression. In this work, we revisit the inherent limitations of the per-channel quantization paradigm from both empirical and theoretical perspectives. Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities. Instead of relying on intricate quantization pipelines (e.g., TurboQuant), we propose OScaR (Omni-Scaled Canalized Rotation), an accurate and lightweight KV cache compression framework for X-LLMs (i.e., text-only, multi-modal, and omni-modal LLMs). Advancing the per-channel paradigm, OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance both effectively and efficiently, further supported by our optimized system design and CUDA kernels. Extensive evaluations across X-LLMs show that OScaR consistently outperforms existing methods and achieves near-lossless performance under INT2 quantization, establishing it as a robust, low-complexity, and universal framework that defines a new Pareto front. Compared with the BF16 FlashDecoding-v2 baseline, our OScaR implementation achieves a notable up to 3.0x speedup in decoding, reduces memory footprint by 5.3x, and increases throughput by 4.1x. The code for OScaR is publicly available at https://github.com/ZunhaiSu/OScaR-KV-Quant.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OScaR identifies Token Norm Imbalance as the bottleneck for extreme KV cache quantization and proposes a lightweight fix that delivers strong practical results.

The core takeaway is that this paper identifies Token Norm Imbalance as the main culprit behind poor performance in extreme KV cache quantization and proposes OScaR to handle it through a combination of Canalized Rotation and Omni-Token Scaling. What stands out as new is the diagnosis of TNI and the specific techniques to address sequence-dimensional variance in a lightweight manner.

The paper does well in conducting extensive evaluations across different types of LLMs, from text-only to multi-modal and omni-modal, and in reporting system-level improvements such as up to 3x speedup in decoding and 5.3x memory reduction compared to the baseline. Making the code public at the GitHub link is helpful for reproducibility.

That said, there are some soft spots. The Omni-Token Scaling appears to involve per-token factors that might be determined from the data used in testing, which raises questions about whether the results are truly predictive or partly fitted. Additionally, while the abstract emphasizes near-lossless INT2 performance, more evidence is needed to confirm that other potential error sources, like those in value tensors or attention scores, are not significant under this compression level. The claim that the method is universal would benefit from clearer demonstrations that it avoids introducing sequence-level distortions without model-specific adjustments.

This work is primarily for practitioners and researchers in efficient inference and model compression for large language models. Readers dealing with long-context or multi-modal applications would likely find the practical insights and implementation details valuable. Given the potential impact on deployment costs and the structured approach to a real bottleneck, the paper deserves serious peer review to verify the experiments and strengthen the theoretical grounding. I recommend sending it for review.

Referee Report

3 major / 2 minor

Summary. The paper claims that Token Norm Imbalance (TNI) is the primary bottleneck limiting per-channel KV cache quantization at extreme compression ratios such as INT2. It proposes OScaR, which applies Canalized Rotation followed by Omni-Token Scaling to mitigate sequence-dimensional variance, and reports that this yields near-lossless performance across text-only, multi-modal, and omni-modal LLMs while delivering up to 3.0x decoding speedup, 5.3x memory reduction, and 4.1x throughput improvement over a BF16 FlashDecoding-v2 baseline. The method is positioned as a lightweight, universal framework that advances the per-channel paradigm and defines a new Pareto front.

Significance. If the central performance claims hold under rigorous verification, the work would be significant for practical deployment of long-context and multi-modal models, offering a low-complexity alternative to more intricate quantization pipelines. Public release of the code is a positive factor that supports reproducibility.

major comments (3)

[Theoretical analysis and § on error sources] The abstract and introduction assert that TNI is the dominant error source and that Canalized Rotation plus Omni-Token Scaling removes it without introducing new sequence-level distortions or requiring per-model retuning. However, the manuscript must explicitly demonstrate that other error sources (value-tensor outliers, attention-score quantization) remain negligible at INT2; without such isolation experiments the dominance claim is not yet load-bearing.
[Method description of Omni-Token Scaling] Omni-Token Scaling factors appear to be computed per-token from the input data. The manuscript should clarify whether these factors are derived from first principles or fitted on the same sequences used for final accuracy measurement; if the latter, the reported gains risk circularity and reduced generalizability across unseen models or modalities.
[Experimental evaluation section] Extensive evaluations are claimed, yet the provided details lack error bars, full ablation tables separating the contribution of Canalized Rotation from Omni-Token Scaling, and explicit checks that relative token norms and long-range dependencies remain intact. These omissions prevent independent verification of the near-lossless INT2 result.

minor comments (2)

[Preliminaries] Notation for Token Norm Imbalance and the canalized rotation matrix should be defined with explicit equations in the main text rather than deferred to appendices.
[Figures] Figure legends and axis labels in the Pareto-front plots could be enlarged for readability; current scaling makes quantitative comparison difficult.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the clarity and rigor of our claims. We respond to each major comment below and will revise the manuscript to incorporate the suggested improvements.

Referee: The abstract and introduction assert that TNI is the dominant error source and that Canalized Rotation plus Omni-Token Scaling removes it without introducing new sequence-level distortions or requiring per-model retuning. However, the manuscript must explicitly demonstrate that other error sources (value-tensor outliers, attention-score quantization) remain negligible at INT2; without such isolation experiments the dominance claim is not yet load-bearing.

Authors: Our theoretical analysis in Section 3 formally derives that TNI is the primary error driver at INT2 by showing how shared per-channel scales are forced to accommodate large norm disparities, leading to disproportionate rounding errors on high-norm tokens. The near-lossless results across models, combined with the fact that our per-channel baseline already handles channel-wise outliers, indicate that value-tensor outliers and attention-score quantization contribute negligibly once TNI is mitigated. To make this explicit as requested, we will add a dedicated subsection with isolation experiments that quantify the residual error from these other sources at INT2.

revision: yes

Referee: Omni-Token Scaling factors appear to be computed per-token from the input data. The manuscript should clarify whether these factors are derived from first principles or fitted on the same sequences used for final accuracy measurement; if the latter, the reported gains risk circularity and reduced generalizability across unseen models or modalities.

Authors: The scaling factors are derived directly from the first-principles analysis of TNI presented in Section 3: each token is scaled by the inverse of its observed norm to equalize quantization ranges. These factors are computed online and per-token from the current input activations at inference time, with no offline fitting, hyperparameter search, or use of the evaluation sequences. This ensures the procedure is input-adaptive and generalizes to unseen models and modalities without retuning. We will revise the method description and add pseudocode to state this derivation and computation process unambiguously.

revision: yes

Referee: Extensive evaluations are claimed, yet the provided details lack error bars, full ablation tables separating the contribution of Canalized Rotation from Omni-Token Scaling, and explicit checks that relative token norms and long-range dependencies remain intact. These omissions prevent independent verification of the near-lossless INT2 result.

Authors: We acknowledge that the current experimental section would benefit from greater detail for independent verification. While Section 5 already contains ablation studies comparing OScaR variants, we will expand it to include (i) error bars computed over multiple random seeds, (ii) complete tables that isolate the incremental contribution of Canalized Rotation versus Omni-Token Scaling, and (iii) additional metrics and visualizations confirming that relative token norms and long-range attention patterns are preserved after quantization. These revisions will directly address the verification concern.

revision: yes
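The simulated response to the second comment describes scaling each token by the inverse of its observed norm before quantization. A minimal sketch of that one step, in plain Python on synthetic data, might look like the following. It omits Canalized Rotation entirely, keeps the norms in full precision, and uses a simple min/max INT2 quantizer; these choices and every function name are assumptions made for illustration rather than the authors' algorithm.

```python
import math
import random


def per_channel_int2(tokens):
    """Min/max INT2 round trip with one range per channel, shared by all tokens."""
    restored = [list(t) for t in tokens]
    for c in range(len(tokens[0])):
        col = [t[c] for t in tokens]
        lo, hi = min(col), max(col)
        step = (hi - lo) / 3 or 1.0
        for i, x in enumerate(col):
            restored[i][c] = lo + round((x - lo) / step) * step
    return restored


def norm(t):
    return math.sqrt(sum(x * x for x in t))


def mean_relative_error(tokens, restored):
    errs = [norm([a - b for a, b in zip(t, r)]) / norm(t)
            for t, r in zip(tokens, restored)]
    return sum(errs) / len(errs)


def scaled_int2(tokens):
    """Hypothetical scaling step: divide each token by its L2 norm,
    quantize per channel, then multiply the stored norm back in."""
    norms = [norm(t) for t in tokens]
    unit = [[x / n for x in t] for t, n in zip(tokens, norms)]
    restored = per_channel_int2(unit)
    return [[x * n for x in r] for r, n in zip(restored, norms)]


random.seed(0)
# 32 tokens x 16 channels; every eighth token has about 30x the norm of the rest.
tokens = [[random.gauss(0, 1) * (30.0 if i % 8 == 0 else 1.0) for _ in range(16)]
          for i in range(32)]

baseline_err = mean_relative_error(tokens, per_channel_int2(tokens))
scaled_err = mean_relative_error(tokens, scaled_int2(tokens))
print(f"per-channel only:           {baseline_err:.3f}")
print(f"with inverse-norm scaling:  {scaled_err:.3f}")
```

Because the factors are computed from the tokens being quantized, nothing in this sketch is fitted offline, which matches the response's description. Storing one norm per token does add metadata that the sketch does not account for.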

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper's core argument proceeds from empirical observation of Token Norm Imbalance under per-channel quantization, introduces Canalized Rotation and Omni-Token Scaling as a lightweight mitigation, and validates the resulting compression via standard perplexity and throughput benchmarks on held-out model suites. No step equates a claimed prediction or first-principles result to its own fitted inputs by construction; scaling factors are computed deterministically from observed token norms as part of the algorithm rather than tuned to match final accuracy metrics. Self-citations, if present, are not load-bearing for the uniqueness or dominance claims, and the evaluation remains externally falsifiable on independent datasets and models. The derivation is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on the empirical identification of TNI as the main error source and on the effectiveness of the two new operations; scaling factors are expected to be data-dependent and no machine-checked proof or parameter-free derivation is mentioned.

free parameters (1)

Omni-Token Scaling factors
Per-token or per-group scale values chosen to balance norm disparities; these are fitted or computed from the input activations rather than fixed constants.

axioms (1)

domain assumption: Token Norm Imbalance is the primary bottleneck to quantization fidelity when shared parameters must cover token groups with large norm disparities.
Stated as the outcome of the paper's empirical and theoretical analysis of per-channel quantization limits.

pith-pipeline@v0.9.0 · 5902 in / 1372 out tokens · 46728 ms · 2026-05-20T07:53:57.558025+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Our analysis identifies Token Norm Imbalance (TNI) as the primary bottleneck to quantization fidelity. We demonstrate that TNI systematically amplifies errors when shared quantization parameters are required to span token groups exhibiting substantial norm disparities.

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · J_uniquely_calibrated_via_higher_derivative · unclear

OScaR employs Canalized Rotation followed by Omni-Token Scaling to mitigate TNI-induced sequence-dimensional variance

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · 22 internal anchors

[1]

Agarwal, R

K. Agarwal, R. Astra, A. Hoque, and et al. Hadacore: Tensor core accelerated hadamard transform kernel.arXiv preprint arXiv:2412.08832, 2024

work page arXiv 2024
[2]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Y . An, X. Zhao, T. Yu, and et al. Systematic outliers in large language models.arXiv preprint arXiv:2502.06415, 2025

work page arXiv 2025
[4]

Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

work page 2024
[5]

S. Bai, Y . Cai, R. Chen, and et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Longbench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

work page 2024
[7]

Bondarenko, M

Y . Bondarenko, M. Nagel, and T. Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing.Advances in Neural Information Processing Systems (NeurIPS), 36:75067–75096, 2023

work page 2023
[8]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Qwen3-Coder-Next Technical Report

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report.arXiv preprint arXiv:2603.00729, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026
[10]

D. Du, S. Cao, J. Cheng, and et al. Bitdecoding: Unlocking tensor cores for long-context llms decoding with low-bit kv cache.arXiv e-prints, 2025

work page 2025
[11]

Skvq: Sliding-window key and value cache quantization for large language models.arXiv preprint arXiv:2405.06219, 2024

Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. Skvq: Sliding-window key and value cache quantization for large language models.arXiv preprint arXiv:2405.06219, 2024

work page arXiv 2024
[12]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[13]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms.arXiv preprint arXiv:2310.01801, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[14]

T. Guo, D. Pai, Y . Bai, and et al. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms.arXiv preprint arXiv:2410.13835, 2024

work page arXiv 2024
[15]

Z. Guo, H. Kamigaito, and T. Watanabe. Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 21158–21166, 2024

work page 2024
[16]

PolarQuant: Quantizing KV caches with polar transformation.arXiv preprint arXiv:2502.02617,

Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, and Amir Zandieh. Polarquant: Quantizing kv caches with polar transformation.arXiv preprint arXiv:2502.02617, 2025

work page arXiv 2025
[17]

A survey on large language model acceleration based on kv cache management.Transactions on Machine Learning Research, 2025

LI Haoyang, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, HU Nicole, Wei Dong, Li Qing, and Lei Chen. A survey on large language model acceleration based on kv cache management.Transactions on Machine Learning Research, 2025. 10

work page 2025
[18]

Zipcache: Accurate and efficient kv cache quantization with salient token identification.Advances in Neural Information Processing Systems, 37:68287–68307, 2024

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification.Advances in Neural Information Processing Systems, 37:68287–68307, 2024

work page 2024
[19]

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jiaxing Hong, Siyu Yan, Jun Cai, et al. Worldsense: Evaluating real-world omnimodal under- standing for multimodal llms.arXiv preprint arXiv:2502.04326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[20]

Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

work page 2024
[21]

The llama 3 herd of models.preprint, 2024

Kunal Chawla Huang, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, A Lavender, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, et al. The llama 3 herd of models.preprint, 2024

work page 2024
[22]

Isoquant: Hardware-aligned so (4) isoclinic rotations for llm kv cache compres- sion.arXiv preprint arXiv:2603.28430, 2026

Zhongping Ji. Isoquant: Hardware-aligned so (4) isoclinic rotations for llm kv cache compres- sion.arXiv preprint arXiv:2603.28430, 2026

work page arXiv 2026
[23]

M. Jin, K. Mei, W. Xu, and et al. Massive values in self-attention modules are the key to contextual knowledge understanding.arXiv preprint arXiv:2502.01563, 2025

work page arXiv 2025
[24]

Llmtest_needleinahaystack

Greg Kamradt. Llmtest_needleinahaystack. GitHub, 2023

work page 2023
[25]

See what you are told: Visual attention sink in large multimodal models

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025
[26]

Kumar, Š

S. Kumar, Š. Sedláˇcek, V . Lokegaonkar, et al. Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 22688–22697, 2026

work page 2026
[27]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

A survey on large language model acceleration based on kv cache management.arXiv preprint arXiv:2412.19442, 2024

Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. A survey on large language model acceleration based on kv cache management.arXiv preprint arXiv:2412.19442, 2024

work page arXiv 2024
[29] Kunjun Li, Zigeng Chen, Cheng-Yen Yang, and Jenq-Neng Hwang. Memory-efficient visual autoregressive modeling with scale-aware kv cache compression. arXiv preprint arXiv:2505.19602, 2025

[30] Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, and Jianxin Wu. Tweo: Transformers without extreme outliers enables fp8 training and quantization for dummies. arXiv preprint arXiv:2511.23225, 2025

[31] Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration. Proceedings of machine learning and systems, 6:87–100, 2024

[32] Y. Lin, H. Tang, S. Yang, et al. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving. Proceedings of Machine Learning and Systems (MLSys), 7, 2025

[33] Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Minicache: Kv cache compression in depth dimension for large language models. Advances in Neural Information Processing Systems, 37:139997–140031, 2024

[34] H. Liu, C. Li, Y. Li, et al. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024

[35] H. Liu, C. Li, Q. Wu, et al. Visual instruction tuning. Advances in Neural Information Processing Systems (NeurIPS), 36:34892–34916, 2023

[36] Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou, Shouhua Zhang, and Jiehan Zhou. Kv cache compression for inference efficiency in llms: A review. In Proceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing, pages 207–212, 2025

[37] Yuliang Liu, Zhang Li, Ming Huang, et al. Ocrbench: on the hidden mystery of ocr in large multimodal models. Science China Information Sciences, 67(12):220102, 2024

[38] Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache. arXiv preprint arXiv:2402.02750, 2024

[39] Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. In Proceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

[40] Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization. arXiv preprint arXiv:2106.08295, 2021

[41] John D Pope. Rotorquant: Clifford algebra vector quantization for llm kv cache compression. GitHub, 2026

[42] Ziran Qin, Youru Lv, Mingbao Lin, Hang Guo, Zeren Zhang, Danping Zou, and Weiyao Lin. Head-aware kv cache compression for efficient visual autoregressive modeling. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026

[43] Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

[44] Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, and Min Zhang. Accurate kv cache quantization with outlier tokens tracing. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12895–12915, 2025

[45] Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, and Kehong Yuan. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations. arXiv preprint arXiv:2501.16383, 2025

[46] Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, and Kehong Yuan. Unveiling super experts in mixture-of-experts large language models. arXiv preprint arXiv:2507.23279, 2025

[47] Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, and Kehong Yuan. Akvq-vl: Attention-aware kv cache adaptive 2-bit quantization for vision-language models. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025

[48] Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, and Ngai Wong. Xstreamvggt: Extremely memory-efficient streaming vision geometry grounded transformer with kv cache compression. Journal of the Society for Information Display, 2026

[49] Zunhai Su and Kehong Yuan. Kvsink: Understanding and enhancing the preservation of attention sinks in kv cache quantization for llms. arXiv preprint arXiv:2508.04257, 2025

[50] Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, et al. Attention sink in transformers: A survey on utilization, interpretation, and mitigation. arXiv preprint arXiv:2604.10098, 2026

[51] Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

[52] Keda Tao, Haoxuan You, Yang Sui, Can Qin, and Huan Wang. Plug-and-play 1.x-bit kv cache quantization for video large language models. arXiv preprint arXiv:2503.16257, 2025

[53] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025

[54] M. L. C. Team, B. Wang, B. Xiao, et al. Longcat-flash-omni technical report. arXiv preprint arXiv:2511.00279, 2025

[55] M. L. C. Team, B. Xiao, C. Wang, et al. Longcat-next: Lexicalizing modalities as discrete tokens. arXiv preprint arXiv:2603.27538, 2026

[56] Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, et al. Longcat-video technical report. arXiv preprint arXiv:2510.22200, 2025

[57] Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chen Gao, Chen Zhang, Chengcheng Han, et al. Longcat-flash-thinking-2601 technical report. arXiv preprint arXiv:2601.16725, 2026

[58] Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chengcheng Han, Chenhui Yang, Chi Zhang, et al. Introducing longcat-flash-thinking: A technical report. arXiv preprint arXiv:2509.18883, 2025

[59] Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-flash technical report. arXiv preprint arXiv:2509.01322, 2025

[60] Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report. arXiv preprint arXiv:2512.07584, 2025

[61] Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, and Wanli Ouyang. Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms. International Journal of Computer Vision, 134(1):22, 2026

[62] Tom Turney and Contributors. Turboquant+. GitHub repository, May 2026. Online; accessed 2026-05-01

[63] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

[64] Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4065–4078, 2024

[65] Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, et al. Longcat-flash-prover: Advancing native formal reasoning via agentic tool-integrated reinforcement learning. arXiv preprint arXiv:2603.21065, 2026

[66] Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1648–1665, 2023

[67] Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. arXiv preprint arXiv:2505.22618, 2025

[68] G. Xiao, Y. Tian, B. Chen, et al. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

[69] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023

[70] He Xiao, Qingyao Yang, Dirui Xie, Wendong Xu, Zunhai Su, Wenyong Zhou, Haobo Liu, Zhengwu Liu, Ngai Wong, et al. Exploring layer-wise information effectiveness for post-training quantization in small language models. arXiv preprint arXiv:2508.03332, 2025

[71] Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, and Ngai Wong. Dope: Denoising rotary position embedding. arXiv preprint arXiv:2511.09146, 2025

[72] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

[73] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[74] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025

[75] Amir Zandieh, Majid Daliri, and Insu Han. Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25805–25813, 2025

[76] Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao, Wei Zhang, Ruobing Xie, et al. Beyond outliers: A data-free layer-wise mixed-precision quantization approach driven by numerical and structural dual-sensitivity. arXiv preprint arXiv:2603.17354, 2026

[77] Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, et al. Locate, steer, and improve: A practical survey of actionable mechanistic interpretability in large language models. arXiv preprint arXiv:2601.14004, 2026

[78] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025

[79] Zayd MK Zuhri, Erland Hilman Fuadi, and Alham Fikri Aji. Softpick: No attention sink, no massive activations with rectified softmax. arXiv preprint arXiv:2504.20966, 2025



[1] [1]

Agarwal, R

K. Agarwal, R. Astra, A. Hoque, and et al. Hadacore: Tensor core accelerated hadamard transform kernel.arXiv preprint arXiv:2412.08832, 2024

work page arXiv 2024

[2] [2]

gpt-oss-120b & gpt-oss-20b Model Card

Sandhini Agarwal, Lama Ahmad, Jason Ai, Sam Altman, Andy Applebaum, Edwin Arbus, Rahul K Arora, Yu Bai, Bowen Baker, Haiming Bao, et al. gpt-oss-120b & gpt-oss-20b model card.arXiv preprint arXiv:2508.10925, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[3] [3]

Y . An, X. Zhao, T. Yu, and et al. Systematic outliers in large language models.arXiv preprint arXiv:2502.06415, 2025

work page arXiv 2025

[4] [4]

Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

Saleh Ashkboos, Amirkeivan Mohtashami, Maximilian L Croci, Bo Li, Pashmina Cameron, Martin Jaggi, Dan Alistarh, Torsten Hoefler, and James Hensman. Quarot: Outlier-free 4-bit inference in rotated llms.Advances in Neural Information Processing Systems, 37:100213– 100240, 2024

work page 2024

[5] [5]

S. Bai, Y . Cai, R. Chen, and et al. Qwen3-vl technical report.arXiv preprint arXiv:2511.21631, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Longbench: A bilingual, multitask benchmark for long context understanding

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

work page 2024

[7] [7]

Bondarenko, M

Y . Bondarenko, M. Nagel, and T. Blankevoort. Quantizable transformers: Removing outliers by helping attention heads do nothing.Advances in Neural Information Processing Systems (NeurIPS), 36:75067–75096, 2023

work page 2023

[8] [8]

PyramidKV: Dynamic KV Cache Compression based on Pyramidal Information Funneling

Zefan Cai, Yichi Zhang, Bofei Gao, Yuliang Liu, Yucheng Li, Tianyu Liu, Keming Lu, Wayne Xiong, Yue Dong, Junjie Hu, et al. Pyramidkv: Dynamic kv cache compression based on pyramidal information funneling.arXiv preprint arXiv:2406.02069, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Qwen3-Coder-Next Technical Report

Ruisheng Cao, Mouxiang Chen, Jiawei Chen, Zeyu Cui, Yunlong Feng, Binyuan Hui, Yuheng Jing, Kaixin Li, Mingze Li, Junyang Lin, et al. Qwen3-coder-next technical report.arXiv preprint arXiv:2603.00729, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[10] [10]

D. Du, S. Cao, J. Cheng, and et al. Bitdecoding: Unlocking tensor cores for long-context llms decoding with low-bit kv cache.arXiv e-prints, 2025

work page 2025

[11] [11]

Skvq: Sliding-window key and value cache quantization for large language models.arXiv preprint arXiv:2405.06219, 2024

Haojie Duanmu, Zhihang Yuan, Xiuhong Li, Jiangfei Duan, Xingcheng Zhang, and Dahua Lin. Skvq: Sliding-window key and value cache quantization for large language models.arXiv preprint arXiv:2405.06219, 2024

work page arXiv 2024

[12] [12]

GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers

Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. Gptq: Accurate post-training quantization for generative pre-trained transformers.arXiv preprint arXiv:2210.17323, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[13] [13]

Model Tells You What to Discard: Adaptive KV Cache Compression for LLMs

Suyu Ge, Yunan Zhang, Liyuan Liu, Minjia Zhang, Jiawei Han, and Jianfeng Gao. Model tells you what to discard: Adaptive kv cache compression for llms.arXiv preprint arXiv:2310.01801, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[14] [14]

T. Guo, D. Pai, Y . Bai, and et al. Active-dormant attention heads: Mechanistically demystifying extreme-token phenomena in llms.arXiv preprint arXiv:2410.13835, 2024

work page arXiv 2024

[15] [15]

Z. Guo, H. Kamigaito, and T. Watanabe. Attention score is not all you need for token importance indicator in kv cache reduction: Value also matters. InProceedings of the 2024 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 21158–21166, 2024

work page 2024

[16] [16]

PolarQuant: Quantizing KV caches with polar transformation.arXiv preprint arXiv:2502.02617,

Insu Han, Praneeth Kacham, Amin Karbasi, Vahab Mirrokni, and Amir Zandieh. Polarquant: Quantizing kv caches with polar transformation.arXiv preprint arXiv:2502.02617, 2025

work page arXiv 2025

[17] [17]

A survey on large language model acceleration based on kv cache management.Transactions on Machine Learning Research, 2025

LI Haoyang, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, HU Nicole, Wei Dong, Li Qing, and Lei Chen. A survey on large language model acceleration based on kv cache management.Transactions on Machine Learning Research, 2025. 10

work page 2025

[18] [18]

Zipcache: Accurate and efficient kv cache quantization with salient token identification.Advances in Neural Information Processing Systems, 37:68287–68307, 2024

Yefei He, Luoming Zhang, Weijia Wu, Jing Liu, Hong Zhou, and Bohan Zhuang. Zipcache: Accurate and efficient kv cache quantization with salient token identification.Advances in Neural Information Processing Systems, 37:68287–68307, 2024

work page 2024

[19] [19]

WorldSense: Evaluating Real-world Omnimodal Understanding for Multimodal LLMs

Jiaxing Hong, Siyu Yan, Jun Cai, et al. Worldsense: Evaluating real-world omnimodal under- standing for multimodal llms.arXiv preprint arXiv:2502.04326, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[20] [20]

Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

Coleman Hooper, Sehoon Kim, Hiva Mohammadzadeh, Michael W Mahoney, Yakun S Shao, Kurt Keutzer, and Amir Gholami. Kvquant: Towards 10 million context length llm inference with kv cache quantization.Advances in Neural Information Processing Systems, 37:1270–1303, 2024

work page 2024

[21] [21]

The llama 3 herd of models.preprint, 2024

Kunal Chawla Huang, Kushal Lakhotia, Kyle Huang, Lailin Chen, Lakshya Garg, A Lavender, Leandro Silva, Lee Bell, Lei Zhang, Liangpeng Guo, et al. The llama 3 herd of models.preprint, 2024

work page 2024

[22] [22]

Isoquant: Hardware-aligned so (4) isoclinic rotations for llm kv cache compres- sion.arXiv preprint arXiv:2603.28430, 2026

Zhongping Ji. Isoquant: Hardware-aligned so (4) isoclinic rotations for llm kv cache compres- sion.arXiv preprint arXiv:2603.28430, 2026

work page arXiv 2026

[23] [23]

M. Jin, K. Mei, W. Xu, and et al. Massive values in self-attention modules are the key to contextual knowledge understanding.arXiv preprint arXiv:2502.01563, 2025

work page arXiv 2025

[24] [24]

Llmtest_needleinahaystack

Greg Kamradt. Llmtest_needleinahaystack. GitHub, 2023

work page 2023

[25] [25]

See what you are told: Visual attention sink in large multimodal models

Seil Kang, Jinyeong Kim, Junhyeok Kim, and Seong Jae Hwang. See what you are told: Visual attention sink in large multimodal models. InThe Thirteenth International Conference on Learning Representations, 2025

work page 2025

[26] [26]

Kumar, Š

S. Kumar, Š. Sedláˇcek, V . Lokegaonkar, et al. Mmau-pro: A challenging and comprehensive benchmark for holistic evaluation of audio general intelligence. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 22688–22697, 2026

work page 2026

[27] [27]

LLaVA-OneVision: Easy Visual Task Transfer

Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer.arXiv preprint arXiv:2408.03326, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[28] [28]

A survey on large language model acceleration based on kv cache management.arXiv preprint arXiv:2412.19442, 2024

Haoyang Li, Yiming Li, Anxin Tian, Tianhao Tang, Zhanchao Xu, Xuejia Chen, Nicole Hu, Wei Dong, Qing Li, and Lei Chen. A survey on large language model acceleration based on kv cache management.arXiv preprint arXiv:2412.19442, 2024

work page arXiv 2024

[29] [29]

Memory-efficient visual au- toregressive modeling with scale-aware kv cache compression.arXiv preprint arXiv:2505.19602, 2025

Kunjun Li, Zigeng Chen, Cheng-Yen Yang, and Jenq-Neng Hwang. Memory-efficient visual au- toregressive modeling with scale-aware kv cache compression.arXiv preprint arXiv:2505.19602, 2025

work page arXiv 2025

[30] [30]

Tweo: Transformers without extreme outliers enables fp8 training and quantization for dummies.arXiv preprint arXiv:2511.23225, 2025

Guang Liang, Jie Shao, Ningyuan Tang, Xinyao Liu, and Jianxin Wu. Tweo: Transformers without extreme outliers enables fp8 training and quantization for dummies.arXiv preprint arXiv:2511.23225, 2025

work page arXiv 2025

[31] [31]

Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

Ji Lin, Jiaming Tang, Haotian Tang, Shang Yang, Wei-Ming Chen, Wei-Chen Wang, Guangxuan Xiao, Xingyu Dang, Chuang Gan, and Song Han. Awq: Activation-aware weight quantization for on-device llm compression and acceleration.Proceedings of machine learning and systems, 6:87–100, 2024

work page 2024

[32] [32]

Y . Lin, H. Tang, S. Yang, and et al. Qserve: W4a8kv4 quantization and system co-design for efficient llm serving.Proceedings of Machine Learning and Systems (MLSys), 7, 2025

work page 2025

[33] [33]

Minicache: Kv cache compression in depth dimension for large language models.Advances in Neural Information Processing Systems, 37:139997–140031, 2024

Akide Liu, Jing Liu, Zizheng Pan, Yefei He, Gholamreza Haffari, and Bohan Zhuang. Minicache: Kv cache compression in depth dimension for large language models.Advances in Neural Information Processing Systems, 37:139997–140031, 2024

work page 2024

[34] [34]

H. Liu, C. Li, Y . Li, and et al. Improved baselines with visual instruction tuning. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 26296–26306, 2024. 11

work page 2024

[35] [35]

H. Liu, C. Li, Q. Wu, and et al. Visual instruction tuning.Advances in Neural Information Processing Systems (NeurIPS), 36:34892–34916, 2023

work page 2023

[36] [36]

Kv cache compression for inference efficiency in llms: A review

Yanyu Liu, Jingying Fu, Sixiang Liu, Yitian Zou, Shouhua Zhang, and Jiehan Zhou. Kv cache compression for inference efficiency in llms: A review. InProceedings of the 4th International Conference on Artificial Intelligence and Intelligent Information Processing, pages 207–212, 2025

work page 2025

[37] [37]

Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

Yuliang Liu, Zhang Li, Ming Huang, and et al. Ocrbench: on the hidden mystery of ocr in large multimodal models.Science China Information Sciences, 67(12):220102, 2024

work page 2024

[38] [38]

KIVI: A Tuning-Free Asymmetric 2bit Quantization for KV Cache

Zirui Liu, Jiayi Yuan, Hongye Jin, Shaochen Zhong, Zhaozhuo Xu, Vladimir Braverman, Beidi Chen, and Xia Hu. Kivi: A tuning-free asymmetric 2bit quantization for kv cache.arXiv preprint arXiv:2402.02750, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[39] [39]

Docvqa: A dataset for vqa on document images

Minesh Mathew, Dimosthenis Karatzas, and CV Jawahar. Docvqa: A dataset for vqa on document images. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 2200–2209, 2021

work page 2021

[40] [40]

A White Paper on Neural Network Quantization

Markus Nagel, Marios Fournarakis, Rana Ali Amjad, Yelysei Bondarenko, Mart Van Baalen, and Tijmen Blankevoort. A white paper on neural network quantization.arXiv preprint arXiv:2106.08295, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[41] [41]

Rotorquant: Clifford algebra vector quantization for llm kv cache compression

John D Pope. Rotorquant: Clifford algebra vector quantization for llm kv cache compression. github, 2026

work page 2026

[42] [42]

Head-aware kv cache compression for efficient visual autoregressive modeling

Ziran Qin, Youru Lv, Mingbao Lin, Hang Guo, Zeren Zhang, Danping Zou, and Weiyao Lin. Head-aware kv cache compression for efficient visual autoregressive modeling. InProceedings of the AAAI Conference on Artificial Intelligence, 2026

work page 2026

[43] [43]

Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free

Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

work page 2025

[44] [44]

Accurate kv cache quantization with outlier tokens tracing

Yi Su, Yuechi Zhou, Quantong Qiu, Juntao Li, Qingrong Xia, Ping Li, Xinyu Duan, Zhefeng Wang, and Min Zhang. Accurate kv cache quantization with outlier tokens tracing. InProceed- ings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12895–12915, 2025

work page 2025

[45] [45]

RotateKV: Accurate and robust 2-bit KV cache quantization for LLMs via outlier-aware adaptive rotations.arXiv preprint arXiv:2501.16383, 2025

Zunhai Su, Zhe Chen, Wang Shen, Hanyu Wei, Linge Li, Huangqi Yu, and Kehong Yuan. Rotatekv: Accurate and robust 2-bit kv cache quantization for llms via outlier-aware adaptive rotations.arXiv preprint arXiv:2501.16383, 2025

work page arXiv 2025

[46] [46]

Unveiling super experts in mixture-of-experts large language models

Zunhai Su, Qingyuan Li, Hao Zhang, Weihao Ye, Qibo Xue, YuLei Qian, Yuchen Xie, Ngai Wong, and Kehong Yuan. Unveiling super experts in mixture-of-experts large language models. arXiv preprint arXiv:2507.23279, 2025

work page arXiv 2025

[47] [47]

Akvq-vl: Attention-aware kv cache adaptive 2-bit quantization for vision-language models

Zunhai Su, Wang Shen, Linge Li, Zhe Chen, Hanyu Wei, Huangqi Yu, and Kehong Yuan. Akvq-vl: Attention-aware kv cache adaptive 2-bit quantization for vision-language models. In 2025 IEEE International Conference on Multimedia and Expo (ICME), pages 1–6. IEEE, 2025

work page 2025

[48] [48]

Xstreamvggt: Extremely memory-efficient streaming vision geometry grounded transformer with kv cache compression.Journal of the Society for Information Display, 2026

Zunhai Su, Weihao Ye, Hansen Feng, Keyu Fan, Jing Zhang, Dahai Yu, Zhengwu Liu, and Ngai Wong. Xstreamvggt: Extremely memory-efficient streaming vision geometry grounded transformer with kv cache compression.Journal of the Society for Information Display, 2026

work page 2026

[49] Zunhai Su and Kehong Yuan. Kvsink: Understanding and enhancing the preservation of attention sinks in kv cache quantization for llms. arXiv preprint arXiv:2508.04257, 2025

[50] Zunhai Su, Hengyuan Zhang, Wei Wu, Yifan Zhang, Yaxiu Liu, He Xiao, Qingyao Yang, Yuxuan Sun, Rui Yang, Chao Zhang, et al. Attention sink in transformers: A survey on utilization, interpretation, and mitigation. arXiv preprint arXiv:2604.10098, 2026

[51] Mingjie Sun, Xinlei Chen, J Zico Kolter, and Zhuang Liu. Massive activations in large language models. arXiv preprint arXiv:2402.17762, 2024

[52] Keda Tao, Haoxuan You, Yang Sui, Can Qin, and Huan Wang. Plug-and-play 1.x-bit kv cache quantization for video large language models. arXiv preprint arXiv:2503.16257, 2025

[53] Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025

[54] M. L. C. Team, B. Wang, B. Xiao, et al. Longcat-flash-omni technical report. arXiv preprint arXiv:2511.00279, 2025

[55] M. L. C. Team, B. Xiao, C. Wang, et al. Longcat-next: Lexicalizing modalities as discrete tokens. arXiv preprint arXiv:2603.27538, 2026

[56] Meituan LongCat Team, Xunliang Cai, Qilong Huang, Zhuoliang Kang, Hongyu Li, Shijun Liang, Liya Ma, Siyu Ren, Xiaoming Wei, Rixu Xie, et al. Longcat-video technical report. arXiv preprint arXiv:2510.22200, 2025

[57] Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chen Gao, Chen Zhang, Chengcheng Han, et al. Longcat-flash-thinking-2601 technical report. arXiv preprint arXiv:2601.16725, 2026

[58] Meituan LongCat Team, Anchun Gui, Bei Li, Bingyang Tao, Bole Zhou, Borun Chen, Chao Zhang, Chengcheng Han, Chenhui Yang, Chi Zhang, et al. Introducing longcat-flash-thinking: A technical report. arXiv preprint arXiv:2509.18883, 2025

[59] Meituan LongCat Team, Bei Li, Bingye Lei, Bo Wang, Bolin Rong, Chao Wang, Chao Zhang, Chen Gao, Chen Zhang, Cheng Sun, et al. Longcat-flash technical report. arXiv preprint arXiv:2509.01322, 2025

[60] Meituan LongCat Team, Hanghang Ma, Haoxian Tan, Jiale Huang, Junqiang Wu, Jun-Yan He, Lishuai Gao, Songlin Xiao, Xiaoming Wei, Xiaoqi Ma, et al. Longcat-image technical report. arXiv preprint arXiv:2512.07584, 2025

[61] Chongjun Tu, Peng Ye, Dongzhan Zhou, Lei Bai, Gang Yu, Tao Chen, and Wanli Ouyang. Attention reallocation: Towards zero-cost and controllable hallucination mitigation of mllms. International Journal of Computer Vision, 134(1):22, 2026

[62] Tom Turney and Contributors. Turboquant+. GitHub repository, May 2026. Online; accessed 2026-05-01

[63] A. Vaswani, N. Shazeer, N. Parmar, et al. Attention is all you need. Advances in Neural Information Processing Systems (NeurIPS), 30, 2017

[64] Zhongwei Wan, Ziang Wu, Che Liu, Jinfa Huang, Zhihong Zhu, Peng Jin, Longyue Wang, and Li Yuan. Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference. In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 4065–4078, 2024

[65] Jianing Wang, Jianfei Zhang, Qi Guo, Linsen Guo, Rumei Li, Chao Zhang, Chong Peng, Cunguang Wang, Dengchang Zhao, Jiarong Shi, et al. Longcat-flash-prover: Advancing native formal reasoning via agentic tool-integrated reinforcement learning. arXiv preprint arXiv:2603.21065, 2026

[66] Xiuying Wei, Yunchen Zhang, Yuhang Li, Xiangguo Zhang, Ruihao Gong, Jinyang Guo, and Xianglong Liu. Outlier suppression+: Accurate quantization of large language models by equivalent and effective shifting and scaling. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 1648–1665, 2023

[67] Chengyue Wu, Hao Zhang, Shuchen Xue, Zhijian Liu, Shizhe Diao, Ligeng Zhu, Ping Luo, Song Han, and Enze Xie. Fast-dllm: Training-free acceleration of diffusion llm by enabling kv cache and parallel decoding. arXiv preprint arXiv:2505.22618, 2025

[68] G. Xiao, Y. Tian, B. Chen, et al. Efficient streaming language models with attention sinks. arXiv preprint arXiv:2309.17453, 2023

[69] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. Smoothquant: Accurate and efficient post-training quantization for large language models. In International conference on machine learning, pages 38087–38099. PMLR, 2023

[70] He Xiao, Qingyao Yang, Dirui Xie, Wendong Xu, Zunhai Su, Wenyong Zhou, Haobo Liu, Zhengwu Liu, Ngai Wong, et al. Exploring layer-wise information effectiveness for post-training quantization in small language models. arXiv preprint arXiv:2508.03332, 2025

[71] Jing Xiong, Liyang Fan, Hui Shen, Zunhai Su, Min Yang, Lingpeng Kong, and Ngai Wong. Dope: Denoising rotary position embedding. arXiv preprint arXiv:2511.09146, 2025

[72] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025

[73] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025

[74] Amir Zandieh, Majid Daliri, Majid Hadian, and Vahab Mirrokni. Turboquant: Online vector quantization with near-optimal distortion rate. arXiv preprint arXiv:2504.19874, 2025

[75] Amir Zandieh, Majid Daliri, and Insu Han. Qjl: 1-bit quantized jl transform for kv cache quantization with zero overhead. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 25805–25813, 2025

[76] Hengyuan Zhang, Xinrong Chen, Zunhai Su, Xiao Liang, Jing Xiong, Wendong Xu, He Xiao, Chaofan Tao, Wei Zhang, Ruobing Xie, et al. Beyond outliers: A data-free layer-wise mixed-precision quantization approach driven by numerical and structural dual-sensitivity. arXiv preprint arXiv:2603.17354, 2026

[77] Hengyuan Zhang, Zhihao Zhang, Mingyang Wang, Zunhai Su, Yiwei Wang, Qianli Wang, Shuzhou Yuan, Ercong Nie, Xufeng Duan, Qibo Xue, et al. Locate, steer, and improve: A practical survey of actionable mechanistic interpretability in large language models. arXiv preprint arXiv:2601.14004, 2026

[78] Dong Zhuo, Wenzhao Zheng, Jiahe Guo, Yuqi Wu, Jie Zhou, and Jiwen Lu. Streaming 4d visual geometry transformer. arXiv preprint arXiv:2507.11539, 2025

[79] Zayd MK Zuhri, Erland Hilman Fuadi, and Alham Fikri Aji. Softpick: No attention sink, no massive activations with rectified softmax. arXiv preprint arXiv:2504.20966, 2025

[80] Hadamard rotation and token-wise normalization for keys, building upon HadaCore’s efficient transform primitive