HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

Hang Guo; Mingbao Lin; Weiyao Lin; Wen Fei; Youru Lv; Yuchen Jiang; Ziran Qin

arxiv: 2606.08302 · v1 · pith:XUCYXJONnew · submitted 2026-06-06 · 💻 cs.CV

HACK++: Towards More Effective Head-Aware Key-Value Compression for Efficient Visual Autoregressive Modeling

Ziran Qin , Yuchen Jiang , Mingbao Lin , Youru Lv , Hang Guo , Wen Fei , Weiyao Lin This is my paper

Pith reviewed 2026-06-27 19:45 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual autoregressive modelsKV cache compressionhead-aware compressionnext-scale predictionattention budget allocationtraining-free methodefficient inference

0 comments

The pith

HACK++ classifies VAR attention heads into contextual and structural types to apply separate compression budgets that cut cache use to 10% while keeping generation near lossless.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual autoregressive models generate images scale by scale yet accumulate large key-value caches that dominate memory use. The paper finds that attention heads divide into two stable groups: contextual heads that track semantic consistency and structural heads that track spatial layout, each needing different amounts of past-scale information that also change by layer and step. HACK++ runs one offline calibration to label the heads, then decouples the cost of computing attention on the current scale from the compression of the accumulated cache, applying head-specific strategies and adaptive budgets. This yields near-lossless output on models such as Infinity-2B and 8B at 30% attention budget and 10% cache budget, and the method works across text-to-image, class-conditional, and unified tasks without any retraining. Readers care because the approach removes a central scaling barrier for these efficient next-scale generators.

Core claim

HACK++ is a training-free framework that first partitions VAR attention heads via offline calibration into Contextual Heads focused on semantic consistency and Structural Heads focused on spatial coherence; these types exhibit distinct and shifting reliance on historical scales across layers and steps. At inference the method applies independent budgets to bound current-scale attention cost while compressing the accumulated KV cache far more aggressively through pattern-specific strategies and reliance-aware per-head allocation, delivering near-lossless generation quality at 30% attention budget and 10% cache budget on Infinity-2B/8B models.

What carries the argument

Head-type classification into Contextual and Structural categories with decoupled attention and cache budgets plus reliance-aware allocation.

If this is right

VAR models can run with 90% less KV cache memory while preserving generation quality.
The same head-aware split supports even 1% cache budgets without collapse.
No retraining is required, so the compression applies directly to existing VAR checkpoints.
Attention cost stays bounded independently of how aggressively the historical cache is reduced.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The offline head classification could transfer to other next-scale or next-token autoregressive vision models that accumulate caches.
If reliance patterns drift more than expected, an online update to the head labels might yield still-higher compression ratios.
Reduced memory footprint could enable higher-resolution outputs or additional scales on the same hardware.

Load-bearing premise

Attention heads in VAR models can be stably partitioned into two functional categories by a single offline calibration whose labels and reliance patterns remain valid across layers and generation steps.

What would settle it

Run HACK++ on Infinity-8B at 10% cache budget and compare FID scores or visual coherence against the full-cache baseline; a large quality drop would show the near-lossless claim does not hold.

Figures

Figures reproduced from arXiv: 2606.08302 by Hang Guo, Mingbao Lin, Weiyao Lin, Wen Fei, Youru Lv, Yuchen Jiang, Ziran Qin.

**Figure 2.** Figure 2: Attention Patterns of Contextual and Structural Heads. Both Contextual and Structural heads exhibit consistent vertical and multi-diagonal patterns, respectively, across different samples and scales. practical deployment. We address these challenges via VARspecific KV cache compression that reduces resource cost while preserving generation quality. KV Cache Compression. KV cache compression has been exten… view at source ↗

**Figure 3.** Figure 3: Impact of selective head masking (15% of total heads for each type on VAR-d30, 30% for Infinity-8B). Compared with the original generation (a), masking contextual heads (b) disrupts semantic grounding, losing fine-grained content consistency with the conditioning, whereas masking structural heads (c) retains the coarse semantic direction but causes geometric deformation and degraded spatial coherence. … view at source ↗

**Figure 4.** Figure 4: Attention and cache require stage-decoupled compression. (a) Scale-normalized attention entropy decreases [PITH_FULL_IMAGE:figures/full_fig_p005_4.png] view at source ↗

**Figure 5.** Figure 5: Heterogeneous reliance on historical scales across head types, layers, and generation steps, motivating a reliance [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

**Figure 6.** Figure 6: Overview of the HACK++ framework. (a) Offline Calibration: On a small calibration set, HACK++ computes perhead, per-scale mean attention distributions to derive scale priors and historical reliance scores (Sec. V-A1) and classifies all heads into contextual heads and structural heads via attention variance ranking (Sec. V-A2). (b) Online Inference: At each scale k where the accumulated KV length exceeds t… view at source ↗

**Figure 7.** Figure 7: Comparison of mean attention distributions under full [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗

**Figure 8.** Figure 8: Qualitative results of text-to-image generation by HACK++ on Infinity-8B [21]. Rows show, from top to bottom, the [PITH_FULL_IMAGE:figures/full_fig_p011_8.png] view at source ↗

**Figure 9.** Figure 9: Zoom-in comparison of HACK++ ( [PITH_FULL_IMAGE:figures/full_fig_p011_9.png] view at source ↗

**Figure 10.** Figure 10: Qualitative comparison of class-conditional image [PITH_FULL_IMAGE:figures/full_fig_p012_10.png] view at source ↗

**Figure 12.** Figure 12: Efficiency profiling of average latency on attention [PITH_FULL_IMAGE:figures/full_fig_p013_12.png] view at source ↗

**Figure 13.** Figure 13: Throughput and memory comparison between the [PITH_FULL_IMAGE:figures/full_fig_p013_13.png] view at source ↗

**Figure 14.** Figure 14: , HACK++ remains stable across the entire range of both hyperparameters and consistently surpasses the HACK baseline on both metrics and both models, while operating under a substantially tighter cache budget. The behavior of the two hyperparameters differs in nature: α is set per model according to the attention variance distribution and stays TABLE X: Ablation on decoupled compression on Infinity2B. We… view at source ↗

**Figure 15.** Figure 15: Head classification results of Infinity-8B and [PITH_FULL_IMAGE:figures/full_fig_p015_15.png] view at source ↗

read the original abstract

Visual Autoregressive (VAR) models adopt a next-scale prediction paradigm, offering high-quality generation with substantially fewer decoding steps. However, existing VAR models suffer from significant attention complexity and severe memory overhead due to the accumulation of key-value (KV) caches across scales. In this paper, we tackle this challenge by introducing KV cache compression into the next-scale paradigm. We begin with an in-depth analysis of VAR attention and observe that attention heads can be stably divided into two functionally distinct categories: Contextual Heads focus on maintaining semantic consistency, while Structural Heads preserve spatial coherence. Their functional divergence makes existing one-size-fits-all compression methods perform poorly on VAR models. We further find that the two head types differ markedly in their reliance on historical scales, and that this reliance shifts across layers and generation steps, arguing for an adaptive cache budget allocation. To address these challenges, we propose HACK++, a training-free Head-Aware key-value Compression frameworK for VAR models. From a one-time offline calibration, HACK++ classifies head types and derives head-specific priors. At inference, it decouples attention from cache compression under independent budgets, bounding the current-scale attention cost while compressing the accumulated cache far more aggressively, via pattern-specific strategies and a reliance-aware budget allocation. Extensive experiments on multiple VAR models across text-to-image, class-conditional, and unified understanding-and-generation tasks validate the effectiveness and generalizability of HACK++. For example, on Infinity-2B/8B, HACK++ maintains near-lossless generation with only a 30% attention budget and a 10% cache budget, and remains robust even under a 1% cache budget.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

HACK++ shows head-type partitioning plus decoupled budgets can cut VAR cache to 10% or 1% with little quality loss, but the offline classification step needs stronger robustness checks.

read the letter

The paper's core move is to split VAR attention heads into contextual and structural categories via one-time offline calibration, then apply different compression rules and reliance-aware budget shifts for historical scales. This lets them run current-scale attention at 30% budget while compressing the growing cache far more aggressively than uniform methods allow.

They do a clean job of showing why one-size-fits-all KV compression fails here: the two head types have measurably different dependence on past scales, and that dependence changes across layers and steps. The training-free design and the explicit decoupling of attention cost from cache size are practical. The abstract reports results on Infinity-2B/8B plus other models across text-to-image, class-conditional, and unified tasks, which at least suggests the pattern is not model-specific.

The main soft spot is exactly the one the stress-test flags. A fixed offline partition only works if the contextual/structural roles stay stable across inputs and generation steps. If they drift, the adaptive allocation could starve the wrong heads. The paper claims robustness down to 1% cache, so the experiments need to include clear sensitivity checks on the calibration step and cross-prompt validation; without those, the quality guarantee rests on an untested assumption.

This is for groups already running or optimizing VAR-style models who care about inference memory. The idea is concrete enough and the bottleneck real enough that it deserves referee time, even if the current evidence is mostly in the abstract.

Referee Report

2 major / 1 minor

Summary. The paper presents HACK++, a training-free Head-Aware Key-Value Compression framework for Visual Autoregressive (VAR) models. It analyzes attention heads, classifying them into Contextual Heads (semantic consistency) and Structural Heads (spatial coherence) through one-time offline calibration, notes differences in reliance on historical scales that shift across layers and steps, and proposes decoupled compression with independent budgets for current-scale attention and accumulated cache using pattern-specific strategies and reliance-aware allocation. Experiments claim near-lossless generation on models like Infinity-2B/8B with 30% attention budget and 10% (or 1%) cache budget across various tasks.

Significance. If the head classification remains stable and the adaptive allocation holds, this method could substantially improve the efficiency of VAR models by mitigating KV cache memory overhead and attention complexity without retraining, potentially enabling larger models or real-time applications in text-to-image generation and unified tasks. The training-free nature and extreme compression robustness are positive aspects.

major comments (2)

[Abstract] The assertion that attention heads 'can be stably divided' into two categories via one-time offline calibration is load-bearing for the entire approach, yet no details on validation of this stability across different inputs, prompts, or generation steps are provided; this directly impacts whether the decoupled budgets can reliably maintain near-lossless quality at the claimed 30% attention / 10% cache budgets on Infinity-2B/8B.
[Head Classification and Reliance Analysis] The method relies on the functional divergence and predictable reliance shifts being consistent enough for adaptive per-head budget allocation; without reported cross-validation or sensitivity analysis of the calibration step, the claims of robustness even under a 1% cache budget risk being overstated if Contextual/Structural roles vary with input or step.

minor comments (1)

[Experiments] The abstract mentions extensive experiments on multiple VAR models and tasks, but specific metrics, baseline comparisons, and controls for post-hoc tuning should be clearly tabulated for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and constructive feedback. We address the two major comments point-by-point below, agreeing that additional validation details would strengthen the manuscript.

read point-by-point responses

Referee: [Abstract] The assertion that attention heads 'can be stably divided' into two categories via one-time offline calibration is load-bearing for the entire approach, yet no details on validation of this stability across different inputs, prompts, or generation steps are provided; this directly impacts whether the decoupled budgets can reliably maintain near-lossless quality at the claimed 30% attention / 10% cache budgets on Infinity-2B/8B.

Authors: We agree that explicit validation of classification stability is important for supporting the load-bearing claim. The manuscript derives the Contextual/Structural division from offline calibration on a representative prompt set and reports consistent near-lossless results across text-to-image, class-conditional, and unified tasks on Infinity-2B/8B. To directly address the concern, the revised version will add a new subsection (and appendix) reporting head-type consistency metrics across varied prompts, generation steps, and calibration-set perturbations. revision: yes
Referee: [Head Classification and Reliance Analysis] The method relies on the functional divergence and predictable reliance shifts being consistent enough for adaptive per-head budget allocation; without reported cross-validation or sensitivity analysis of the calibration step, the claims of robustness even under a 1% cache budget risk being overstated if Contextual/Structural roles vary with input or step.

Authors: We concur that cross-validation and sensitivity analysis of the calibration would better support the adaptive allocation and 1% cache-budget claims. The current experiments demonstrate robustness across models and tasks, but we will add explicit sensitivity results on reliance shifts and head-role stability in the revision to mitigate the risk of overstatement. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with independent calibration and validation

full rationale

The paper's core contribution is a training-free compression scheme whose head classification and budget allocation derive from a one-time offline analysis of attention patterns, not from any equation or fit that tautologically reproduces the reported performance metrics. No self-definitional loops, fitted-input predictions, or load-bearing self-citations appear in the provided text; the near-lossless claims rest on external experimental measurements on Infinity-2B/8B rather than on quantities defined in terms of themselves. The derivation chain therefore remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Central claim rests on the domain assumption that head categories are stable and that reliance patterns are sufficiently consistent to permit offline-derived priors and online adaptive allocation.

axioms (1)

domain assumption Attention heads in VAR models divide stably into Contextual Heads (semantic consistency) and Structural Heads (spatial coherence) whose reliance on historical scales differs and shifts across layers and steps.
Stated as the result of in-depth analysis of VAR attention; used to justify head-specific strategies and adaptive budgets.

pith-pipeline@v0.9.1-grok · 5855 in / 1201 out tokens · 25784 ms · 2026-06-27T19:45:00.609236+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

69 extracted references · 16 linked inside Pith

[1]

Mars: Mixture of auto-regressive models for fine- grained text-to-image synthesis,

W. He, S. Fu, M. Liu, X. Wang, W. Xiao, F. Shu, Y . Wang, L. Zhang, Z. Yu, H. Liet al., “Mars: Mixture of auto-regressive models for fine- grained text-to-image synthesis,”arXiv preprint arXiv:2407.07614, 2024

arXiv 2024
[2]

Autoregressive model beats diffusion: Llama for scalable image generation,

P. Sun, Y . Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan, “Autoregressive model beats diffusion: Llama for scalable image generation,”arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024
[3]

Autoregressive image gen- eration without vector quantization,

T. Li, Y . Tian, H. Li, M. Deng, and K. He, “Autoregressive image gen- eration without vector quantization,”Advances in Neural Information Processing Systems, vol. 37, pp. 56 424–56 445, 2024

2024
[4]

Janus-pro: Unified multimodal understanding and generation with data and model scaling,

X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-pro: Unified multimodal understanding and generation with data and model scaling,”arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025
[5]

Janus: Decoupling visual encoding for unified multimodal understanding and generation,

C. Wu, X. Chen, Z. Wu, Y . Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruanet al., “Janus: Decoupling visual encoding for unified multimodal understanding and generation,”arXiv preprint arXiv:2410.13848, 2024

Pith/arXiv arXiv 2024
[6]

Chameleon: Mixed-modal early-fusion foundation models,

C. Team, “Chameleon: Mixed-modal early-fusion foundation models,”
[7]

Available: https://arxiv.org/abs/2405.09818

[Online]. Available: https://arxiv.org/abs/2405.09818

Pith/arXiv arXiv
[8]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in neural information processing systems, 2020, pp. 6840–6851

2020
[9]

Sdxl: Improving latent diffusion models for high-resolution image synthesis,

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,”arXiv preprint arXiv:2307.01952, 2023

Pith/arXiv arXiv 2023
[10]

Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205

2023
[11]

Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boeselet al., “Scaling rectified flow transformers for high-resolution image synthesis,” inForty-first international conference on machine learning, 2024

2024
[12]

Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation,

J. Chen, C. Ge, E. Xie, Y . Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li, “Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation,” inEuropean Conference on Computer Vision, 2024, pp. 74–91

2024
[13]

Visual autoregres- sive modeling: Scalable image generation via next-scale prediction,

K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autoregres- sive modeling: Scalable image generation via next-scale prediction,” Advances in neural information processing systems, vol. 37, pp. 84 839–84 865, 2024

2024
[14]

Head- aware kv cache compression for efficient visual autoregressive model- ing,

Z. Qin, Y . Lv, M. Lin, H. Guo, Z. Zhang, D. Zou, and W. Lin, “Head- aware kv cache compression for efficient visual autoregressive model- ing,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 30, 2026, pp. 24 982–24 990

2026
[15]

Gpt-4 technical report,

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023
[16]

The claude 3 model family: Opus, sonnet, haiku,

Anthropic, “The claude 3 model family: Opus, sonnet, haiku,” 2024. [Online]. Available: https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model Card Claude 3.pdf

2024
[17]

The llama 3 herd of models,

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024
[18]

Deepseek-v3 technical report,

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024
[19]

Qwen3 technical report,

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025
[20]

Neural discrete representation learning,

A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”Advances in neural information processing systems, vol. 30, 2017

2017
[21]

Taming transformers for high-resolution image synthesis,

P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 873–12 883

2021
[22]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,

J. Han, J. Liu, Y . Jiang, B. Yan, Y . Zhang, Z. Yuan, B. Peng, and X. Liu, “Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,”arXiv preprint arXiv:2412.04431, 2024

arXiv 2024
[23]

Hart: Efficient visual generation with hybrid autoregressive transformer,

H. Tang, Y . Wu, S. Yang, E. Xie, J. Chen, J. Chen, Z. Zhang, H. Cai, Y . Lu, and S. Han, “Hart: Efficient visual generation with hybrid autoregressive transformer,”arXiv preprint arXiv:2410.10812, 2024

arXiv 2024
[24]

Toward guidance-free ar visual generation via condition contrastive alignment,

H. Chen, H. Su, P. Sun, and J. Zhu, “Toward guidance-free ar visual generation via condition contrastive alignment,”arXiv preprint arXiv:2410.09347, 2024

arXiv 2024
[25]

M-var: Decoupled scale-wise autoregressive modeling for high-quality image generation,

S. Ren, Y . Yu, N. Ruiz, F. Wang, A. Yuille, and C. Xie, “M-var: Decoupled scale-wise autoregressive modeling for high-quality image generation,”arXiv preprint arXiv:2411.10433, 2024

arXiv 2024
[26]

Switti: Designing scale-wise transformers for text-to- image synthesis,

A. V oronov, D. Kuznedelev, M. Khoroshikh, V . Khrulkov, and D. Baranchuk, “Switti: Designing scale-wise transformers for text-to- image synthesis,” 2024

2024
[27]

Infinitystar: Unified spacetime autoregressive modeling JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 17 for visual generation,

J. Liu, J. Han, B. Yan, H. Wu, F. Zhu, X. Wang, Y . Jiang, B. Peng, and Z. Yuan, “Infinitystar: Unified spacetime autoregressive modeling JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 17 for visual generation,”Advances in Neural Information Processing Systems, vol. 38, pp. 170 054–170 072, 2026

2020
[28]

Videoar: Autoregressive video generation via next-frame & scale prediction,

L. Ji, X. Liu, J. Shang, S. Wang, Y . Sun, H. Wu, and H. Wang, “Videoar: Autoregressive video generation via next-frame & scale prediction,”arXiv preprint arXiv:2601.05966, 2026

arXiv 2026
[29]

Controlvar: Exploring controllable visual autoregressive modeling,

X. Li, K. Qiu, H. Chen, J. Kuen, Z. Lin, R. Singh, and B. Raj, “Controlvar: Exploring controllable visual autoregressive modeling,” arXiv preprint arXiv:2406.09750, 2024

arXiv 2024
[30]

Visual autoregressive modeling for image super-resolution,

Y . Qu, K. Yuan, J. Hao, K. Zhao, Q. Xie, M. Sun, and C. Zhou, “Visual autoregressive modeling for image super-resolution,”arXiv preprint arXiv:2501.18993, 2025

arXiv 2025
[31]

Sar3d: Autoregressive 3d object generation and understanding via multi-scale 3d vqvae,

Y . Chen, Y . Lan, S. Zhou, T. Wang, and X. Pan, “Sar3d: Autoregressive 3d object generation and understanding via multi-scale 3d vqvae,” in CVPR, 2025

2025
[32]

Mars: Mesh autoregressive model for 3d shape detailization,

J. Gao, W. Liu, W. Sun, S. Wang, X. Song, T. Shang, S. Chen, H. Li, X. Yang, Y . Yanet al., “Mars: Mesh autoregressive model for 3d shape detailization,”arXiv preprint arXiv:2502.11390, 2025

arXiv 2025
[33]

Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model,

X. Zhuang, Y . Xie, Y . Deng, L. Liang, J. Ru, Y . Yin, and Y . Zou, “Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model,”arXiv preprint arXiv:2501.12327, 2025

arXiv 2025
[34]

Onecat: Decoder-only auto-regressive model for unified understanding and generation,

H. Li, X. Peng, Y . Wang, Z. Peng, X. Chen, R. Weng, J. Wang, X. Cai, W. Dai, and H. Xiong, “Onecat: Decoder-only auto-regressive model for unified understanding and generation,”ArXiv, vol. abs/2509.03498,

arXiv
[35]

Available: https://api.semanticscholar.org/CorpusID: 281092364

[Online]. Available: https://api.semanticscholar.org/CorpusID: 281092364
[36]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache,

Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V . Braverman, B. Chen, and X. Hu, “Kivi: A tuning-free asymmetric 2bit quantization for kv cache,”arXiv preprint arXiv:2402.02750, 2024

Pith/arXiv arXiv 2024
[37]

Wkvquant: Quantizing weight and key/value cache for large language models gains more,

Y . Yue, Z. Yuan, H. Duanmu, S. Zhou, J. Wu, and L. Nie, “Wkvquant: Quantizing weight and key/value cache for large language models gains more,”arXiv preprint arXiv:2402.12065, 2024

arXiv 2024
[38]

Gear: An efficient kv cache compression recipefor near- lossless generative inference of llm,

H. Kang, Q. Zhang, S. Kundu, G. Jeong, Z. Liu, T. Krishna, and T. Zhao, “Gear: An efficient kv cache compression recipefor near- lossless generative inference of llm,”arXiv preprint arXiv:2403.05527, 2024

arXiv 2024
[39]

Zipcache: Accurate and efficient kv cache quantization with salient token iden- tification,

Y . He, L. Zhang, W. Wu, J. Liu, H. Zhou, and B. Zhuang, “Zipcache: Accurate and efficient kv cache quantization with salient token iden- tification,”arXiv preprint arXiv:2405.14256, 2024

arXiv 2024
[40]

H2o: Heavy-hitter oracle for efficient generative inference of large language models,

Z. Zhang, Y . Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y . Tian, C. R´e, C. Barrettet al., “H2o: Heavy-hitter oracle for efficient generative inference of large language models,”Advances in Neural Information Processing Systems, vol. 36, 2024

2024
[41]

Snapkv: Llm knows what you are looking for before generation,

Y . Li, Y . Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen, “Snapkv: Llm knows what you are looking for before generation,”arXiv preprint arXiv:2404.14469, 2024

Pith/arXiv arXiv 2024
[42]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,

Z. Liu, A. Desai, F. Liao, W. Wang, V . Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava, “Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,”Advances in Neural Information Processing Systems, vol. 36, 2024

2024
[43]

Transformers are multi- state rnns,

M. Oren, M. Hassid, Y . Adi, and R. Schwartz, “Transformers are multi- state rnns,”arXiv preprint arXiv:2401.06104, 2024

arXiv 2024
[44]

On the efficacy of eviction policy for key- value constrained generative language model inference,

S. Ren and K. Q. Zhu, “On the efficacy of eviction policy for key- value constrained generative language model inference,”arXiv preprint arXiv:2402.06262, 2024

arXiv 2024
[45]

Autoregressive image generation needs only a few lines of cached tokens,

Z. Qin, Y . Lv, M. Lin, Z. Zhang, C. Gan, T. Chen, and W. Lin, “Autoregressive image generation needs only a few lines of cached tokens,”arXiv preprint arXiv:2512.04857, 2025

arXiv 2025
[46]

Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference,

Z. Wan, Z. Wu, C. Liu, J. Huang, Z. Zhu, P. Jin, L. Wang, and L. Yuan, “Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference,”arXiv preprint arXiv:2406.18139, 2024

arXiv 2024
[47]

Cam: Cache merging for memory-efficient llms inference,

Y . Zhang, Y . Du, G. Luo, Y . Zhong, Z. Zhang, S. Liu, and R. Ji, “Cam: Cache merging for memory-efficient llms inference,” inForty- first International Conference on Machine Learning, 2024

2024
[48]

Minicache: Kv cache compression in depth dimension for large language models,

A. Liu, J. Liu, Z. Pan, Y . He, G. Haffari, and B. Zhuang, “Minicache: Kv cache compression in depth dimension for large language models,” arXiv preprint arXiv:2405.14366, 2024

arXiv 2024
[49]

D2o: Dynamic discriminative operations for efficient generative inference of large language models,

Z. Wan, X. Wu, Y . Zhang, Y . Xin, C. Tao, Z. Zhu, X. Wang, S. Luo, J. Xiong, and M. Zhang, “D2o: Dynamic discriminative operations for efficient generative inference of large language models,”arXiv preprint arXiv:2406.13035, 2024

arXiv 2024
[50]

Efficient streaming language models with attention sinks,

G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,”arXiv preprint arXiv:2309.17453, 2023

Pith/arXiv arXiv 2023
[51]

Duoattention: Efficient long-context llm inference with retrieval and streaming heads,

G. Xiao, J. Tang, J. Zuo, J. Guo, S. Yang, H. Tang, Y . Fu, and S. Han, “Duoattention: Efficient long-context llm inference with retrieval and streaming heads,” inInternational Conference on Learning Represen- tations, vol. 2025, 2025, pp. 37 228–37 253

2025
[52]

CAKE: Cascading and adaptive KV cache eviction with layer preferences,

Z. Qin, Y . Cao, M. Lin, W. Hu, S. Fan, K. Cheng, W. Lin, and J. Li, “CAKE: Cascading and adaptive KV cache eviction with layer preferences,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https: //openreview.net/forum?id=EQgEMAD4kv

2025
[53]

Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration,

D. Tu, D. Vashchilenko, Y . Lu, and P. Xu, “Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration,” inInternational Conference on Learning Rep- resentations, vol. 2025, 2025, pp. 219–239

2025
[54]

Collaborative decod- ing makes visual auto-regressive modeling efficient,

Z. Chen, X. Ma, G. Fang, and X. Wang, “Collaborative decod- ing makes visual auto-regressive modeling efficient,”arXiv preprint arXiv:2411.17787, 2024

arXiv 2024
[55]

Progressive supernet training for efficient visual autoregressive mod- eling,

X. Chen, Y . Shi, K. Li, H. Wang, Y . Li, X. Gu, X. Chen, and M. Lin, “Progressive supernet training for efficient visual autoregressive mod- eling,”arXiv preprint arXiv:2511.16546, 2025

arXiv 2025
[56]

Fastvar: Linear visual autoregressive modeling via cached token pruning,

H. Guo, Y . Li, T. Zhang, J. Wang, T. Dai, S.-T. Xia, and L. Benini, “Fastvar: Linear visual autoregressive modeling via cached token pruning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 011–19 021

2025
[57]

Frequency-aware au- toregressive modeling for efficient high-resolution image synthesis,

Z. Chen, J. Fan, Z. Yu, B. Zhuang, and M. Tan, “Frequency-aware au- toregressive modeling for efficient high-resolution image synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 17 140–17 149

2025
[58]

Toprovar: Efficient visual autoregressive modeling via tri-dimensional entropy-aware semantic analysis and sparsity optimization,

J. Chen, R. Lin, Z. Zheng, J. Li, M. Li, G. Luo, and X. Chen, “Toprovar: Efficient visual autoregressive modeling via tri-dimensional entropy-aware semantic analysis and sparsity optimization,”arXiv preprint arXiv:2602.22948, 2026

arXiv 2026
[59]

Litevar: Compressing visual autoregressive modelling with efficient attention and quantization,

R. Xie, T. Zhao, Z. Yuan, R. Wan, W. Gao, Z. Zhu, X. Ning, and Y . Wang, “Litevar: Compressing visual autoregressive modelling with efficient attention and quantization,”arXiv preprint arXiv:2411.17178, 2024

arXiv 2024
[60]

Sparvar: Exploring sparsity in visual autoregressive modeling for training-free acceleration,

Z. Li, N. Wang, T. Bai, C. Mei, P. Wang, S. Qiu, and J. Cheng, “Sparvar: Exploring sparsity in visual autoregressive modeling for training-free acceleration,”arXiv preprint arXiv:2602.04361, 2026

arXiv 2026
[61]

Memory-efficient visual autoregressive modeling with scale-aware kv cache compression,

K. Li, Z. Chen, C.-Y . Yang, and J.-N. Hwang, “Memory-efficient visual autoregressive modeling with scale-aware kv cache compression,” Advances in Neural Information Processing Systems, 2025

2025
[62]

Playground v2. 5: Three insights towards enhancing aesthetic quality in text-to-image generation,

D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi, “Playground v2. 5: Three insights towards enhancing aesthetic quality in text-to-image generation,”arXiv preprint arXiv:2402.17245, 2024

Pith/arXiv arXiv 2024
[63]

Human preference score v2: A solid benchmark for evaluating human pref- erences of text-to-image synthesis,

X. Wu, Y . Hao, K. Sun, Y . Chen, F. Zhu, R. Zhao, and H. Li, “Human preference score v2: A solid benchmark for evaluating human pref- erences of text-to-image synthesis,”arXiv preprint arXiv:2306.09341, 2023

Pith/arXiv arXiv 2023
[64]

Meda: Dynamic kv cache allocation for efficient multimodal long-context inference,

Z. Wan, H. Shen, X. Wang, C. Liu, Z. Mai, and M. Zhang, “Meda: Dynamic kv cache allocation for efficient multimodal long-context inference,”arXiv preprint arXiv:2502.17599, 2025

arXiv 2025
[65]

Imagereward: Learning and evaluating human preferences for text- to-image generation,

J. Xu, X. Liu, Y . Wu, Y . Tong, Q. Li, M. Ding, J. Tang, and Y . Dong, “Imagereward: Learning and evaluating human preferences for text- to-image generation,”Advances in Neural Information Processing Systems, pp. 15 903–15 935, 2023

2023
[66]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” inIEEE conference on computer vision and pattern recognition, 2009, pp. 248–255

2009
[67]

Geneval: An object- focused framework for evaluating text-to-image alignment,

D. Ghosh, H. Hajishirzi, and L. Schmidt, “Geneval: An object- focused framework for evaluating text-to-image alignment,”Advances in Neural Information Processing Systems, pp. 52 132–52 152, 2023

2023
[68]

Ella: Equip diffusion models with llm for enhanced semantic alignment,

X. Hu, R. Wang, Y . Fang, B. Fu, P. Cheng, and G. Yu, “Ella: Equip diffusion models with llm for enhanced semantic alignment,”arXiv preprint arXiv:2403.05135, 2024

Pith/arXiv arXiv 2024
[69]

Flashattention-2: Faster attention with better parallelism and work partitioning,

T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning,”arXiv preprint arXiv:2307.08691, 2023

Pith/arXiv arXiv 2023

[1] [1]

Mars: Mixture of auto-regressive models for fine- grained text-to-image synthesis,

W. He, S. Fu, M. Liu, X. Wang, W. Xiao, F. Shu, Y . Wang, L. Zhang, Z. Yu, H. Liet al., “Mars: Mixture of auto-regressive models for fine- grained text-to-image synthesis,”arXiv preprint arXiv:2407.07614, 2024

arXiv 2024

[2] [2]

Autoregressive model beats diffusion: Llama for scalable image generation,

P. Sun, Y . Jiang, S. Chen, S. Zhang, B. Peng, P. Luo, and Z. Yuan, “Autoregressive model beats diffusion: Llama for scalable image generation,”arXiv preprint arXiv:2406.06525, 2024

Pith/arXiv arXiv 2024

[3] [3]

Autoregressive image gen- eration without vector quantization,

T. Li, Y . Tian, H. Li, M. Deng, and K. He, “Autoregressive image gen- eration without vector quantization,”Advances in Neural Information Processing Systems, vol. 37, pp. 56 424–56 445, 2024

2024

[4] [4]

Janus-pro: Unified multimodal understanding and generation with data and model scaling,

X. Chen, Z. Wu, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, and C. Ruan, “Janus-pro: Unified multimodal understanding and generation with data and model scaling,”arXiv preprint arXiv:2501.17811, 2025

Pith/arXiv arXiv 2025

[5] [5]

Janus: Decoupling visual encoding for unified multimodal understanding and generation,

C. Wu, X. Chen, Z. Wu, Y . Ma, X. Liu, Z. Pan, W. Liu, Z. Xie, X. Yu, C. Ruanet al., “Janus: Decoupling visual encoding for unified multimodal understanding and generation,”arXiv preprint arXiv:2410.13848, 2024

Pith/arXiv arXiv 2024

[6] [6]

Chameleon: Mixed-modal early-fusion foundation models,

C. Team, “Chameleon: Mixed-modal early-fusion foundation models,”

[7] [7]

Available: https://arxiv.org/abs/2405.09818

[Online]. Available: https://arxiv.org/abs/2405.09818

Pith/arXiv arXiv

[8] [8]

Denoising diffusion probabilistic models,

J. Ho, A. Jain, and P. Abbeel, “Denoising diffusion probabilistic models,” inAdvances in neural information processing systems, 2020, pp. 6840–6851

2020

[9] [9]

Sdxl: Improving latent diffusion models for high-resolution image synthesis,

D. Podell, Z. English, K. Lacey, A. Blattmann, T. Dockhorn, J. M ¨uller, J. Penna, and R. Rombach, “Sdxl: Improving latent diffusion models for high-resolution image synthesis,”arXiv preprint arXiv:2307.01952, 2023

Pith/arXiv arXiv 2023

[10] [10]

Scalable diffusion models with transformers,

W. Peebles and S. Xie, “Scalable diffusion models with transformers,” inProceedings of the IEEE/CVF international conference on computer vision, 2023, pp. 4195–4205

2023

[11] [11]

Scaling rectified flow transformers for high-resolution image synthesis,

P. Esser, S. Kulal, A. Blattmann, R. Entezari, J. M ¨uller, H. Saini, Y . Levi, D. Lorenz, A. Sauer, F. Boeselet al., “Scaling rectified flow transformers for high-resolution image synthesis,” inForty-first international conference on machine learning, 2024

2024

[12] [12]

Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation,

J. Chen, C. Ge, E. Xie, Y . Wu, L. Yao, X. Ren, Z. Wang, P. Luo, H. Lu, and Z. Li, “Pixart-σ: Weak-to-strong training of diffusion transformer for 4k text-to-image generation,” inEuropean Conference on Computer Vision, 2024, pp. 74–91

2024

[13] [13]

Visual autoregres- sive modeling: Scalable image generation via next-scale prediction,

K. Tian, Y . Jiang, Z. Yuan, B. Peng, and L. Wang, “Visual autoregres- sive modeling: Scalable image generation via next-scale prediction,” Advances in neural information processing systems, vol. 37, pp. 84 839–84 865, 2024

2024

[14] [14]

Head- aware kv cache compression for efficient visual autoregressive model- ing,

Z. Qin, Y . Lv, M. Lin, H. Guo, Z. Zhang, D. Zou, and W. Lin, “Head- aware kv cache compression for efficient visual autoregressive model- ing,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 40, no. 30, 2026, pp. 24 982–24 990

2026

[15] [15]

Gpt-4 technical report,

J. Achiam, S. Adler, S. Agarwal, L. Ahmad, I. Akkaya, F. L. Aleman, D. Almeida, J. Altenschmidt, S. Altman, S. Anadkatet al., “Gpt-4 technical report,”arXiv preprint arXiv:2303.08774, 2023

Pith/arXiv arXiv 2023

[16] [16]

The claude 3 model family: Opus, sonnet, haiku,

Anthropic, “The claude 3 model family: Opus, sonnet, haiku,” 2024. [Online]. Available: https://www-cdn.anthropic.com/ de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model Card Claude 3.pdf

2024

[17] [17]

The llama 3 herd of models,

A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Yang, A. Fanet al., “The llama 3 herd of models,”arXiv preprint arXiv:2407.21783, 2024

Pith/arXiv arXiv 2024

[18] [18]

Deepseek-v3 technical report,

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical report,”arXiv preprint arXiv:2412.19437, 2024

Pith/arXiv arXiv 2024

[19] [19]

Qwen3 technical report,

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

Pith/arXiv arXiv 2025

[20] [20]

Neural discrete representation learning,

A. Van Den Oord, O. Vinyalset al., “Neural discrete representation learning,”Advances in neural information processing systems, vol. 30, 2017

2017

[21] [21]

Taming transformers for high-resolution image synthesis,

P. Esser, R. Rombach, and B. Ommer, “Taming transformers for high-resolution image synthesis,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2021, pp. 12 873–12 883

2021

[22] [22]

Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,

J. Han, J. Liu, Y . Jiang, B. Yan, Y . Zhang, Z. Yuan, B. Peng, and X. Liu, “Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis,”arXiv preprint arXiv:2412.04431, 2024

arXiv 2024

[23] [23]

Hart: Efficient visual generation with hybrid autoregressive transformer,

H. Tang, Y . Wu, S. Yang, E. Xie, J. Chen, J. Chen, Z. Zhang, H. Cai, Y . Lu, and S. Han, “Hart: Efficient visual generation with hybrid autoregressive transformer,”arXiv preprint arXiv:2410.10812, 2024

arXiv 2024

[24] [24]

Toward guidance-free ar visual generation via condition contrastive alignment,

H. Chen, H. Su, P. Sun, and J. Zhu, “Toward guidance-free ar visual generation via condition contrastive alignment,”arXiv preprint arXiv:2410.09347, 2024

arXiv 2024

[25] [25]

M-var: Decoupled scale-wise autoregressive modeling for high-quality image generation,

S. Ren, Y . Yu, N. Ruiz, F. Wang, A. Yuille, and C. Xie, “M-var: Decoupled scale-wise autoregressive modeling for high-quality image generation,”arXiv preprint arXiv:2411.10433, 2024

arXiv 2024

[26] [26]

Switti: Designing scale-wise transformers for text-to- image synthesis,

A. V oronov, D. Kuznedelev, M. Khoroshikh, V . Khrulkov, and D. Baranchuk, “Switti: Designing scale-wise transformers for text-to- image synthesis,” 2024

2024

[27] [27]

Infinitystar: Unified spacetime autoregressive modeling JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 17 for visual generation,

J. Liu, J. Han, B. Yan, H. Wu, F. Zhu, X. Wang, Y . Jiang, B. Peng, and Z. Yuan, “Infinitystar: Unified spacetime autoregressive modeling JOURNAL OF LATEX CLASS FILES, VOL. 18, NO. 9, SEPTEMBER 2020 17 for visual generation,”Advances in Neural Information Processing Systems, vol. 38, pp. 170 054–170 072, 2026

2020

[28] [28]

Videoar: Autoregressive video generation via next-frame & scale prediction,

L. Ji, X. Liu, J. Shang, S. Wang, Y . Sun, H. Wu, and H. Wang, “Videoar: Autoregressive video generation via next-frame & scale prediction,”arXiv preprint arXiv:2601.05966, 2026

arXiv 2026

[29] [29]

Controlvar: Exploring controllable visual autoregressive modeling,

X. Li, K. Qiu, H. Chen, J. Kuen, Z. Lin, R. Singh, and B. Raj, “Controlvar: Exploring controllable visual autoregressive modeling,” arXiv preprint arXiv:2406.09750, 2024

arXiv 2024

[30] [30]

Visual autoregressive modeling for image super-resolution,

Y . Qu, K. Yuan, J. Hao, K. Zhao, Q. Xie, M. Sun, and C. Zhou, “Visual autoregressive modeling for image super-resolution,”arXiv preprint arXiv:2501.18993, 2025

arXiv 2025

[31] [31]

Sar3d: Autoregressive 3d object generation and understanding via multi-scale 3d vqvae,

Y . Chen, Y . Lan, S. Zhou, T. Wang, and X. Pan, “Sar3d: Autoregressive 3d object generation and understanding via multi-scale 3d vqvae,” in CVPR, 2025

2025

[32] [32]

Mars: Mesh autoregressive model for 3d shape detailization,

J. Gao, W. Liu, W. Sun, S. Wang, X. Song, T. Shang, S. Chen, H. Li, X. Yang, Y . Yanet al., “Mars: Mesh autoregressive model for 3d shape detailization,”arXiv preprint arXiv:2502.11390, 2025

arXiv 2025

[33] [33]

Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model,

X. Zhuang, Y . Xie, Y . Deng, L. Liang, J. Ru, Y . Yin, and Y . Zou, “Vargpt: Unified understanding and generation in a visual autoregressive multimodal large language model,”arXiv preprint arXiv:2501.12327, 2025

arXiv 2025

[34] [34]

Onecat: Decoder-only auto-regressive model for unified understanding and generation,

H. Li, X. Peng, Y . Wang, Z. Peng, X. Chen, R. Weng, J. Wang, X. Cai, W. Dai, and H. Xiong, “Onecat: Decoder-only auto-regressive model for unified understanding and generation,”ArXiv, vol. abs/2509.03498,

arXiv

[35] [35]

Available: https://api.semanticscholar.org/CorpusID: 281092364

[Online]. Available: https://api.semanticscholar.org/CorpusID: 281092364

[36] [36]

Kivi: A tuning-free asymmetric 2bit quantization for kv cache,

Z. Liu, J. Yuan, H. Jin, S. Zhong, Z. Xu, V . Braverman, B. Chen, and X. Hu, “Kivi: A tuning-free asymmetric 2bit quantization for kv cache,”arXiv preprint arXiv:2402.02750, 2024

Pith/arXiv arXiv 2024

[37] [37]

Wkvquant: Quantizing weight and key/value cache for large language models gains more,

Y . Yue, Z. Yuan, H. Duanmu, S. Zhou, J. Wu, and L. Nie, “Wkvquant: Quantizing weight and key/value cache for large language models gains more,”arXiv preprint arXiv:2402.12065, 2024

arXiv 2024

[38] [38]

Gear: An efficient kv cache compression recipefor near- lossless generative inference of llm,

H. Kang, Q. Zhang, S. Kundu, G. Jeong, Z. Liu, T. Krishna, and T. Zhao, “Gear: An efficient kv cache compression recipefor near- lossless generative inference of llm,”arXiv preprint arXiv:2403.05527, 2024

arXiv 2024

[39] [39]

Zipcache: Accurate and efficient kv cache quantization with salient token iden- tification,

Y . He, L. Zhang, W. Wu, J. Liu, H. Zhou, and B. Zhuang, “Zipcache: Accurate and efficient kv cache quantization with salient token iden- tification,”arXiv preprint arXiv:2405.14256, 2024

arXiv 2024

[40] [40]

H2o: Heavy-hitter oracle for efficient generative inference of large language models,

Z. Zhang, Y . Sheng, T. Zhou, T. Chen, L. Zheng, R. Cai, Z. Song, Y . Tian, C. R´e, C. Barrettet al., “H2o: Heavy-hitter oracle for efficient generative inference of large language models,”Advances in Neural Information Processing Systems, vol. 36, 2024

2024

[41] [41]

Snapkv: Llm knows what you are looking for before generation,

Y . Li, Y . Huang, B. Yang, B. Venkitesh, A. Locatelli, H. Ye, T. Cai, P. Lewis, and D. Chen, “Snapkv: Llm knows what you are looking for before generation,”arXiv preprint arXiv:2404.14469, 2024

Pith/arXiv arXiv 2024

[42] [42]

Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,

Z. Liu, A. Desai, F. Liao, W. Wang, V . Xie, Z. Xu, A. Kyrillidis, and A. Shrivastava, “Scissorhands: Exploiting the persistence of importance hypothesis for llm kv cache compression at test time,”Advances in Neural Information Processing Systems, vol. 36, 2024

2024

[43] [43]

Transformers are multi- state rnns,

M. Oren, M. Hassid, Y . Adi, and R. Schwartz, “Transformers are multi- state rnns,”arXiv preprint arXiv:2401.06104, 2024

arXiv 2024

[44] [44]

On the efficacy of eviction policy for key- value constrained generative language model inference,

S. Ren and K. Q. Zhu, “On the efficacy of eviction policy for key- value constrained generative language model inference,”arXiv preprint arXiv:2402.06262, 2024

arXiv 2024

[45] [45]

Autoregressive image generation needs only a few lines of cached tokens,

Z. Qin, Y . Lv, M. Lin, Z. Zhang, C. Gan, T. Chen, and W. Lin, “Autoregressive image generation needs only a few lines of cached tokens,”arXiv preprint arXiv:2512.04857, 2025

arXiv 2025

[46] [46]

Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference,

Z. Wan, Z. Wu, C. Liu, J. Huang, Z. Zhu, P. Jin, L. Wang, and L. Yuan, “Look-m: Look-once optimization in kv cache for efficient multimodal long-context inference,”arXiv preprint arXiv:2406.18139, 2024

arXiv 2024

[47] [47]

Cam: Cache merging for memory-efficient llms inference,

Y . Zhang, Y . Du, G. Luo, Y . Zhong, Z. Zhang, S. Liu, and R. Ji, “Cam: Cache merging for memory-efficient llms inference,” inForty- first International Conference on Machine Learning, 2024

2024

[48] [48]

Minicache: Kv cache compression in depth dimension for large language models,

A. Liu, J. Liu, Z. Pan, Y . He, G. Haffari, and B. Zhuang, “Minicache: Kv cache compression in depth dimension for large language models,” arXiv preprint arXiv:2405.14366, 2024

arXiv 2024

[49] [49]

D2o: Dynamic discriminative operations for efficient generative inference of large language models,

Z. Wan, X. Wu, Y . Zhang, Y . Xin, C. Tao, Z. Zhu, X. Wang, S. Luo, J. Xiong, and M. Zhang, “D2o: Dynamic discriminative operations for efficient generative inference of large language models,”arXiv preprint arXiv:2406.13035, 2024

arXiv 2024

[50] [50]

Efficient streaming language models with attention sinks,

G. Xiao, Y . Tian, B. Chen, S. Han, and M. Lewis, “Efficient streaming language models with attention sinks,”arXiv preprint arXiv:2309.17453, 2023

Pith/arXiv arXiv 2023

[51] [51]

Duoattention: Efficient long-context llm inference with retrieval and streaming heads,

G. Xiao, J. Tang, J. Zuo, J. Guo, S. Yang, H. Tang, Y . Fu, and S. Han, “Duoattention: Efficient long-context llm inference with retrieval and streaming heads,” inInternational Conference on Learning Represen- tations, vol. 2025, 2025, pp. 37 228–37 253

2025

[52] [52]

CAKE: Cascading and adaptive KV cache eviction with layer preferences,

Z. Qin, Y . Cao, M. Lin, W. Hu, S. Fan, K. Cheng, W. Lin, and J. Li, “CAKE: Cascading and adaptive KV cache eviction with layer preferences,” inThe Thirteenth International Conference on Learning Representations, 2025. [Online]. Available: https: //openreview.net/forum?id=EQgEMAD4kv

2025

[53] [53]

Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration,

D. Tu, D. Vashchilenko, Y . Lu, and P. Xu, “Vl-cache: Sparsity and modality-aware kv cache compression for vision-language model inference acceleration,” inInternational Conference on Learning Rep- resentations, vol. 2025, 2025, pp. 219–239

2025

[54] [54]

Collaborative decod- ing makes visual auto-regressive modeling efficient,

Z. Chen, X. Ma, G. Fang, and X. Wang, “Collaborative decod- ing makes visual auto-regressive modeling efficient,”arXiv preprint arXiv:2411.17787, 2024

arXiv 2024

[55] [55]

Progressive supernet training for efficient visual autoregressive mod- eling,

X. Chen, Y . Shi, K. Li, H. Wang, Y . Li, X. Gu, X. Chen, and M. Lin, “Progressive supernet training for efficient visual autoregressive mod- eling,”arXiv preprint arXiv:2511.16546, 2025

arXiv 2025

[56] [56]

Fastvar: Linear visual autoregressive modeling via cached token pruning,

H. Guo, Y . Li, T. Zhang, J. Wang, T. Dai, S.-T. Xia, and L. Benini, “Fastvar: Linear visual autoregressive modeling via cached token pruning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 19 011–19 021

2025

[57] [57]

Frequency-aware au- toregressive modeling for efficient high-resolution image synthesis,

Z. Chen, J. Fan, Z. Yu, B. Zhuang, and M. Tan, “Frequency-aware au- toregressive modeling for efficient high-resolution image synthesis,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 17 140–17 149

2025

[58] [58]

Toprovar: Efficient visual autoregressive modeling via tri-dimensional entropy-aware semantic analysis and sparsity optimization,

J. Chen, R. Lin, Z. Zheng, J. Li, M. Li, G. Luo, and X. Chen, “Toprovar: Efficient visual autoregressive modeling via tri-dimensional entropy-aware semantic analysis and sparsity optimization,”arXiv preprint arXiv:2602.22948, 2026

arXiv 2026

[59] [59]

Litevar: Compressing visual autoregressive modelling with efficient attention and quantization,

R. Xie, T. Zhao, Z. Yuan, R. Wan, W. Gao, Z. Zhu, X. Ning, and Y . Wang, “Litevar: Compressing visual autoregressive modelling with efficient attention and quantization,”arXiv preprint arXiv:2411.17178, 2024

arXiv 2024

[60] [60]

Sparvar: Exploring sparsity in visual autoregressive modeling for training-free acceleration,

Z. Li, N. Wang, T. Bai, C. Mei, P. Wang, S. Qiu, and J. Cheng, “Sparvar: Exploring sparsity in visual autoregressive modeling for training-free acceleration,”arXiv preprint arXiv:2602.04361, 2026

arXiv 2026

[61] [61]

Memory-efficient visual autoregressive modeling with scale-aware kv cache compression,

K. Li, Z. Chen, C.-Y . Yang, and J.-N. Hwang, “Memory-efficient visual autoregressive modeling with scale-aware kv cache compression,” Advances in Neural Information Processing Systems, 2025

2025

[62] [62]

Playground v2. 5: Three insights towards enhancing aesthetic quality in text-to-image generation,

D. Li, A. Kamko, E. Akhgari, A. Sabet, L. Xu, and S. Doshi, “Playground v2. 5: Three insights towards enhancing aesthetic quality in text-to-image generation,”arXiv preprint arXiv:2402.17245, 2024

Pith/arXiv arXiv 2024

[63] [63]

Human preference score v2: A solid benchmark for evaluating human pref- erences of text-to-image synthesis,

X. Wu, Y . Hao, K. Sun, Y . Chen, F. Zhu, R. Zhao, and H. Li, “Human preference score v2: A solid benchmark for evaluating human pref- erences of text-to-image synthesis,”arXiv preprint arXiv:2306.09341, 2023

Pith/arXiv arXiv 2023

[64] [64]

Meda: Dynamic kv cache allocation for efficient multimodal long-context inference,

Z. Wan, H. Shen, X. Wang, C. Liu, Z. Mai, and M. Zhang, “Meda: Dynamic kv cache allocation for efficient multimodal long-context inference,”arXiv preprint arXiv:2502.17599, 2025

arXiv 2025

[65] [65]

Imagereward: Learning and evaluating human preferences for text- to-image generation,

J. Xu, X. Liu, Y . Wu, Y . Tong, Q. Li, M. Ding, J. Tang, and Y . Dong, “Imagereward: Learning and evaluating human preferences for text- to-image generation,”Advances in Neural Information Processing Systems, pp. 15 903–15 935, 2023

2023

[66] [66]

Imagenet: A large-scale hierarchical image database,

J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “Imagenet: A large-scale hierarchical image database,” inIEEE conference on computer vision and pattern recognition, 2009, pp. 248–255

2009

[67] [67]

Geneval: An object- focused framework for evaluating text-to-image alignment,

D. Ghosh, H. Hajishirzi, and L. Schmidt, “Geneval: An object- focused framework for evaluating text-to-image alignment,”Advances in Neural Information Processing Systems, pp. 52 132–52 152, 2023

2023

[68] [68]

Ella: Equip diffusion models with llm for enhanced semantic alignment,

X. Hu, R. Wang, Y . Fang, B. Fu, P. Cheng, and G. Yu, “Ella: Equip diffusion models with llm for enhanced semantic alignment,”arXiv preprint arXiv:2403.05135, 2024

Pith/arXiv arXiv 2024

[69] [69]

Flashattention-2: Faster attention with better parallelism and work partitioning,

T. Dao, “Flashattention-2: Faster attention with better parallelism and work partitioning,”arXiv preprint arXiv:2307.08691, 2023

Pith/arXiv arXiv 2023