pith. machine review for the scientific record.

arxiv: 2604.17286 · v1 · submitted 2026-04-19 · 💻 cs.CV

Recognition: unknown

Depth Adaptive Efficient Visual Autoregressive Modeling

Authors on Pith · no claims yet

Pith reviewed 2026-05-10 06:19 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual autoregressive modeling · efficient inference · adaptive depth allocation · image generation · transformer acceleration · training-free optimization

The pith

DepthVAR assigns variable computational depth to each token in visual autoregressive image models, speeding up inference by 2.3 to 3.1 times with only small quality loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Visual autoregressive models apply the same full stack of transformer layers to every generated token even when many positions require less refinement. The paper demonstrates that this fixed-depth approach leaves exploitable redundancy that can be removed without retraining. DepthVAR uses a cyclic rotated scheduler to assign different depths across positions and a dynamic process that applies only the needed layers before blending codes proportionally to depth. This produces faster generation than fixed-depth baselines while outperforming prior hard-pruning methods that discard tokens outright.
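The mechanics compress into a few lines. Below is a minimal sketch of the idea, assuming a stack of transformer blocks and per-token integer layer budgets; the masking-and-blend logic is an editorial reading of the abstract, not the authors' implementation, and a real speedup requires skipping compute for inactive tokens rather than masking it after the fact.

```python
import torch
import torch.nn as nn

def adaptive_depth_forward(blocks: nn.ModuleList, x: torch.Tensor,
                           depths: torch.Tensor) -> torch.Tensor:
    """Sketch of per-token adaptive depth with proportional code blending.

    x: (tokens, dim) token codes; depths: (tokens,) layer budgets in [1, L].
    """
    n_layers = len(blocks)
    out = x
    for layer_idx, block in enumerate(blocks):
        active = depths > layer_idx      # one layer-major mask column per layer
        if not active.any():
            break                        # every token has exhausted its budget
        updated = block(out)
        # Freeze tokens whose budget does not cover this layer.
        out = torch.where(active.unsqueeze(-1), updated, out)
    # Proportional code blending (one plausible reading): each token's final
    # code leans on the processed state in proportion to the depth it received.
    frac = (depths.float() / n_layers).unsqueeze(-1)
    return frac * out + (1.0 - frac) * x
```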

Core claim

Visual autoregressive models possess significant depth redundancy. This redundancy can be exploited by a training-free adaptive allocation scheme that assigns per-token computational depth through a cyclic rotated scheduler, translates those assignments into layer-major masks that selectively run transformer blocks, and then blends the resulting codes so that each token's contribution scales in proportion to its received depth.

What carries the argument

The adaptive depth scheduler, which cycles depth assignments across tokens, together with the code-blending step, which scales each token's output influence in proportion to its allocated depth.
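A sketch of what a scheduler of this kind could look like, in NumPy. The percentile rotation and the linear schedule below are illustrative stand-ins for the paper's Adaptive Depth Score Scheduler and Cyclic Percentile Rotation; the rotation deliberately shifts which percentile band each token lands in from step to step, so no position is starved forever.

```python
import numpy as np

def cyclic_depth_schedule(change_signal: np.ndarray, n_layers: int,
                          step: int, min_depth: int = 4) -> np.ndarray:
    """Assign per-token depth budgets from a change signal, with rotation.

    change_signal: (tokens,) feature change from the previous scale.
    step: generation step index, used to rotate the percentile assignment.
    """
    n = change_signal.shape[0]
    ranks = change_signal.argsort().argsort()   # rank 0 = least change
    pct = (ranks + 0.5) / n                     # normalize ranks to (0, 1)
    pct = (pct + step / n_layers) % 1.0         # cyclic rotation across steps
    # Monotone map from rotated percentile to an integer layer budget.
    depths = min_depth + np.rint(pct * (n_layers - min_depth)).astype(int)
    return np.clip(depths, min_depth, n_layers)
```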

If this is right

  • Lower-depth tokens skip later transformer layers, directly reducing multiply-add operations per generation step (a back-of-envelope cost sketch follows this list).
  • The cyclic schedule distributes refinement evenly so no single position is chronically under- or over-processed.
  • Code blending ensures the final representation for each token reflects the precise fraction of computation it received.
  • The resulting images maintain competitive quality at 2.3 to 3.1 times the baseline speed.
  • The adaptive scheme yields a better quality-speed curve than binary token-pruning baselines that remove positions entirely.
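To make the first bullet concrete, here is a back-of-envelope cost model. It is an editorial simplification: per-token cost is taken as linear in the number of layers run, attention cross-terms and kernel overheads are ignored, and the depths are hypothetical, not the paper's.

```python
# Hypothetical numbers; the paper reports 2.3x-3.1x end to end.
n_layers = 30                            # assumed full depth
token_depths = [28, 20, 12, 8, 6, 4]     # assumed per-token budgets
avg_depth = sum(token_depths) / len(token_depths)
ideal_speedup = n_layers / avg_depth     # linear cost model
print(f"average depth {avg_depth:.1f} -> ideal speedup {ideal_speedup:.2f}x")
# average depth 13.0 -> ideal speedup 2.31x
```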

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-token depth variation could be tested in autoregressive models for video or audio to see whether similar redundancy exists outside images.
  • If the cyclic schedule proves robust, it might be replaced by a lightweight learned predictor without losing the training-free property.
  • Hardware that supports dynamic layer skipping would amplify the reported speedups beyond what software masking alone achieves.

Load-bearing premise

VAR models contain enough distributed depth redundancy that cyclic adaptive allocation plus proportional code blending recovers nearly the original generation quality.

What would settle it

Running the full DepthVAR procedure on a standard VAR checkpoint at the claimed speedups: if FID or perceptual quality metrics degrade far more than the minimal loss reported in the paper's experiments, the core claim fails.
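Sketched as a harness, with hypothetical stand-ins: `generate_images` and `compute_fid` below are placeholders, not the paper's code or any real library's API.

```python
from typing import Sequence

def generate_images(checkpoint: str, prompts: Sequence[str],
                    adaptive_depth: bool) -> list:
    raise NotImplementedError  # stand-in for a real VAR sampling loop

def compute_fid(images: list) -> float:
    raise NotImplementedError  # stand-in for a real FID evaluator

def claim_refuted(checkpoint: str, prompts: Sequence[str],
                  fid_budget: float = 1.0) -> bool:
    """True if adaptive-depth generation degrades FID beyond `fid_budget`."""
    base = generate_images(checkpoint, prompts, adaptive_depth=False)
    fast = generate_images(checkpoint, prompts, adaptive_depth=True)
    return compute_fid(fast) - compute_fid(base) > fid_budget
```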

Figures

Figures reproduced from arXiv: 2604.17286 by Chunliang Li, Sanyuan Zhao, Tianze Cao.

Figure 1
Figure 1: Comparison of VAR acceleration paradigms. (a) Hard token pruning (e.g. [24]) discards tokens. (b) Sparse token selection (e.g. [5]) retains anchor tokens to preserve background structure. (c) We adaptively vary the layers processed per token. view at source ↗
Figure 2
Figure 2: Limitations of frequency-based pruning. (a) The accuracy of frequency map approximation, a common heuristic for token pruning, correlates poorly with final image quality. (b) Employing a perfect oracle frequency mask for hard-pruning still results in significant quality degradation, which points to an inherent flaw in the strategy. view at source ↗
Figure 3
Figure 3: Evidence of depth redundancy in pretrained VAR models. (a) Token-wise representation similarity between consecutive layers shows saturation at different depths (darker colors for smaller scales). (b) Generation quality peaks before the final layer with early exiting, confirming full depth is not always optimal. view at source ↗
Figure 4
Figure 4: Overview of dynamic depth inference in DepthVAR. Left: At each scale i, we first use the Adaptive Depth Score Scheduler to generate adaptive depth scores S_i using layer-wise changes from the previous scale, which are converted to a layer-major mask M_i via bit-reversal. The dynamic depth inference performs masked prediction using M_i, reinstating cached layer behaviors from the last scale, and blends the resulting … view at source ↗
Figure 5
Figure 5: Adaptive Depth Score Scheduler pipeline. Feature changes from scale i − 1 are aggregated, upsampled, and normalized into percentiles, which are then mapped via a schedule function to continuous depth scores for scale i. view at source ↗
Figure 6
Figure 6: Qualitative visual comparisons between our method, the baseline, and other approaches with relatively fixed inference latency. view at source ↗
Figure 7
Figure 7: Visualization of depth maps in the presence and absence of Cyclic Percentile Rotation. This rotation operation enables updates … view at source ↗
Figure 8
Figure 8: Profiles of different schedule functions under the same constraint. Linear a and b pass (0, 1) and (1, 0). view at source ↗
Figure 9
Figure 9: Sensitivity analysis of [ℓ_begin, ℓ_end] = [a, b], E = E_MAE. Panels show SSIM mean and SSIM standard deviation over grids of a-index and b-index. view at source ↗
Figure 10
Figure 10: Sensitivity analysis of [ℓ_begin, ℓ_end] = [a, b], E = E_MSE. view at source ↗
Figure 11
Figure 11: (a) Comparison of reference metrics, where … view at source ↗
Figure 13
Figure 13: Generalization of depth redundancy to the HART [54] … view at source ↗
Figure 15
Figure 15: Sensitivity analysis over additional hyperparameters. view at source ↗
Figure 16
Figure 16: Qualitative failure cases. (a) Loss of fine-grained de… view at source ↗
Figure 17
Figure 17: Additional qualitative visual comparisons from the HPSv2.1 benchmark. DepthVAR consistently preserves visual fidelity and … view at source ↗
read the original abstract

Visual Autoregressive (VAR) modeling inefficiently applies a fixed computational depth to each position when generating high-resolution images. While existing methods accelerate inference by pruning tokens using frequency maps, their binary hard-pruning approach is fundamentally limited and fails to improve quality even with better frequency estimation. Observing that VAR models possess significant depth redundancy, we propose a paradigm shift from pruning entire tokens to adaptively allocating per-token computational depth. To this end, we introduce DepthVAR, a training-free framework that dynamically allocates computation. It integrates an adaptive depth scheduler, which assigns computational depth via a cyclic rotated schedule for balanced, non-static refinement, with a dynamic inference process that translates these depths into layer-major masks, selectively applies transformer blocks, and blends the resulting codes to ensure each token's influence is proportional to its processing depth. Extensive experiments show that DepthVAR achieves 2.3×–3.1× acceleration with minimal quality loss, offering a competitive compute-performance trade-off compared to existing hard-pruning approaches. Code is available at https://github.com/STOVAGtz/DepthVAR

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces DepthVAR, a training-free framework for Visual Autoregressive (VAR) models that exploits observed depth redundancy by adaptively allocating per-token computational depth. It uses a cyclic rotated scheduler for balanced refinement, translates depths into layer-major masks, selectively applies transformer blocks, and blends codes proportionally to depth. Experiments claim 2.3×-3.1× inference acceleration with minimal quality loss relative to hard-pruning baselines.

Significance. If the central claim holds, the shift from binary token pruning to depth-adaptive allocation could improve efficiency-quality trade-offs in autoregressive image generation without retraining. The training-free design and public code release are clear strengths enabling reproducibility. The main soundness concern, detailed below, is missing detail on ablations, baselines, and verification that blending preserves output distributions.

major comments (1)
  1. [Dynamic inference process description] The load-bearing assumption that cyclic scheduling plus proportional code blending produces hidden states whose statistics remain close to full-depth training (preventing drift in subsequent autoregressive predictions) is stated without derivation, ablation, or analysis of in-distribution properties for the blended codes.
minor comments (1)
  1. [Abstract] The abstract should explicitly state the VAR model variants, image resolutions, and quantitative metrics (FID, etc.) underlying the 2.3×-3.1× acceleration range and 'minimal quality loss' claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for your thorough review and valuable feedback on our paper 'Depth Adaptive Efficient Visual Autoregressive Modeling'. We address the major comment point-by-point below and commit to revisions that strengthen the presentation of our dynamic inference process.

read point-by-point responses
  1. Referee: The load-bearing assumption that cyclic scheduling plus proportional code blending produces hidden states whose statistics remain close to full-depth training (preventing drift in subsequent autoregressive predictions) is stated without derivation, ablation, or analysis of in-distribution properties for the blended codes.

    Authors: We agree that additional justification for the blending mechanism is warranted. Although our experiments demonstrate that DepthVAR maintains competitive image quality with significant speedups, indicating that any distributional drift is not detrimental to the autoregressive generation process, we will enhance the manuscript with: a more detailed description of the blending operation and its motivation; new ablations isolating the effect of proportional blending versus alternatives (e.g., no blending or hard selection); and quantitative analysis comparing the statistics of blended hidden states to full-depth ones, including cosine similarity and norm differences at various layers. These additions will be included in the revised version to better verify preservation of in-distribution properties.

    revision: yes
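The promised hidden-state diagnostic is straightforward to sketch. The tensors below are random stand-ins, and `blend_drift_stats` is an illustrative name, not code from the paper.

```python
import torch
import torch.nn.functional as F

def blend_drift_stats(h_full: torch.Tensor, h_blend: torch.Tensor):
    """Cosine similarity and norm gap between full-depth and blended states."""
    cos = F.cosine_similarity(h_full, h_blend, dim=-1)            # per token
    norm_gap = (h_blend.norm(dim=-1) - h_full.norm(dim=-1)).abs()
    return cos.mean().item(), norm_gap.mean().item()

# Stand-in activations for one layer: 256 tokens, width 1024.
h_full = torch.randn(256, 1024)
h_blend = h_full + 0.05 * torch.randn_like(h_full)
print(blend_drift_stats(h_full, h_blend))
```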

Circularity Check

0 steps flagged

No significant circularity; method is training-free and observation-driven

full rationale

The paper's derivation chain consists of an empirical observation of depth redundancy in VAR models followed by a proposed training-free framework (cyclic rotated scheduler, layer-major masks, and proportional code blending). No equations, fitted parameters, or self-citations are presented that reduce the central claims to inputs by construction. The approach is explicitly positioned as a paradigm shift grounded in observation and validated by experiments, with no load-bearing self-referential steps or ansatzes imported from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption of exploitable depth redundancy in VAR models and the effectiveness of the cyclic scheduler and blending procedure; no explicit free parameters or invented entities are described in the abstract, but the scheduler likely involves tunable rotation and depth assignment rules.

axioms (1)
  • domain assumption VAR models exhibit significant depth redundancy across tokens
    Invoked to justify adaptive allocation instead of fixed depth or pruning

pith-pipeline@v0.9.0 · 5487 in / 1147 out tokens · 36166 ms · 2026-05-10T06:19:15.350171+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

67 extracted references · 21 canonical work pages · 8 internal anchors

  1. [1]

    BinaryBERT: Pushing the limit of BERT quantization

    Haoli Bai, Wei Zhang, Lu Hou, Lifeng Shang, Jin Jin, Xin Jiang, Qun Liu, Michael Lyu, and Irwin King. BinaryBERT: Pushing the limit of BERT quantization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 4...

  2. [2]

    Adaptive neural networks for efficient inference

    Tolga Bolukbasi, Joseph Wang, Ofer Dekel, and Venkatesh Saligrama. Adaptive neural networks for efficient inference. In International Conference on Machine Learning, pages 527–

  3. [3]

    Token merging: Your ViT but faster

    Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your ViT but faster. In International Conference on Learning Representations, 2023.

  4. [4]

    TTS-VAR: A test-time scaling framework for visual auto-regressive generation

    Zhekai Chen, Ruihang Chu, Yukang Chen, Shiwei Zhang, Yujie Wei, Yingya Zhang, and Xihui Liu. TTS-VAR: A test-time scaling framework for visual auto-regressive generation. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  5. [5]

    Frequency-aware autoregressive modeling for efficient high-resolution image synthesis

    Zhuokun Chen, Jugang Fan, Zhuowei Yu, Bohan Zhuang, and Mingkui Tan. Frequency-aware autoregressive modeling for efficient high-resolution image synthesis. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 17140–17149, 2025.

  6. [6]

    Collaborative decoding makes visual auto-regressive modeling efficient

    Zigeng Chen, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Collaborative decoding makes visual auto-regressive modeling efficient. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 23334–23344, 2025.

  7. [7]

    A survey of techniques for optimizing transformer inference

    Krishna Teja Chitty-Venkata, Sparsh Mittal, Murali Emani, Venkatram Vishwanath, and Arun K Somani. A survey of techniques for optimizing transformer inference. Journal of Systems Architecture, 144:102990, 2023.

  8. [8]

    An algorithm for the machine calculation of complex Fourier series

    James W Cooley and John W Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.

  9. [9]

    FlashAttention: Fast and memory-efficient exact attention with IO-awareness

    Tri Dao, Dan Fu, Stefano Ermon, Atri Rudra, and Christopher Ré. FlashAttention: Fast and memory-efficient exact attention with IO-awareness. Advances in Neural Information Processing Systems, 35:16344–16359, 2022.

  10. [10]

    Universal transformers

    Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. arXiv preprint arXiv:1807.03819, 2018.

  11. [11]

    SkipDecode: Autoregressive skip decoding with batching and caching for efficient LLM inference

    Luciano Del Corro, Allie Del Giorno, Sahaj Agarwal, Bin Yu, Ahmed Awadallah, and Subhabrata Mukherjee. SkipDecode: Autoregressive skip decoding with batching and caching for efficient LLM inference. arXiv preprint arXiv:2307.02628, 2023.

  12. [12]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 4171–4186, 2019.

  13. [13]

    Depth-adaptive transformer

    Maha Elbayad, Jiatao Gu, Edouard Grave, and Michael Auli. Depth-adaptive transformer. In ICLR 2020 - Eighth International Conference on Learning Representations, pages 1–14, 2020.

  14. [14]

    LayerSkip: Enabling early exit inference and self-speculative decoding

    Mostafa Elhoushi, Akshat Shrivastava, Diana Liskovich, Basil Hosmer, Bram Wasti, Liangzhen Lai, Anas Mahmoud, Bilge Acun, Saurabh Agarwal, Ahmed Roman, et al. LayerSkip: Enabling early exit inference and self-speculative decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1...

  15. [15]

    Fast bit-reversal algorithms

    Anne C Elster. Fast bit-reversal algorithms. In International Conference on Acoustics, Speech, and Signal Processing, pages 1099–1102. IEEE, 1989.

  16. [16]

    Reducing transformer depth on demand with structured dropout

    Angela Fan, Edouard Grave, and Armand Joulin. Reducing transformer depth on demand with structured dropout. arXiv preprint arXiv:1909.11556, 2019.

  17. [17]

    Position-aware depth decay decoding (D3): Boosting large language model inference efficiency

    Siqi Fan, Xuezhi Fang, Xingrun Xing, Peng Han, Shuo Shang, and Yequan Wang. Position-aware depth decay decoding (D3): Boosting large language model inference efficiency. arXiv preprint arXiv:2503.08524, 2025.

  18. [18]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1–39, 2022.

  19. [19]

    DeeCap: Dynamic early exiting for efficient image captioning

    Zhengcong Fei, Xu Yan, Shuhui Wang, and Qi Tian. DeeCap: Dynamic early exiting for efficient image captioning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12216–12226, 2022.

  20. [20]

    GenEval: An object-focused framework for evaluating text-to-image alignment

    Dhruba Ghosh, Hannaneh Hajishirzi, and Ludwig Schmidt. GenEval: An object-focused framework for evaluating text-to-image alignment. Advances in Neural Information Processing Systems, 36:52132–52152, 2023.

  21. [21]

    Adaptive computation time for recurrent neural networks

    Alex Graves. Adaptive computation time for recurrent neural networks. arXiv preprint arXiv:1603.08983, 2016.

  22. [22]

    Mamba: Linear-time sequence modeling with selective state spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. In First Conference on Language Modeling, 2024.

  23. [23]

    MiniLLM: Knowledge distillation of large language models

    Yuxian Gu, Li Dong, Furu Wei, and Minlie Huang. MiniLLM: Knowledge distillation of large language models. arXiv preprint arXiv:2306.08543, 2023.

  24. [24]

    FastVAR: Linear visual autoregressive modeling via cached token pruning

    Hang Guo, Yawei Li, Taolin Zhang, Jiangshan Wang, Tao Dai, Shu-Tao Xia, and Luca Benini. FastVAR: Linear visual autoregressive modeling via cached token pruning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 19011–19021, 2025.

  25. [25]

    Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis

    Jian Han, Jinlai Liu, Yi Jiang, Bin Yan, Yuqi Zhang, Zehuan Yuan, Bingyue Peng, and Xiaobing Liu. Infinity: Scaling bitwise autoregressive modeling for high-resolution image synthesis. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 15733–15744, 2025.

  26. [26]

    Router-tuning: A simple and effective approach for dynamic depth

    Shwai He, Tao Ge, Guoheng Sun, Bowei Tian, Xiaoyang Wang, and Dong Yu. Router-tuning: A simple and effective approach for dynamic depth. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 1925–1938, 2025.

  27. [27]

    Rotary position embedding for vision transformer

    Byeongho Heo, Song Park, Dongyoon Han, and Sangdoo Yun. Rotary position embedding for vision transformer. In European Conference on Computer Vision, pages 289–305. Springer, 2024.

  28. [28]

    Distilling the knowledge in a neural network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015.

  29. [29]

    DynaBERT: Dynamic BERT with adaptive width and depth

    Lu Hou, Zhiqi Huang, Lifeng Shang, Xin Jiang, Xiao Chen, and Qun Liu. DynaBERT: Dynamic BERT with adaptive width and depth. Advances in Neural Information Processing Systems, 33:9782–9793, 2020.

  30. [30]

    Multi-scale dense networks for resource efficient image classification

    Gao Huang, Danlu Chen, Tianhong Li, Felix Wu, Laurens Van Der Maaten, and Kilian Q Weinberger. Multi-scale dense networks for resource efficient image classification. arXiv preprint arXiv:1703.09844, 2017.

  31. [31]

    On computational limits and provably efficient criteria of visual autoregressive models: A fine-grained complexity analysis

    Yekun Ke, Xiaoyu Li, Yingyu Liang, Zhizhou Sha, Zhenmei Shi, and Zhao Song. On computational limits and provably efficient criteria of visual autoregressive models: A fine-grained complexity analysis. arXiv preprint arXiv:2501.04377, 2025.

  32. [32]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009.

  33. [33]

    HMAR: Efficient hierarchical masked auto-regressive image generation

    Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Ming-Yu Liu, Xihui Liu, Ziwei Liu, Daniel Y Fu, Christopher Re, and David W Romero. HMAR: Efficient hierarchical masked auto-regressive image generation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2535–2544, 2025.

  34. [34]

    GShard: Scaling giant models with conditional computation and automatic sharding

    Dmitry Lepikhin, HyoukJoong Lee, Yuanzhong Xu, Dehao Chen, Orhan Firat, Yanping Huang, Maxim Krikun, Noam Shazeer, and Zhifeng Chen. GShard: Scaling giant models with conditional computation and automatic sharding. arXiv preprint arXiv:2006.16668, 2020.

  35. [35]

    Fast inference from transformers via speculative decoding

    Yaniv Leviathan, Matan Kalman, and Yossi Matias. Fast inference from transformers via speculative decoding. In International Conference on Machine Learning, pages 19274–19286. PMLR, 2023.

  36. [36]

    SkipVAR: Accelerating visual autoregressive modeling via adaptive frequency-aware skipping

    Jiajun Li, Yue Ma, Xinyu Zhang, Qingyan Wei, Songhua Liu, and Linfeng Zhang. SkipVAR: Accelerating visual autoregressive modeling via adaptive frequency-aware skipping. arXiv preprint arXiv:2506.08908, 2025.

  37. [37]

    Memory-efficient visual autoregressive modeling with scale-aware KV cache compression

    Kunjun Li, Zigeng Chen, Cheng-Yen Yang, and Jenq-Neng Hwang. Memory-efficient visual autoregressive modeling with scale-aware KV cache compression. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.

  38. [38]

    FreqExit: Enabling early-exit inference for visual autoregressive models via frequency-aware guidance

    Ying Li, Chengfei Lv, and Huan Wang. FreqExit: Enabling early-exit inference for visual autoregressive models via frequency-aware guidance. In NeurIPS, 2025.

  39. [39]

    SpinQuant: LLM quantization with learned rotations

    Zechun Liu, Changsheng Zhao, Igor Fedorov, Bilge Soran, Dhruv Choudhary, Raghuraman Krishnamoorthi, Vikas Chandra, Yuandong Tian, and Tijmen Blankevoort. SpinQuant: LLM quantization with learned rotations. In The Thirteenth International Conference on Learning Representations.

  40. [40]

    Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models

    Jianmo Ni, Gustavo Hernandez Abrego, Noah Constant, Ji Ma, Keith Hall, Daniel Cer, and Yinfei Yang. Sentence-T5: Scalable sentence encoders from pre-trained text-to-text models. In Findings of the Association for Computational Linguistics: ACL 2022, pages 1864–1874, 2022.

  41. [41]

    gpt-oss-120b & gpt-oss-20b model card

    OpenAI. gpt-oss-120b & gpt-oss-20b model card, 2025.

  42. [42]

    DynamicViT: Efficient vision transformers with dynamic token sparsification

    Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. DynamicViT: Efficient vision transformers with dynamic token sparsification. Advances in Neural Information Processing Systems, 34:13937–13949, 2021.

  43. [43]

    Mixture-of-depths: Dynamically allocating compute in transformer-based language models

    David Raposo, Sam Ritter, Blake Richards, Timothy Lillicrap, Peter Conway Humphreys, and Adam Santoro. Mixture-of-depths: Dynamically allocating compute in transformer-based language models. arXiv preprint arXiv:2404.02258, 2024.

  44. [44]

    M-VAR: Decoupled scale-wise autoregressive modeling for high-quality image generation

    Sucheng Ren, Yaodong Yu, Nataniel Ruiz, Feng Wang, Alan Yuille, and Cihang Xie. M-VAR: Decoupled scale-wise autoregressive modeling for high-quality image generation. arXiv preprint arXiv:2411.10433, 2024.

  45. [45]

    DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter

    Victor Sanh, Lysandre Debut, Julien Chaumond, and Thomas Wolf. DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. arXiv preprint arXiv:1910.01108, 2019.

  46. [46]

    Consistent accelerated inference via confident adaptive transformers

    Tal Schuster, Adam Fisch, Tommi Jaakkola, and Regina Barzilay. Consistent accelerated inference via confident adaptive transformers. arXiv preprint arXiv:2104.08803, 2021.

  47. [47]

    Confident adaptive language modeling

    Tal Schuster, Adam Fisch, Jai Gupta, Mostafa Dehghani, Dara Bahri, Vinh Tran, Yi Tay, and Donald Metzler. Confident adaptive language modeling. Advances in Neural Information Processing Systems, 35:17456–17472, 2022.

  48. [48]

    The right tool for the job: Matching model and instance complexities

    Roy Schwartz, Gabriel Stanovsky, Swabha Swayamdipta, Jesse Dodge, and Noah A Smith. The right tool for the job: Matching model and instance complexities. arXiv preprint arXiv:2004.07453, 2020.

  49. [49]

    Early exit is a natural capability in transformer-based models: An empirical study on early exit without joint optimization

    Weiqiao Shan, Long Meng, Tong Zheng, Yingfeng Luo, Bei Li, Tong Xiao, Jingbo Zhu, et al. Early exit is a natural capability in transformer-based models: An empirical study on early exit without joint optimization. arXiv preprint arXiv:2412.01455, 2024.

  50. [50]

    Outrageously large neural networks: The sparsely-gated mixture-of-experts layer

    Noam Shazeer, Azalia Mirhoseini, Krzysztof Maziarz, Andy Davis, Quoc Le, Geoffrey Hinton, and Jeff Dean. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer. arXiv preprint arXiv:1701.06538, 2017.

  51. [51]

    Q-BERT: Hessian based ultra low precision quantization of BERT

    Sheng Shen, Zhen Dong, Jiayu Ye, Linjian Ma, Zhewei Yao, Amir Gholami, Michael W Mahoney, and Kurt Keutzer. Q-BERT: Hessian based ultra low precision quantization of BERT. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 8815–8821, 2020.

  52. [52]

    Blockwise parallel decoding for deep autoregressive models

    Mitchell Stern, Noam Shazeer, and Jakob Uszkoreit. Blockwise parallel decoding for deep autoregressive models. Advances in Neural Information Processing Systems, 31, 2018.

  53. [53]

    MobileBERT: A compact task-agnostic BERT for resource-limited devices

    Zhiqing Sun, Hongkun Yu, Xiaodan Song, Renjie Liu, Yiming Yang, and Denny Zhou. MobileBERT: A compact task-agnostic BERT for resource-limited devices. arXiv preprint arXiv:2004.02984, 2020.

  54. [54]

    HART: Efficient visual generation with hybrid autoregressive transformer

    Haotian Tang, Yecheng Wu, Shang Yang, Enze Xie, Junsong Chen, Junyu Chen, Zhuoyang Zhang, Han Cai, Yao Lu, and Song Han. HART: Efficient visual generation with hybrid autoregressive transformer. In The Thirteenth International Conference on Learning Representations.

  55. [55]

    You need multiple exiting: Dynamic early exiting for accelerating unified vision language model

    Shengkun Tang, Yaqing Wang, Zhenglun Kong, Tianchi Zhang, Yao Li, Caiwen Ding, Yanzhi Wang, Yi Liang, and Dongkuan Xu. You need multiple exiting: Dynamic early exiting for accelerating unified vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10781–10791, 2023.

  56. [56]

    A survey on transformer compression

    Yehui Tang, Yunhe Wang, Jianyuan Guo, Zhijun Tu, Kai Han, Hailin Hu, and Dacheng Tao. A survey on transformer compression. arXiv preprint arXiv:2402.05964, 2024.

  57. [57]

    BranchyNet: Fast inference via early exiting from deep neural networks

    Surat Teerapittayanon, Bradley McDanel, and Hsiang-Tsung Kung. BranchyNet: Fast inference via early exiting from deep neural networks. In 2016 23rd International Conference on Pattern Recognition (ICPR), pages 2464–2469. IEEE, 2016.

  58. [58]

    Visual autoregressive modeling: Scalable image generation via next-scale prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in Neural Information Processing Systems, 37:84839–84865, 2024.

  59. [59]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  60. [60]

    Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis

    Xiaoshi Wu, Yiming Hao, Keqiang Sun, Yixiong Chen, Feng Zhu, Rui Zhao, and Hongsheng Li. Human preference score v2: A solid benchmark for evaluating human preferences of text-to-image synthesis. arXiv preprint arXiv:2306.09341, 2023.

  61. [61]

    LiteVAR: Compressing visual autoregressive modelling with efficient attention and quantization

    Rui Xie, Tianchen Zhao, Zhihang Yuan, Rui Wan, Wenxi Gao, Zhenhua Zhu, Xuefei Ning, and Yu Wang. LiteVAR: Compressing visual autoregressive modelling with efficient attention and quantization. In Workshop on Machine Learning and Compression, NeurIPS 2024.

  62. [62]

    BERxiT: Early exiting for BERT with better fine-tuning and extension to regression

    Ji Xin, Raphael Tang, Yaoliang Yu, and Jimmy Lin. BERxiT: Early exiting for BERT with better fine-tuning and extension to regression. In Proceedings of the 16th Conference of the European Chapter of the Association for Computational Linguistics: Main Volume, pages 91–104, 2021.

  63. [63]

    ImageReward: Learning and evaluating human preferences for text-to-image generation

    Jiazheng Xu, Xiao Liu, Yuchen Wu, Yuxuan Tong, Qinkai Li, Ming Ding, Jie Tang, and Yuxiao Dong. ImageReward: Learning and evaluating human preferences for text-to-image generation. In Proceedings of the 37th International Conference on Neural Information Processing Systems, pages 15903–15935, 2023.

  64. [64]

    BERT loses patience: Fast and robust inference with early exit

    Wangchunshu Zhou, Canwen Xu, Tao Ge, Julian McAuley, Ke Xu, and Furu Wei. BERT loses patience: Fast and robust inference with early exit. Advances in Neural Information Processing Systems, 33:18330–18341, 2020.

  65. [65]

    LeeBERT: Learned early exit for BERT with cross-level optimization

    Wei Zhu. LeeBERT: Learned early exit for BERT with cross-level optimization. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pages 2968–2980, 2021.

  66. [66]

    Ablation on different reference metrics (supplementary-material fragment, not a citation)

    From Supplementary Material A of the paper: the reference metric E and its layer range [ℓ_begin, ℓ_end] determine the base decision rank map that guides depth allocation. These choices are ablated in Table 6, including metrics analogous to those in SparseVAR [5] (E_MSE on Bl...

  67. [67]

    Supplementary Table 6 fragment: reference metrics under different reference scales R (not a citation)

    Reconstructed from the flattened extraction, at R = 7: E_MAE with [ℓ_begin, ℓ_end] = [3,19] scores GenEval 0.7256 at 1168 ms average latency and ImageReward 0.9088 at 1174 ms; E_MAE [0,31]: 0.7216 / 1228 ms and 0.9081 / 1253 ms; E_MSE [3,19]: 0.7219 / 1217 ms and 0.9094 / 1214 ms; E_MSE [0,31]: 0.7304 / 1270 ms and 0.8948 / 1295 ms; E_MSE Block 3: 0.7198 / 1164 ms and 0.9078 / 1184 ms; E_SUB (no layer range): 0.7210 / 124...