pith. machine review for the scientific record.

arxiv: 2605.10045 · v1 · submitted 2026-05-11 · 💻 cs.CV

Recognition: 2 theorem links


ExtraVAR: Stage-Aware RoPE Remapping for Resolution Extrapolation in Visual Autoregressive Models

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 02:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual autoregressive models · resolution extrapolation · RoPE remapping · attention calibration · image synthesis · training-free adaptation · frequency bands · entropy scaling

The pith

Stage-aware frequency remapping and entropy-based attention scaling let visual autoregressive models generate higher-resolution images without repetition or lost detail.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that three common failure modes in high-resolution VAR generation all stem from a single mismatch between RoPE frequency bands and the coarse-to-fine generation stages. It introduces two training-free fixes: one that gives each frequency band a stage-specific remapping rule, and another that computes a per-head scaling factor from normalized entropy to keep attention behavior consistent across resolutions. If these fixes work, VAR models trained at one resolution can directly produce coherent, detailed outputs at larger sizes without retraining or hand-tuned constants.

Core claim

Visual autoregressive models generate images stage by stage, with each stage dominated by a different RoPE frequency band; when resolution increases, the dominant band for a stage is no longer aligned, producing global repetition, local repetition, or detail loss. Stage-Aware RoPE Remapping assigns a distinct remapping rule to each band so that every stage retains its intended frequency behavior. Entropy-Driven Adaptive Attention Calibration measures attention dispersion with a resolution-invariant normalized entropy and supplies a closed-form per-head scale that restores the training-resolution entropy distribution at the new resolution.
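
The entropy side of this claim can be made concrete with a small sketch. The normalized-entropy definition below is standard; the per-head factor is a hypothetical log-length heuristic in the spirit of YaRN / Scalable-Softmax temperature corrections, not the paper's actual closed form, which is not reproduced here.

```python
import numpy as np

def normalized_entropy(attn_row):
    """Shannon entropy of one attention row divided by log(N),
    giving a resolution-invariant dispersion measure in [0, 1]."""
    p = np.asarray(attn_row, dtype=float)
    p = p / p.sum()
    return float(-(p * np.log(p + 1e-12)).sum() / np.log(len(p)))

def logit_scale(n_train, n_test):
    """Hypothetical per-head scale: a log-length ratio, as used in
    YaRN / Scalable-Softmax style corrections. The paper instead
    derives its factor from measured normalized entropy."""
    return np.log(n_test) / np.log(n_train)

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

# Sharpening logits by a factor > 1 reduces dispersion, pushing the
# extrapolated-resolution entropy back toward its training value.
logits = 0.1 * np.arange(64)
h_naive = normalized_entropy(softmax(logits))
h_calibrated = normalized_entropy(softmax(logits * logit_scale(256, 1024)))
```

Because the entropy is divided by log(N), the training- and extrapolated-resolution values live on the same [0, 1] scale, which is what makes a closed-form realignment possible at all.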

What carries the argument

Stage-Aware RoPE Remapping, which pairs each dominant frequency band with a stage-specific remapping rule, together with Entropy-Driven Adaptive Attention Calibration, which derives a closed-form per-head scaling factor from normalized entropy to match extrapolated attention entropy to its training counterpart.
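
The band-wise remapping idea can be sketched as follows. The rule below is hypothetical, modeled on the NTK-by-parts scheme popularized by YaRN (interpolate low frequencies, preserve high ones, blend in between); the paper's stage-specific rules and band boundaries are not reproduced here.

```python
import numpy as np

def stage_aware_freqs(dim, scale, low_cut=0.25, high_cut=0.75, base=10000.0):
    """Hypothetical band-wise RoPE remapping: low-frequency components
    (global layout) are interpolated by the resolution scale, high-
    frequency components (fine detail) are left intact, and the mid
    band blends linearly between the two regimes."""
    idx = np.arange(0, dim, 2)
    inv_freq = base ** (-idx / dim)          # standard RoPE frequencies
    band = idx / dim                         # 0 = highest freq, 1 = lowest
    t = np.clip((band - low_cut) / (high_cut - low_cut), 0.0, 1.0)
    return inv_freq * (1.0 - t + t / scale)  # blend: identity -> 1/scale

# e.g. extrapolating a 256-token side length to 4x the resolution
freqs = stage_aware_freqs(dim=64, scale=4.0)
```

The design choice is that each band keeps the behavior its generation stage was trained with: detail-carrying high frequencies see unmodified positions, while layout-carrying low frequencies are compressed back into the training range.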

Load-bearing premise

The three failure modes are caused only by a band-stage mismatch in RoPE frequencies, and the proposed remapping rules plus entropy scaling will continue to work at new resolutions without any retraining or per-resolution tuning.

What would settle it

Generate images at a held-out resolution using the method and measure whether global and local repetition appear at substantially lower rates than under naive extrapolation baselines.
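
Such a count could start from a crude repetition probe, for example the largest off-zero peak of a slice's normalized autocorrelation. This is an illustrative metric only, not the paper's evaluation protocol.

```python
import numpy as np

def repetition_score(signal, min_lag=4):
    """Largest normalized autocorrelation peak beyond a minimum lag.
    Periodic (repeated) content scores near 1, aperiodic content
    near 0. Illustrative only."""
    x = np.asarray(signal, dtype=float)
    x = x - x.mean()
    ac = np.correlate(x, x, mode="full")[len(x) - 1:]
    ac = ac / (ac[0] + 1e-12)
    return float(ac[min_lag:].max())

# A tiled luminance slice (mimicking repetition) vs. aperiodic noise.
tiled = np.tile([0.0, 1.0, 0.5, 0.2], 64)
noise = np.random.default_rng(0).standard_normal(256)
```

A full check would run this over row and column slices of generated images and compare score distributions between the method and the baselines.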

Figures

Figures reproduced from arXiv: 2605.10045 by Feihong Yan, Haixuan Wang, Huiqi Li, Linfeng Zhang, Shaoyu Liu, Shuai Lu, Xiangyang Ji.

Figure 1. Intermediate outputs of VAR at different scale steps, with per-band values across scale ids 1-12 (columns sum to 1):

Band / Scale id          1    2    3    4    5    6    7    8    9   10   11   12
Very-Low (supportive) 0.31 0.30 0.27 0.26 0.23 0.21 0.21 0.20 0.18 0.16 0.15 0.15
Low (global)          0.29 0.39 0.44 0.43 0.41 0.37 0.34 0.31 0.27 0.24 0.22 0.21
Mid (local)           0.21 0.18 0.18 0.21 0.25 0.30 0.32 0.34 0.35 0.33 0.32 0.30
High (detail)         0.19 0.13 0.11 0.10 0.11 0.12 0.13 0.15 0.20 0.27 0.31 0.34
Figure 3. Very low-frequency RoPE components exhibit near-NoPE behavior. (a) Unperturbed image. (b) NoPE substitution on very low-frequency RoPE components preserves visual quality. (c) Q/K zeroing during the Layout Construction Stage disrupts concept formation. (d) Q/K zeroing during the Detail Refinement Stage blurs fine details.
Figure 5. Mid-frequency distortion causes local repetition in the Local Refinement Stage. (a) Unperturbed image with a detail crop at the typical element size. (b) Short-wavelength forcing to L/6 induces local repetition with smaller and more numerous compositional elements.
Figure 6. High-frequency components are critical for fine detail in the Detail Refinement Stage. (a) Unperturbed image with a detail crop showing sharp edges. (b) Corresponding attention map with a typical diagonal pattern. (c) Late-stage NoPE on these dimensions blurs edges and fine textures. (d) Corresponding attention map with leakage around the diagonal.
Figure 7. Qualitative comparison of ExtraVAR against representative resolution-extrapolation baselines.
Figure 8. Sensitivity analysis of hyperparameters kl and kh.
Original abstract

Visual Autoregressive (VAR) models have emerged as a strong alternative to diffusion for image synthesis, yet their fixed training resolution prevents direct generation at higher resolutions. Naively transferring training-free extrapolation methods from LLMs or diffusion models to VAR yields three characteristic failure modes: global repetition, local repetition, and detail degradation. We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band, and each failure mode emerges when the dominant band of a particular stage is disrupted. Building on this insight, we propose Stage-Aware RoPE Remapping, a training-free strategy that assigns each frequency band a stage-specific remapping rule, jointly suppressing all three failure modes. We further observe that attention becomes systematically dispersed as the image resolution increases. Existing methods typically depend on predefined attention scaling factors, which are neither adaptive to the target resolution nor capable of faithfully capturing the actual extent of attention dispersion. We therefore propose Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy and yields a closed-form per-head scaling factor that realigns the extrapolated-resolution attention entropy with its training-resolution counterpart. Extensive experiments show that our method consistently outperforms prior resolution-extrapolation methods in both structural coherence and fine-detail fidelity. Our code is available at https://github.com/feihongyan1/ExtraVAR.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that naive resolution extrapolation in Visual Autoregressive (VAR) models produces three failure modes—global repetition, local repetition, and detail degradation—due to a unified band-stage mismatch in RoPE frequencies across the coarse-to-fine stages. It proposes two training-free fixes: Stage-Aware RoPE Remapping, which assigns stage-specific remapping rules to frequency bands, and Entropy-Driven Adaptive Attention Calibration, which computes a resolution-invariant normalized entropy to derive a closed-form per-head scaling factor that restores training-resolution attention entropy. Experiments reportedly show consistent outperformance over prior extrapolation methods in structural coherence and fine-detail fidelity, with code released.

Significance. If the diagnosis and fixes hold, the work is significant for enabling practical high-resolution generation with VAR models without retraining or per-resolution tuning. Strengths include the training-free closed-form derivations, the unified treatment of multiple failure modes, and the public code release, which supports reproducibility. This addresses a key limitation in scale-wise autoregressive vision models as they compete with diffusion approaches.

major comments (2)
  1. [Introduction and §3] Diagnosis of failure modes: The claim that global repetition, local repetition, and detail degradation arise solely from a unified band-stage RoPE mismatch is load-bearing for the entire method. The manuscript traces the modes to dominant frequency bands per stage but provides no ablation or analysis ruling out independent causes such as stage-wise token count changes or unmodeled cross-stage dependencies. If these exist, the proposed remapping rules address only a subset of the problem.
  2. [§4.2] Entropy-Driven Adaptive Attention Calibration: The closed-form per-head scaling factor is derived from a resolution-invariant normalized entropy to realign attention dispersion. However, the derivation assumes entropy fully captures dispersion; it is unclear whether this accounts for resolution-dependent changes in attention patterns beyond entropy (e.g., head-specific or spatial variations). The paper should report a direct comparison of the predicted scaling factor against measured entropy at extrapolated resolutions to validate the formula.
minor comments (2)
  1. [Abstract] States 'extensive experiments' and 'consistent outperformance' but omits specific quantitative metrics, datasets, or extrapolation factors, making it hard to assess the strength of the empirical claims without the full tables.
  2. [Method] Notation: The description of 'normalized entropy' and 'stage-specific remapping rules' would benefit from an explicit equation or pseudocode early in the method section to clarify the closed-form nature before the experiments.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have carefully considered each major comment and provide point-by-point responses below. We believe these clarifications and proposed revisions will improve the paper's rigor and clarity.

read point-by-point responses
  1. Referee: [Introduction and §3] Diagnosis of failure modes: The claim that global repetition, local repetition, and detail degradation arise solely from a unified band-stage RoPE mismatch is load-bearing for the entire method. The manuscript traces the modes to dominant frequency bands per stage but provides no ablation or analysis ruling out independent causes such as stage-wise token count changes or unmodeled cross-stage dependencies. If these exist, the proposed remapping rules address only a subset of the problem.

    Authors: We appreciate the referee highlighting this important aspect. In §3, we provide a detailed diagnosis linking each failure mode to specific band-stage mismatches through frequency analysis and visualization of attention patterns. While we did not include an explicit ablation isolating RoPE effects from token count variations, our experiments demonstrate that applying the stage-aware remapping alone significantly mitigates all three failure modes, even as token counts change with resolution. This suggests the RoPE mismatch is the dominant factor. To address the concern, we will add a discussion and a targeted ablation in the revised manuscript to further rule out or quantify contributions from other potential causes. revision: partial

  2. Referee: [§4.2] Entropy-Driven Adaptive Attention Calibration: The closed-form per-head scaling factor is derived from a resolution-invariant normalized entropy to realign attention dispersion. However, the derivation assumes entropy fully captures dispersion; it is unclear whether this accounts for resolution-dependent changes in attention patterns beyond entropy (e.g., head-specific or spatial variations). The paper should report a direct comparison of the predicted scaling factor against measured entropy at extrapolated resolutions to validate the formula.

    Authors: Thank you for this valuable suggestion. The Entropy-Driven Adaptive Attention Calibration in §4.2 derives the scaling factor from the normalized entropy, which we show is approximately invariant across resolutions in our analysis. To validate the closed-form formula, we will include in the revised manuscript a direct comparison between the predicted scaling factors and the measured attention entropy at extrapolated resolutions for various heads and resolutions. This will confirm that the formula accurately restores the training-resolution entropy levels. revision: yes

Circularity Check

0 steps flagged

No circularity: explicit rules and closed-form factors derived from analysis, not fitted inputs or self-citations

full rationale

The paper's core claims rest on an observational diagnosis of failure modes traced to band-stage RoPE mismatch, followed by explicit stage-specific remapping rules and a closed-form entropy-based scaling factor. These are presented as training-free constructions that do not reduce to parameters fitted on target-resolution data or to quantities defined by the same constants used in evaluation. No equations equate a 'prediction' back to its own inputs by construction, and the central premise is not justified solely via self-citation chains. The method remains self-contained against external benchmarks, with the remapping and calibration steps adding independent content beyond renaming or ansatz smuggling.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the diagnostic that failure modes share a single band-stage mismatch cause and on the empirical effectiveness of the two proposed heuristics; no free parameters are introduced in the abstract description, no new physical entities are postulated, and background assumptions are standard transformer position encoding plus attention entropy as a dispersion measure.

axioms (1)
  • domain assumption: VAR models generate images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band.
    Invoked in the abstract to unify the three failure modes; this is a modeling assumption about the internal dynamics of VAR rather than a standard math fact.

pith-pipeline@v0.9.0 · 5581 in / 1380 out tokens · 25921 ms · 2026-05-12T02:58:11.309773+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

    Relation between the paper passage and the cited Recognition theorem:

    We trace them to a unified band-stage mismatch: VAR generates images in a coarse-to-fine, scale-wise process where each stage is driven by a distinct dominant RoPE frequency band... Stage-Aware RoPE Remapping... Entropy-Driven Adaptive Attention Calibration, which quantifies dispersion via a resolution-invariant normalized entropy

  • IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat · unclear

    Relation between the paper passage and the cited Recognition theorem:

    K=13 generation scale steps... kl=6 and kh=9... High/Mid and Mid/Low band boundaries

What do these tags mean?
  matches: The paper's claim is directly supported by a theorem in the formal canon.
  supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  uses: The paper appears to rely on the theorem as machinery.
  contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
