Hierarchical Image Tokenization for Multi-Scale Image Super Resolution

Adrian Bulat; Brais Martinez; Enrique Sanchez; Georgios Tzimiropoulos; Isma Hadji

arxiv: 2605.14891 · v1 · pith:OEFI7ZLYnew · submitted 2026-05-14 · 💻 cs.CV

Pith reviewed 2026-06-30 21:08 UTC · model grok-4.3

classification 💻 cs.CV

keywords image super-resolution · visual auto-regressive modeling · hierarchical tokenization · residual quantization · multi-scale generation · direct preference optimization

The pith

Hierarchical tokenization with scale overlap lets a 300M model do multi-scale super-resolution in one pass without extra data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a multi-scale image super-resolution method on visual auto-regressive models that already decompose images into additive residual scales. It adds Hierarchical Image Tokenization to force shared tokens across those scales and a simple preference term that pushes the model toward the high-resolution target using only low-high pairs. This combination supplies an inductive bias strong enough for a 300M-parameter transformer to reach state-of-the-art accuracy while producing all intermediate scales from a single forward pass and without any external training images.

Core claim

By replacing standard residual quantization with Hierarchical Image Tokenization that enforces token overlap across scales and adding a Direct Preference Optimization term on low-resolution versus high-resolution pairs, visual auto-regressive training for image super-resolution becomes flexible enough to deliver multi-scale outputs from one forward pass, reach state-of-the-art performance with only 300 million parameters, and require no external annotated data.

What carries the argument

Hierarchical Image Tokenization (HIT): a progressive tokenization procedure that represents an image at multiple scales while enforcing overlap between tokens at successive scales inside residual quantization.
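To make the additive-scale idea concrete, here is a minimal NumPy sketch of plain multi-scale residual quantization, the baseline that HIT modifies. It is not the paper's HIT procedure: the overlap constraint is not specified in the text available here, so it is not modelled. The helper names, the random codebook, the block-average and nearest-neighbour resizing, and the toy scale schedule are all assumptions chosen for illustration.

```python
import numpy as np

def downsample(x, p):
    """Block-average an (H, W, C) map to (p, p, C); assumes p divides H and W."""
    h, w, c = x.shape
    return x.reshape(p, h // p, p, w // p, c).mean(axis=(1, 3))

def upsample(x, size):
    """Nearest-neighbour upsample a (p, p, C) map to (size, size, C)."""
    k = size // x.shape[0]
    return np.repeat(np.repeat(x, k, axis=0), k, axis=1)

def quantize(x, codebook):
    """Replace each C-dim vector in x by its nearest codebook entry."""
    flat = x.reshape(-1, x.shape[-1])
    d2 = ((flat[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    idx = d2.argmin(axis=1)
    return codebook[idx].reshape(x.shape), idx.reshape(x.shape[:2])

def multiscale_rq(f, codebook, scales):
    """Quantize the running residual at each scale and accumulate the sum."""
    size = f.shape[0]
    f_hat = np.zeros_like(f)
    residual = f.copy()
    tokens, errors = [], []
    for p in scales:
        q, idx = quantize(downsample(residual, p), codebook)
        up = upsample(q, size)
        f_hat += up          # reconstruction is a sum of per-scale terms
        residual -= up       # the next scale only sees what is left over
        tokens.append(idx)
        errors.append(float(np.mean(residual ** 2)))
    return tokens, f_hat, residual, errors

rng = np.random.default_rng(0)
f = rng.normal(size=(16, 16, 4))                     # toy latent feature map
# Hypothetical codebook; the zero vector lets a scale "add nothing" if that is best.
codebook = np.vstack([np.zeros((1, 4)), rng.normal(size=(63, 4))])
tokens, f_hat, residual, errors = multiscale_rq(f, codebook, scales=(1, 2, 4, 8, 16))
print([t.shape for t in tokens])
print(errors)
```

In this sketch the partial sums over the first few scales carry no guarantee of resembling a downsampled version of the input, which is the limitation the paper attributes to standard residual quantization.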

If this is right

A single trained model produces consistent outputs at every intermediate scale instead of being locked to one fixed output resolution.
Model size can be reduced from 1 billion to 300 million parameters while still beating prior VAR-based super-resolution methods.
Training succeeds using only the low-resolution and high-resolution image pairs already present in ordinary super-resolution datasets.
The same forward pass yields all scales, removing the need for separate models or repeated inference.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The overlap constraint may transfer to other residual-quantization generative tasks such as image editing or video prediction where consistent structure across resolutions is useful.
Because the method avoids external data, it could be retrained quickly on domain-specific paired images such as medical or satellite imagery.
The single-pass multi-scale property might allow progressive refinement pipelines in which a downstream task inspects an intermediate scale before deciding whether to continue to higher resolution.

Load-bearing premise

Enforcing token overlap across scales will give the transformer enough inductive bias to reach state-of-the-art performance with only 300 million parameters and no external data.

What would settle it

Train an otherwise identical 300M-parameter VAR model on the same low-high pairs but without the HIT overlap constraint and measure whether it still matches the reported state-of-the-art PSNR or perceptual scores on standard multi-scale super-resolution benchmarks.
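The PSNR side of that comparison could be scored with something like the sketch below. It uses the standard PSNR definition; the function name, the 8-bit peak value, and the toy images are assumptions, and the paper's exact evaluation protocol (colour space, border cropping) is not given in the text available here.

```python
import numpy as np

def psnr(reference, estimate, max_value=255.0):
    """Peak signal-to-noise ratio in dB between two same-shaped images."""
    reference = np.asarray(reference, dtype=np.float64)
    estimate = np.asarray(estimate, dtype=np.float64)
    mse = np.mean((reference - estimate) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return float(10.0 * np.log10(max_value ** 2 / mse))

# Toy example: every pixel is off by exactly one grey level, so MSE = 1.
hr = np.full((8, 8, 3), 100, dtype=np.uint8)
sr = np.full((8, 8, 3), 101, dtype=np.uint8)
score = psnr(hr, sr)
print(round(score, 2))  # 48.13
```

Running the same function on the outputs of the ablated and the full model at each intermediate scale would give the per-scale numbers the test calls for.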

Figures

Figures reproduced from arXiv: 2605.14891 by Adrian Bulat, Brais Martinez, Enrique Sanchez, Georgios Tzimiropoulos, Isma Hadji.

**Figure 1.** We propose a VAR-based multi-scale SR method. (Left) Existing methods (Qu et al., 2025) can only target a single scale (e.g. ×4). (Right) Instead we take full advantage of the next-scale prediction paradigm by introducing Hierarchical Image Tokenization. Using this approach our model can progressively upscale an image while keeping semantic consistency. Top-to-bottom images correspond to output of the mod…

**Figure 2.** Top: we represent the quantization of an input image (up to L = 6 scales for clarity of visualization). We observe that reconstructing the image using all 6 residuals leads to perfect reconstruction. However, mapping the residuals up to the first three scales does not result in the desired 2×-downsampled version of the input image, i.e., the initial scales do not convey scale-wise semantic information of t…

**Figure 3.** Multi-scale SR evaluation of (a) VARSR, (b) …

**Figure 4.** Qualitative results: (Top) without and (Bottom) with our proposed DPO-based regularization. We can see the role of the regularization term in sharpening results.

Hierarchical RQVAE (H-RQVAE). The compression factor of the encoder is f = 1/16. To align with prior work on VAR, we use L = 10 steps, with resolutions ρ_l = (4, 6, 8, 10, 14, 16, 20, 24, 28, 32). Steps l = 1, ..., 3 are mapped to the 128 × 128 out…

**Figure 5.** Qualitative results. (a) Input LR (upsampled to target resolution); (b) Ground truth; (c) StableSR; (d) Resshift; (e) …

**Figure 6.** Qualitative results. (a) Input LR (upsampled to target resolution); (b) Ground truth; (c) StableSR; (d) Resshift; (e) …
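The H-RQVAE settings quoted in the Figure 4 text (L = 10 steps, the listed resolutions ρ_l, and a 1/16 compression factor) imply a fixed token budget per step. The short sketch below just does that arithmetic. The `pixels` column is ρ_l × 16, the side length a full ρ_l × ρ_l latent would decode to at that compression factor; the source confirms only that steps 1 to 3 (ρ = 8) map to 128 × 128, so the other sizes are inferred values, not the paper's stated mapping.

```python
PATCH_SIZES = (4, 6, 8, 10, 14, 16, 20, 24, 28, 32)  # rho_l as quoted in the text
DOWNSAMPLE = 16                                       # encoder compression f = 1/16

def schedule(patch_sizes=PATCH_SIZES, downsample=DOWNSAMPLE):
    """Tokens per step, running total, and implied pixel side length."""
    rows, total = [], 0
    for step, rho in enumerate(patch_sizes, start=1):
        total += rho * rho
        rows.append({
            "step": step,
            "rho": rho,
            "tokens": rho * rho,
            "cumulative_tokens": total,
            "pixels": rho * downsample,  # inferred, see the note above
        })
    return rows

rows = schedule()
for row in rows:
    print(row)
```

Under these assumptions the full 10-step sequence is 3,452 tokens, and the first three steps account for only 116 of them.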

Original abstract

We introduce a multi-scale Image Super Resolution (ISR) method building on recent advances in Visual Auto-Regressive (VAR) modeling. VAR models break image tokenization into additive, gradually increasing scales, using Residual Quantization (RQ), an approach that aligns perfectly with our target ISR task. Previous works taking advantage of this synergy suffer from two main shortcomings. First, due to the limitations in RQ, they only generate images at a predefined fixed scale, failing to map intermediate outputs to the corresponding image scales. They also rely on large backbones or a large corpus of annotated data to achieve better performance. To address both shortcomings, we introduce two novel components to the VAR training for ISR, aiming at increasing its flexibility and reducing its complexity. In particular, we introduce a) a Hierarchical Image Tokenization (HIT) approach that progressively represents images at different scales while enforcing token overlap across scales, and b) a Direct Preference Optimization (DPO) regularization term that, relying solely on the (LR, HR) pair, encourages the transformer to produce the latter over the former. Our proposed HIT acts as a strong inductive bias for the VAR training, resulting in a small model (300M params vs 1B params of VARSR), that achieves state-of-the-art results without external training data, and that delivers multi-scale outputs with a single forward pass.
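For readers unfamiliar with DPO, the sketch below shows the standard DPO loss with the HR token sequence as the preferred sample and the LR token sequence as the dispreferred one, which is how the abstract describes the preference direction. The abstract does not give the paper's exact regularizer, so the use of a frozen reference model, the value of `beta`, and the toy log-probabilities are all assumptions.

```python
import numpy as np

def dpo_loss(logp_w, logp_l, ref_logp_w, ref_logp_l, beta=0.1):
    """Standard DPO loss: -log sigmoid(beta * (policy margin - reference margin)).

    logp_w / logp_l are summed token log-probabilities of the preferred (HR)
    and dispreferred (LR) sequences under the model being trained; ref_* are
    the same quantities under a frozen reference model (an assumption here).
    """
    logp_w, logp_l = np.asarray(logp_w), np.asarray(logp_l)
    ref_logp_w, ref_logp_l = np.asarray(ref_logp_w), np.asarray(ref_logp_l)
    margin = beta * ((logp_w - ref_logp_w) - (logp_l - ref_logp_l))
    # -log(sigmoid(m)) == log(1 + exp(-m)), computed stably with logaddexp.
    return float(np.mean(np.logaddexp(0.0, -margin)))

# Toy batch of two (LR, HR) pairs with made-up sequence log-probabilities.
ref_hr, ref_lr = np.array([-120.0, -95.0]), np.array([-110.0, -90.0])
baseline = dpo_loss(ref_hr, ref_lr, ref_hr, ref_lr)        # policy == reference
improved = dpo_loss(ref_hr + 5.0, ref_lr, ref_hr, ref_lr)  # HR made more likely
print(baseline, improved)
```

When the policy equals the reference the loss is log 2, and it falls as the model assigns relatively more probability to the HR sequence than to the LR one.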

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

HIT adds overlap enforcement and DPO to VAR for multi-scale ISR, but the abstract gives no numbers or ablations to back the inductive-bias claim that lets the 300M model beat the 1B baseline.

The paper's main move is to add Hierarchical Image Tokenization (HIT) that forces token overlap across scales inside residual quantization, plus a DPO regularizer that only needs LR-HR pairs. They say this combination supplies enough inductive bias for a 300M model to hit SOTA multi-scale super-resolution in one forward pass without extra data, beating the prior 1B-param VARSR work.

What is actually new is the explicit overlap mechanism in the tokenization and the application of DPO in this setting. Earlier VARSR approaches stayed at fixed scales and leaned on bigger models or more data, so these two pieces target those limits directly.

The work does a clear job naming the shortcomings and sketching components meant to fix flexibility and size. The single-pass multi-scale output is a practical angle if it holds.

The soft spot is that the central claim depends on HIT's overlap delivering a strong enough bias to close the performance gap with a much smaller model, yet the abstract shows no quantitative results, no ablation isolating the overlap, and no description of how overlap is added without breaking the additive property of residual quantization. The DPO term is described as auxiliary, so the performance edge rides mostly on the untested bias argument. That concern from the stress-test note stands up on the given text.

This is for researchers working on autoregressive vision models who need variable-resolution outputs. A reader already following VARSR would get value from seeing whether the overlap idea actually moves the needle once the experiments are checked.

I would send it for peer review so the full results and controls can be examined. The idea is worth testing even if the abstract leaves the key assumption open.

Referee Report

1 major / 0 minor

Summary. The paper introduces Hierarchical Image Tokenization (HIT) and a Direct Preference Optimization (DPO) regularization term to adapt Visual Auto-Regressive (VAR) models with Residual Quantization (RQ) for multi-scale image super-resolution (ISR). It claims that enforcing token overlap across scales via HIT provides a strong inductive bias, enabling a 300M-parameter model to achieve state-of-the-art multi-scale ISR results without external training data and to produce outputs at multiple scales in a single forward pass, overcoming limitations of prior VAR-based methods that require larger backbones (e.g., 1B params), annotated data, or fixed-scale generation.

Significance. If validated by experiments, the result would be significant for efficient multi-scale ISR, as it shows how a targeted tokenization scheme can reduce model size by roughly a factor of three (from 1B to 300M parameters) while eliminating external data needs and enabling flexible scale outputs. The alignment of RQ's additive scales with ISR is a natural fit, and the single-pass multi-scale capability addresses a practical gap in existing approaches.

major comments (1)

[Abstract] Abstract: The central claim that HIT's enforced token overlap supplies a sufficiently strong inductive bias for 300M-param SOTA performance (vs. 1B-param VARSR) without external data is load-bearing for the contribution, yet the provided manuscript text contains no quantitative results, ablation studies isolating the overlap mechanism, or implementation details on preserving RQ's additive property. This prevents assessment of whether the inductive bias is adequate as asserted.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and the opportunity to clarify the presentation of our results. We address the single major comment below.

Referee: [Abstract] Abstract: The central claim that HIT's enforced token overlap supplies a sufficiently strong inductive bias for 300M-param SOTA performance (vs. 1B-param VARSR) without external data is load-bearing for the contribution, yet the provided manuscript text contains no quantitative results, ablation studies isolating the overlap mechanism, or implementation details on preserving RQ's additive property. This prevents assessment of whether the inductive bias is adequate as asserted.

Authors: We agree that the abstract, owing to length constraints, does not itself contain quantitative numbers, ablations, or implementation details. These elements appear in the full manuscript: quantitative comparisons establishing 300M-param SOTA performance versus the 1B-param baseline are reported in the experiments section and associated tables/figures; ablations that isolate the contribution of the enforced token overlap appear in the ablation study; and the method section details how the hierarchical tokenization is constructed so that RQ's additive residual property is preserved. To make the central claim easier to evaluate from the abstract alone, we will revise the abstract to incorporate concise quantitative highlights and explicit references to the supporting ablations and implementation choices. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on new components, not self-referential definitions or fits

The paper introduces HIT (enforcing token overlap across scales) and a DPO term as novel additions to VAR training for multi-scale ISR. The central claim—that these yield a 300M model with SOTA results and single-pass multi-scale output—is presented as an empirical outcome, not a quantity derived by construction from fitted parameters or prior self-citations. No equations appear in the abstract that equate a 'prediction' to an input fit, and no uniqueness theorems or ansatzes are smuggled via self-citation. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Abstract-only review limits visibility into free parameters or background assumptions; the central additions are the two new training components whose effectiveness is asserted without visible derivation.

axioms (1)

domain assumption: Residual Quantization aligns perfectly with the ISR task
Stated directly in the abstract as the basis for building on VAR models.

invented entities (1)

Hierarchical Image Tokenization (HIT) · no independent evidence
purpose: Progressively represent images at different scales while enforcing token overlap across scales
New component introduced to increase flexibility of VAR training for ISR.

Reference graph

Works this paper leans on

7 extracted references · 4 canonical work pages · 1 internal anchor

[1] Agustsson, E. and Timofte, R. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In IEEE Conference on Computer Vision and Pattern Recognition - Workshops, 2017.

[2] Chen, Z., Wu, Z., Zamfir, E., Zhang, K., Zhang, Y., Timofte, R., Yang, X., Yu, H., Wan, C., Hong, Y., et al. NTIRE 2024 challenge on image super-resolution (x4): Methods and results. In IEEE Conference on Computer Vision and Pattern Recognition, pp. 6108–6132, 2024.

[3] Liang, J., Cao, J., Sun, G., Zhang, K., Van Gool, L., and Timofte, R. SwinIR: Image restoration using Swin Transformer. arXiv preprint arXiv:2108.10257.

[4] Liu, D., Zhao, S., Zhuo, L., Lin, W., Xin, Y., Li, X., Qin, Q., Qiao, Y., Li, H., and Gao, P. Lumina-mGPT: Illuminate flexible photorealistic text-to-image generation with multimodal generative pretraining. arXiv preprint arXiv:2408.02657.

[5] Ma, X., Zhou, M., Liang, T., Bai, Y., Zhao, T., Chen, H., and Jin, Y. STAR: Scale-wise text-to-image generation via auto-regressive representations. arXiv preprint arXiv:2406.10797.

[6] Sun, P., Jiang, Y., Chen, S., Zhang, S., Peng, B., Luo, P., and Yuan, Z. Autoregressive model beats diffusion: Llama for scalable image generation. arXiv preprint arXiv:2406.06525.

[7] Timofte, R., Agustsson, E., Gool, L. V., Yang, M., and Zhang, L. NTIRE 2017 challenge on single image super-resolution: Methods and results. In IEEE Conference on Computer Vision and Pattern Recognition - Workshops, 2017.
