
arxiv: 2511.08704 · v2 · pith:D7DKMJDMnew · submitted 2025-11-11 · 💻 cs.CV · cs.LG

Rethinking Generative Image Pretraining: How Far Are We From Scaling Up Next-Pixel Prediction?

Pith reviewed 2026-05-21 18:12 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords: scaling laws, autoregressive models, next pixel prediction, image generation, transformer, compute scaling, vision models

The pith

Next-pixel prediction for images is primarily limited by compute rather than data, becoming feasible within five years.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper studies the scaling behavior of simple autoregressive models that predict images one pixel after another. Using transformer networks trained on 32 by 32 images under fixed compute budgets, it measures performance on prediction accuracy, classification, and generation quality. The results show that how to best scale model size versus data size changes depending on the end task, with generation needing more data growth. When considering higher resolutions, model size must increase faster than data. Overall, the work concludes that future progress hinges on more compute, which is growing rapidly enough to enable detailed pixel-level image modeling soon.
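To make "one pixel after another" concrete, the sketch below flattens a toy image in raster order and scores it with an autoregressive negative log-likelihood. The tokenization (one token per channel value, 256-way vocabulary) and the uniform stand-in predictor are assumptions made for illustration; the summary above does not say how the paper tokenizes pixels.

```python
import math

def raster_tokens(image):
    """Flatten an H x W x C image (nested lists of 0..255 ints) row by row."""
    return [value for row in image for pixel in row for value in pixel]

def next_pixel_nll(tokens, predict):
    """Mean negative log-likelihood in nats per token.

    `predict(prefix)` is a hypothetical callable that returns 256
    probabilities for the next token given all earlier tokens.
    """
    total = 0.0
    for i, token in enumerate(tokens):
        probs = predict(tokens[:i])
        total -= math.log(probs[token])
    return total / len(tokens)

def uniform_predictor(prefix):
    # Stand-in for a trained Transformer: ignores the prefix entirely.
    return [1.0 / 256] * 256

# 2 x 2 RGB toy image (assumed values, not data from the paper).
image = [[[0, 128, 255], [10, 20, 30]],
         [[5, 5, 5], [200, 100, 50]]]
tokens = raster_tokens(image)
loss = next_pixel_nll(tokens, uniform_predictor)

# Sequence length of one 32 x 32 RGB image under this assumed tokenization.
seq_len_32 = 32 * 32 * 3
```

The uniform predictor scores log(256) nats per token, which is the baseline any trained model has to beat. Under the assumed tokenization, sequence length grows with the square of the image side, which is one reason resolution interacts with compute.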

Core claim

The central discovery is that scaling next-pixel prediction requires task-specific strategies: at fixed low resolution, generation tasks demand data sizes that grow three to five times faster than for classification tasks, while higher resolutions favor much larger models over additional data. Projecting these trends reveals compute as the key bottleneck, with annual growth rates suggesting that practical pixel-by-pixel image modeling will be achievable in the next five years.

What carries the argument

The argument rests on IsoFlops profiles: families of Transformers are trained while total compute is held fixed, which reveals the compute-optimal split between model size and data size for next-pixel prediction objectives.
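A minimal sketch of how one IsoFLOP profile can be turned into an optimum, under stated assumptions: training cost is approximated as C ≈ 6·N·D (a rule of thumb from language-model scaling work, not confirmed here as the paper's accounting), the optimum is read off a parabola fitted to loss against log model size, and the loss values are synthetic. All function names are hypothetical.

```python
import math

def tokens_for_budget(flops, n_params):
    """Training tokens affordable at a fixed budget, assuming C ~= 6 * N * D."""
    return flops / (6.0 * n_params)

def fit_parabola(xs, ys):
    """Least-squares fit of y = a*x**2 + b*x + c; returns (a, b, c)."""
    s = [sum(x ** k for x in xs) for k in range(5)]
    t = [sum(y * x ** k for x, y in zip(xs, ys)) for k in range(3)]
    m = [[s[4], s[3], s[2], t[2]],
         [s[3], s[2], s[1], t[1]],
         [s[2], s[1], s[0], t[0]]]
    for i in range(3):  # Gauss-Jordan elimination with partial pivoting
        p = max(range(i, 3), key=lambda r: abs(m[r][i]))
        m[i], m[p] = m[p], m[i]
        for r in range(3):
            if r != i:
                f = m[r][i] / m[i][i]
                m[r] = [u - f * v for u, v in zip(m[r], m[i])]
    return tuple(m[i][3] / m[i][i] for i in range(3))

def isoflop_optimum(model_sizes, losses):
    """Model size at the vertex of a parabola fitted to loss vs log10(N)."""
    xs = [math.log10(n) for n in model_sizes]
    center = sum(xs) / len(xs)  # centering keeps the fit well conditioned
    a, b, _ = fit_parabola([x - center for x in xs], losses)
    return 10 ** (center - b / (2 * a))

budget = 7e19  # the largest compute budget quoted in the abstract
sizes = [1e7, 3e7, 1e8, 3e8, 1e9]  # hypothetical model sizes
tokens = [tokens_for_budget(budget, n) for n in sizes]
# Synthetic losses with a known minimum at N = 1e8, only to exercise the code.
toy_losses = [3.0 + 0.5 * (math.log10(n) - 8.0) ** 2 for n in sizes]
n_opt = isoflop_optimum(sizes, toy_losses)
d_opt = tokens_for_budget(budget, n_opt)
```

Repeating this at several budgets gives one (N_opt, D_opt) pair per budget, and those pairs are what a scaling law is then fitted to. Swapping the loss for classification accuracy or a generation score would give the task-specific optima the review describes.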

If this is right

  • Generation performance at 32x32 benefits from allocating more of the compute budget to data rather than model parameters compared to classification.
  • As resolution increases, optimal configurations shift toward larger models and relatively smaller datasets.
  • The dominant constraint on scaling autoregressive vision models is the available compute rather than data volume.
  • With compute budgets expanding four to five times each year, next-pixel prediction at high resolutions becomes realistic within five years.
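The last bullet rests on compound growth, which is easy to check by hand. The sketch below only works through the arithmetic of four to five times per year over five years; using the abstract's 7e19 FLOPs ceiling as the starting point is an assumption for illustration, not the paper's stated baseline.

```python
import math

def cumulative_growth(annual_factor, years):
    """Total multiplier after compounding `annual_factor` for `years` years."""
    return annual_factor ** years

def years_to_reach(target_factor, annual_factor):
    """Years of compounding needed to reach `target_factor`."""
    return math.log(target_factor) / math.log(annual_factor)

low = cumulative_growth(4, 5)    # 4 ** 5 = 1024
high = cumulative_growth(5, 5)   # 5 ** 5 = 3125

# Assumed starting point: the 7e19 FLOPs ceiling quoted in the abstract.
budget_low = 7e19 * low
budget_high = 7e19 * high
```

Five years at these rates multiplies available compute by roughly one thousand to three thousand, so the forecast amounts to a claim that about three orders of magnitude of extra compute is enough.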

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Unified models that handle both vision and language through next-token prediction may become more viable as compute scales.
  • Efforts in vision pretraining could focus more on efficient use of compute rather than amassing ever-larger image collections.
  • Similar scaling analyses could be applied to other sequential prediction tasks like video frame modeling.

Load-bearing premise

The scaling relationships between model size, data size, and performance measured at 32x32 resolution will hold without major changes when applied to higher resolutions and larger scales.

What would settle it

A direct test would be to measure the optimal scaling ratios at 128x128 or higher resolution and check whether they match the extrapolations from the 32x32 experiments, or to see if data efficiency improves or worsens differently than predicted.
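One way such a check could be scored is sketched below: fit a power law to low-resolution optima, extrapolate, and compare against a directly measured optimum. Every number is hypothetical and the function names are illustrative; the paper's actual extrapolation procedure is not described in this summary.

```python
import math

def fit_power_law(compute, optimum):
    """Fit optimum = k * compute**a by least squares in log-log space."""
    xs = [math.log(c) for c in compute]
    ys = [math.log(v) for v in optimum]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    a = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    k = math.exp(my - a * mx)
    return k, a

def relative_gap(predicted, measured):
    """Relative error of the extrapolated optimum against a measurement."""
    return abs(predicted - measured) / measured

# Hypothetical optimal model sizes from low-resolution sweeps (not paper data).
budgets = [1e17, 1e18, 1e19]
n_opt_low_res = [2.0 * c ** 0.4 for c in budgets]
k, a = fit_power_law(budgets, n_opt_low_res)

# Hypothetical directly measured optimum at a higher resolution.
measured_high_res = 3.0 * 1e19 ** 0.4
gap = relative_gap(k * 1e19 ** a, measured_high_res)
```

A small gap would support the load-bearing premise; a large or budget-dependent gap would mean the low-resolution exponents do not transfer.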

Figures

Figures reproduced from arXiv: 2511.08704 by Adams Wei Yu, Chen Liang, Lijun Yu, Quoc V. Le, Xinchen Yan, Yifeng Lu.

Figure 1: Key findings on the scaling properties of next-pixel prediction, based on training Transformers on 32 × 32 images. (a) Learning on raw pixels (blue line) requires a 10–20× higher optimal token-to-parameter ratio than learning on language tokens (yellow line). (b) The optimal scaling strategy varies: generation quality (Fréchet Distance, green) requires more training data optimally than classification (To…
Figure 2: Scaling properties prediction given image resolutions at 32 × 32: next-pixel prediction loss (see subfigures (a) and (d)), ImageNet classification accuracy (see subfigures (b) and (e)), and image completion-based Fréchet Distance (see subfigures (c) and (f)). We report the best-layer linear probing accuracy. We estimate Fréchet Distance between 2,048 reference images at 32 × 32 and corresponding 8,192 gener…
Figure 3: Qualitative examples at 32 × 32. The unmasked top image is provided as initialization and we auto-regressively predict the bottom half image one pixel at a time. Zoom in for a better view. Image recognition. We follow the process described in Sec. 3.5 where we report the best-layer linear probing accuracy as the robust measure for scaling study.
Figure 4: Optimal model / data scaling predictions vs. FLOPs across different image resolutions. We keep the ground-truth reference images at the native resolutions, and report the Fréchet Distance between 10,000 reference images and 10,000 generated images. It is crucial to note that while the Transformer architectures themselves remained the same (e.g., S-28M always had the same parameter count), the input seque…
Original abstract

This paper investigates the scaling properties of autoregressive next-pixel prediction, a simple, end-to-end yet under-explored framework for unified vision models. Starting with images at resolutions of 32x32, we train a family of Transformers using IsoFlops profiles across compute budgets up to 7e19 FLOPs and evaluate three distinct target metrics: next-pixel prediction objective, ImageNet classification accuracy, and generation-based completion measured by Fréchet Distance. First, optimal scaling strategy is critically task-dependent. At a fixed resolution of 32x32 alone, the optimal scaling properties for image classification and image generation diverge, where generation optimal setup requires the data size grow three to five times faster than for the classification optimal setup. Second, as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size. Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years.
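The generation metric named in the abstract compares Gaussian fits to two sets of image features. The sketch below is a simplified instance that assumes diagonal covariances, where the squared Fréchet distance reduces to a per-dimension sum of squared mean and standard-deviation differences. The feature vectors are hypothetical, and the paper's feature extractor and full-covariance computation are not given in this summary.

```python
import math
from statistics import fmean, pvariance

def diagonal_frechet_distance(feats_a, feats_b):
    """Squared Frechet distance between diagonal-covariance Gaussian fits.

    For diagonal covariances the general formula
    |mu_a - mu_b|^2 + Tr(S_a + S_b - 2 (S_a S_b)^(1/2))
    reduces to sum_d (mu_a[d] - mu_b[d])^2 + (sd_a[d] - sd_b[d])^2.
    """
    total = 0.0
    for d in range(len(feats_a[0])):
        col_a = [f[d] for f in feats_a]
        col_b = [f[d] for f in feats_b]
        mu_a, mu_b = fmean(col_a), fmean(col_b)
        sd_a = math.sqrt(pvariance(col_a, mu_a))
        sd_b = math.sqrt(pvariance(col_b, mu_b))
        total += (mu_a - mu_b) ** 2 + (sd_a - sd_b) ** 2
    return total

# Hypothetical 2-D feature vectors for reference and generated images.
reference = [[0.0, 1.0], [2.0, 3.0], [4.0, 5.0]]
generated = [[x + 3.0, y] for x, y in reference]  # first feature shifted by 3

same = diagonal_frechet_distance(reference, reference)
shifted = diagonal_frechet_distance(reference, generated)
```

Identical sets score zero, and shifting one feature by 3 without changing its spread adds 3 squared to the score. A full implementation would use dense covariance matrices and a matrix square root instead of the per-dimension shortcut.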

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript investigates scaling properties of autoregressive next-pixel prediction for unified vision models using Transformer architectures. Experiments begin at 32x32 resolution with IsoFLOP training profiles up to 7e19 FLOPs and evaluate three metrics: next-pixel prediction loss, ImageNet classification accuracy, and Fréchet Distance for generative completion. Key findings are that optimal scaling is task-dependent at fixed 32x32 (generation requires data to scale 3-5 times faster than classification), that higher resolutions favor faster growth in model size than data size, and that projecting these trends identifies compute (not data) as the dominant bottleneck, forecasting feasibility of pixel-by-pixel image modeling within five years given 4-5x annual compute growth.

Significance. If the reported scaling relationships and bottleneck identification generalize beyond the tested regime, the work would provide useful empirical guidance for scaling generative pretraining in vision and highlight the promise of simple next-pixel objectives. The breadth of compute budgets and multi-metric evaluation across prediction, classification, and generation strengthens the empirical contribution. The five-year forecast, if substantiated with additional validation, could shape research priorities toward compute-efficient scaling of autoregressive vision models.

major comments (2)
  1. Abstract: The five-year feasibility forecast and claim that compute is the primary bottleneck are obtained by projecting fitted scaling coefficients from the 32x32 IsoFLOP experiments; the manuscript provides neither the exact functional forms of the scaling fits, error bars on the measurements, nor validation of the extrapolation assumptions, leaving the central projection unverified.
  2. Abstract: The statement that 'as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size' is presented without scaling curves, exponent fits, or IsoFLOP tables for any resolution above 32x32; this absence is load-bearing for the compute-dominance diagnosis and the five-year timeline.
minor comments (2)
  1. The manuscript would benefit from explicit equations for the fitted scaling laws, inclusion of error bars on all reported scaling plots, and a clear statement of data exclusion criteria used in the fits.
  2. Clarify how the three-to-five-times faster data growth for generation versus classification at 32x32 was quantitatively derived from the IsoFLOP profiles.
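For orientation, the kind of explicit statement the minor comments ask for might look like the following. These are the generic compute-optimal power-law forms used in language-model scaling work, written here as an assumption; the manuscript's actual functional forms are not available in this review. The symbols k_N, k_D, a, and b are illustrative fit parameters, and C ≈ 6ND is an assumed cost model.

```latex
N_{\mathrm{opt}}(C) = k_N \, C^{a}, \qquad
D_{\mathrm{opt}}(C) = k_D \, C^{b}, \qquad
C \approx 6 \, N_{\mathrm{opt}}(C) \, D_{\mathrm{opt}}(C)
\;\Longrightarrow\; a + b = 1, \quad 6 \, k_N k_D = 1,
\qquad
\frac{D_{\mathrm{opt}}(C)}{N_{\mathrm{opt}}(C)} = \frac{k_D}{k_N} \, C^{\,b - a}.
```

Under these assumed forms the token-to-parameter ratio grows with compute when b > a and shrinks when b < a, so a statement that data must grow faster for one task than another would need to say whether it compares the exponents b, the ratios b/a, or the fitted optima at a given budget.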

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps clarify the presentation of our scaling projections and their empirical basis. We address the major comments point by point below, proposing targeted revisions to improve transparency while preserving the core contributions of the work.

Point-by-point responses
  1. Referee: Abstract: The five-year feasibility forecast and claim that compute is the primary bottleneck are obtained by projecting fitted scaling coefficients from the 32x32 IsoFLOP experiments; the manuscript provides neither the exact functional forms of the scaling fits, error bars on the measurements, nor validation of the extrapolation assumptions, leaving the central projection unverified.

    Authors: We agree that greater transparency is needed for the projection methodology. In the revised manuscript, we will add an appendix section that reports the exact functional forms of the fitted scaling laws (power-law relationships between optimal model size, data size, and compute budget derived from the IsoFLOP profiles), includes error bars obtained via bootstrap resampling across training runs, and provides a validation analysis that holds out intermediate compute budgets to assess extrapolation accuracy within the 32x32 regime. These additions will allow independent verification of the five-year forecast under the stated 4-5x annual compute growth assumption. revision: yes
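A rough sketch of what bootstrap error bars on a fitted scaling exponent could look like, assuming a percentile bootstrap over (compute, optimum) pairs. The runs, the jitter values, and the function names are hypothetical; the rebuttal does not specify the resampling unit or the interval type.

```python
import math
import random

def loglog_slope(points):
    """Least-squares slope of log(optimum) against log(compute)."""
    xs = [math.log(c) for c, _ in points]
    ys = [math.log(v) for _, v in points]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    return sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx

def bootstrap_interval(points, n_boot=1000, seed=0):
    """95% percentile interval for the exponent, resampling with replacement."""
    rng = random.Random(seed)
    slopes = []
    while len(slopes) < n_boot:
        sample = rng.choices(points, k=len(points))
        if len({c for c, _ in sample}) < 2:
            continue  # slope is undefined if every resampled run shares one budget
        slopes.append(loglog_slope(sample))
    slopes.sort()
    return slopes[int(0.025 * n_boot)], slopes[int(0.975 * n_boot) - 1]

# Hypothetical runs: (compute in FLOPs, fitted optimal model size). Not paper data.
jitter = [0.95, 1.05, 0.97, 1.03, 1.00, 0.96, 1.04, 0.99]
budgets = [1e16, 3e16, 1e17, 3e17, 1e18, 3e18, 1e19, 7e19]
runs = [(c, 2.0 * c ** 0.4 * j) for c, j in zip(budgets, jitter)]

point = loglog_slope(runs)
low, high = bootstrap_interval(runs)
```

The width of the interval matters more than the point estimate for the five-year forecast, because a small uncertainty in the exponent compounds over three orders of magnitude of extrapolated compute.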

  2. Referee: Abstract: The statement that 'as image resolution increases, the optimal scaling strategy indicates that the model size must grow much faster than data size' is presented without scaling curves, exponent fits, or IsoFLOP tables for any resolution above 32x32; this absence is load-bearing for the compute-dominance diagnosis and the five-year timeline.

    Authors: The indicated trend for higher resolutions is an extrapolation from the observed task-dependent optimal data-to-model ratios at 32x32, combined with established scaling principles that larger input dimensionality favors increased model capacity over data volume. We acknowledge that direct IsoFLOP curves or tables at resolutions above 32x32 are not included. In revision, we will expand the discussion to detail the extrapolation procedure, including any supporting assumptions or small-scale diagnostic runs, and rephrase the abstract claim to emphasize that it is a projected indication rather than a direct measurement. This will strengthen the grounding for the compute-bottleneck conclusion without requiring new large-scale experiments at higher resolutions. revision: partial

Circularity Check

1 step flagged

Five-year feasibility forecast and compute-bottleneck claim reduce to extrapolation of scaling coefficients fitted at 32x32

specific steps
  1. fitted input called prediction [Abstract]
    "Surprisingly, by projecting our findings, we discover that the primary bottleneck is compute rather than the amount of training data. As compute continues to grow four to five times annually, we forecast the feasibility of pixel-by-pixel modeling of images within the next five years."

    The scaling laws, optimal data-vs-model growth rates, and task-dependent behaviors are all measured and fitted on the 32x32 IsoFLOP runs. Labeling the direct extrapolation of those fitted coefficients as a 'discovery' of the bottleneck and a quantitative five-year forecast makes the claimed prediction equivalent to re-applying the same fitted model to new compute and resolution values.

full rationale

The paper trains models and fits scaling relationships exclusively at 32x32 resolution, then invokes those same fitted relationships to 'project' both the identification of compute as the dominant bottleneck and the five-year timeline for pixel-by-pixel modeling. Because the projection step simply extends the empirically tuned exponents and optimal data-vs-model ratios to unseen regimes (higher resolution and future compute budgets) without new measurements or external validation, the headline forecast is statistically forced by the input fits rather than independently derived.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central projections rest on empirical scaling laws fitted to 32x32 data; the abstract supplies no explicit list of fitted exponents or normalization constants, so the ledger remains minimal pending full text.

axioms (1)
  • domain assumption: Power-law scaling relationships observed at 32x32 continue to govern behavior at higher resolutions and larger compute budgets.
    Invoked when projecting the five-year feasibility timeline from current IsoFlops sweeps.

pith-pipeline@v0.9.0 · 5742 in / 1405 out tokens · 67548 ms · 2026-05-21T18:12:52.458300+00:00 · methodology


