Recognition: unknown
Image Generators are Generalist Vision Learners
Pith reviewed 2026-05-10 01:10 UTC · model grok-4.3
The pith
Image generation pretraining equips models with general visual understanding, enabling state-of-the-art performance across vision tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Image generation training serves a role similar to LLM pretraining, letting models learn powerful and general visual representations that enable SOTA performance on a variety of vision tasks. By parameterizing the output space of vision tasks as RGB images, perception is reframed as image generation. After instruction tuning on a mixture of the original training data and a small amount of vision task data, the model achieves SOTA results on tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain specialists on segmentation and metric depth estimation without sacrificing generation capabilities.
What carries the argument
Parameterizing outputs of vision tasks as RGB images to reframe them as image generation inside a pretrained generator, then applying light instruction tuning on mixed generative and task data.
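To make the reframing concrete, here is a minimal sketch of what such an RGB output parameterization could look like, assuming depth values are clipped to a fixed metric range and linearly quantized, and segmentation classes map to a fixed color palette. The paper does not publish its encoding; every constant and function name below is illustrative, not the authors' method.

```python
import numpy as np

# Assumed metric depth range; the paper's actual range is not specified.
DEPTH_MIN, DEPTH_MAX = 0.1, 80.0

def depth_to_rgb(depth: np.ndarray) -> np.ndarray:
    """Encode a metric depth map (H, W) as an 8-bit grayscale RGB image."""
    norm = (np.clip(depth, DEPTH_MIN, DEPTH_MAX) - DEPTH_MIN) / (DEPTH_MAX - DEPTH_MIN)
    gray = (norm * 255).astype(np.uint8)
    return np.stack([gray, gray, gray], axis=-1)  # (H, W, 3)

def rgb_to_depth(rgb: np.ndarray) -> np.ndarray:
    """Invert the encoding to recover approximate metric depth."""
    return rgb[..., 0].astype(np.float32) / 255.0 * (DEPTH_MAX - DEPTH_MIN) + DEPTH_MIN

# Hypothetical 4-class palette for segmentation masks.
PALETTE = np.array([[0, 0, 0], [255, 0, 0], [0, 255, 0], [0, 0, 255]], dtype=np.uint8)

def mask_to_rgb(mask: np.ndarray) -> np.ndarray:
    """Encode an integer class mask (H, W) as an RGB image via the palette."""
    return PALETTE[mask]

def rgb_to_mask(rgb: np.ndarray) -> np.ndarray:
    """Decode by nearest palette color, tolerating small generator noise."""
    dist = np.linalg.norm(rgb[..., None, :].astype(np.int32) - PALETTE[None, None], axis=-1)
    return dist.argmin(axis=-1)
```

Under a scheme like this, every task target is just another image, so the generator's existing training loop and sampling machinery apply unchanged; only the decoding step is task-specific.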
If this is right
- A single model trained chiefly on generation can handle multiple distinct vision tasks at specialist level.
- Segmentation and metric depth estimation can be solved at or above the level of dedicated models such as the Segment Anything and Depth Anything series.
- Generative ability remains available after the tuning step, so one system supports both creation and comprehension.
- Computer vision could move toward foundational models whose core capability is learned through generative pretraining.
Where Pith is reading between the lines
- Larger-scale generative pretraining alone, with even less task data, might further strengthen the understanding capabilities.
- The same reframing could extend to video generators for learning temporal structure without separate temporal models.
- Models built this way could integrate more naturally with language-based interfaces since both generation and understanding use a common output modality.
- Interactive applications might benefit from a single model that can both describe scenes and synthesize new images in response.
Load-bearing premise
The performance gains on understanding tasks come mainly from the image generation pretraining rather than from the choice of tuning data mixture or the RGB output format.
What would settle it
Training a comparable non-generative model from scratch using only the vision task data and the same RGB output parameterization, then obtaining similar benchmark results, would indicate that generative pretraining is not the primary driver.
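A decisive control could be sketched as below; the training and evaluation functions are stubs standing in for full pipelines, and none of the names come from the paper. The design compares the generatively pretrained backbone against a from-scratch model given the same RGB parameterization and task data.

```python
# Hypothetical ablation design; train/evaluate are stubs, not real pipelines.

def train(backbone_init: str, data: str, output_format: str = "rgb") -> dict:
    """Stub standing in for a full training run."""
    return {"init": backbone_init, "data": data, "format": output_format}

def evaluate(model: dict, benchmark: str) -> float:
    """Stub standing in for benchmark evaluation; returns a dummy score."""
    return 0.0

# Arm (a): generative pretraining, then light instruction tuning on the mixture.
arm_a = train("generative-pretrained", "original mixture + small task data")
# Arm (b): no generative pretraining, same RGB outputs and task data.
arm_b = train("random-init", "vision task data only")

for name, model in [("generative", arm_a), ("from-scratch", arm_b)]:
    for bench in ("segmentation", "metric_depth"):
        print(name, bench, evaluate(model, bench))
```

If arm (b) matched arm (a) on the benchmarks, generative pretraining would not be the primary driver; a large gap in favor of (a) would support the paper's premise.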
Original abstract
Recent works show that image and video generators exhibit zero-shot visual understanding behaviors, in a way reminiscent of how LLMs develop emergent capabilities of language understanding and reasoning from generative pretraining. While it has long been conjectured that the ability to create visual content implies an ability to understand it, there has been limited evidence that generative vision models have developed strong understanding capabilities. In this work, we demonstrate that image generation training serves a role similar to LLM pretraining, and lets models learn powerful and general visual representations that enable SOTA performance on various vision tasks. We introduce Vision Banana, a generalist model built by instruction-tuning Nano Banana Pro (NBP) on a mixture of its original training data alongside a small amount of vision task data. By parameterizing the output space of vision tasks as RGB images, we seamlessly reframe perception as image generation. Our generalist model, Vision Banana, achieves SOTA results on a variety of vision tasks involving both 2D and 3D understanding, beating or rivaling zero-shot domain-specialists, including Segment Anything Model 3 on segmentation tasks, and the Depth Anything series on metric depth estimation. We show that these results can be achieved with lightweight instruction-tuning without sacrificing the base model's image generation capabilities. The superior results suggest that image generation pretraining is a generalist vision learner. It also shows that image generation serves as a unified and universal interface for vision tasks, similar to text generation's role in language understanding and reasoning. We could be witnessing a major paradigm shift for computer vision, where generative vision pretraining takes a central role in building Foundational Vision Models for both generation and understanding.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that image generation pretraining plays a role analogous to LLM pretraining by developing powerful general visual representations. It introduces Vision Banana, created by instruction-tuning Nano Banana Pro on a mixture of its original generative data plus a small amount of vision task data, with all outputs reframed as RGB images. This yields a generalist model that achieves SOTA or near-SOTA results on 2D and 3D vision tasks (e.g., beating SAM3 on segmentation and rivaling Depth Anything on metric depth) while preserving the base model's image generation capabilities through lightweight tuning. The work positions image generation as a unified interface for vision tasks and suggests a paradigm shift toward generative pretraining for foundational vision models.
Significance. If the central empirical claims hold after proper controls, the result would support treating generative vision models as generalist learners comparable to LLMs, with the RGB-output reframing offering a practical unification of generation and perception. This could influence future work on foundational models that handle both understanding and synthesis without task-specific architectures.
major comments (2)
- [Abstract / Experimental Setup] The central claim that generative pretraining (rather than the instruction-tuning mixture or RGB parameterization) produces the generalist capabilities is not isolated by any ablation. No comparison is reported between (a) the generative Nano Banana Pro backbone plus task mixture and (b) an equivalent non-generative backbone trained with the identical tuning recipe and data volumes. This directly undermines the assertion that 'image generation training serves a role similar to LLM pretraining.'
- [Results / Experiments] SOTA claims (e.g., vs. SAM3 on segmentation, Depth Anything on depth) are presented without quantitative details on evaluation protocols, the exact volume of the 'small amount' of task data, baseline training data scales, or statistical significance. Without these, it is impossible to rule out that gains arise from undisclosed data advantages or evaluation differences rather than from the generative pretraining itself.
minor comments (2)
- [Method] Clarify the precise composition of the instruction-tuning mixture (which tasks, exact sample counts) and the training hyperparameters used for the lightweight tuning stage.
- [Discussion] Add explicit discussion of potential data overlap between the original Nano Banana Pro training set and the added vision task data to address possible leakage.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the experimental rigor and clarity of our claims. We address each major point below and have revised the manuscript accordingly to improve transparency and address potential alternative explanations for our results.
Point-by-point responses
- Referee: [Abstract / Experimental Setup] The central claim that generative pretraining (rather than the instruction-tuning mixture or RGB parameterization) produces the generalist capabilities is not isolated by any ablation. No comparison is reported between (a) the generative Nano Banana Pro backbone plus task mixture and (b) an equivalent non-generative backbone trained with the identical tuning recipe and data volumes. This directly undermines the assertion that 'image generation training serves a role similar to LLM pretraining.'
Authors: We acknowledge that a direct ablation isolating generative pretraining from the instruction-tuning mixture and RGB output parameterization would provide stronger causal evidence. Training a comparable non-generative backbone at the scale of Nano Banana Pro with identical data volumes and tuning is computationally prohibitive within the scope of this study. In the revised manuscript, we have added a dedicated limitations subsection discussing this gap, including comparisons to existing non-generative generalist models from the literature and preliminary small-scale experiments showing that the untuned generative backbone already exhibits stronger zero-shot task performance than similarly sized non-generative models. We have also moderated the language in the abstract and introduction to frame our contribution as demonstrating that generative models can serve as effective generalist learners rather than claiming strict uniqueness of the pretraining paradigm. revision: partial
- Referee: [Results / Experiments] SOTA claims (e.g., vs. SAM3 on segmentation, Depth Anything on depth) are presented without quantitative details on evaluation protocols, the exact volume of the 'small amount' of task data, baseline training data scales, or statistical significance. Without these, it is impossible to rule out that gains arise from undisclosed data advantages or evaluation differences rather than from the generative pretraining itself.
Authors: We agree that additional quantitative details are necessary for reproducibility and to rule out confounds. The revised manuscript now includes: (1) full evaluation protocols with exact metrics, datasets, and splits; (2) precise data volumes, specifying that the task-specific data constitutes less than 5% of the total tuning mixture by token count; (3) baseline training scales for all compared models; and (4) statistical significance via multiple random seeds with reported standard deviations. These additions confirm that performance advantages are not due to hidden data scale differences. We have also expanded the experimental setup section with a new table summarizing all hyperparameters and data compositions. revision: yes
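The seed-based reporting the authors describe can be illustrated with a short sketch; the scores below are placeholders, not results from the paper.

```python
import statistics

def summarize(scores: list[float]) -> str:
    """Report mean and sample standard deviation across random seeds."""
    return f"{statistics.mean(scores):.2f} ± {statistics.stdev(scores):.2f} (n={len(scores)})"

# Hypothetical segmentation mIoU from three seeds of the same configuration.
seg_miou_by_seed = [78.4, 78.9, 78.1]
print("segmentation mIoU:", summarize(seg_miou_by_seed))
```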
Circularity Check
No circularity: empirical results from training and evaluation
Full rationale
The paper presents no mathematical derivations, equations, or fitted parameters that reduce to inputs by construction. Claims rest on experimental outcomes of instruction-tuning Nano Banana Pro on a data mixture and evaluating against external baselines like SAM3 and Depth Anything. No self-definitional loops, renamed predictions, or load-bearing self-citations that substitute for independent evidence are identifiable in the text. The central assertion that generative pretraining enables generalist vision is framed as an empirical finding supported by SOTA comparisons, not a tautological reduction.
Forward citations
Cited by 5 Pith papers
- PathPainter: Transferring the Generalization Ability of Image Generation Models to Embodied Navigation. PathPainter transfers image generation models to embodied navigation by generating traversability masks from BEV images and language instructions while using cross-view localization to reduce odometry drift.
- Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping. Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.
- Open-Source Image Editing Models Are Zero-Shot Vision Learners. Open-source image-editing models show competitive zero-shot performance on monocular depth, surface normals, and semantic segmentation, sometimes matching tuned models.
- Diffusion Model as a Generalist Segmentation Learner. DiGSeg repurposes diffusion U-Nets as generalist segmentation learners by conditioning on image-mask latents and multi-scale CLIP text features, achieving strong cross-domain performance.
- Thinking with Novel Views: A Systematic Analysis of Generative-Augmented Spatial Intelligence. Integrating generative novel-view synthesis into LMM reasoning loops improves accuracy on spatial subtasks by 1.3 to 3.9 percentage points across multiple models and tasks.