GPIC: A Giant Permissive Image Corpus for Visual Generation

Jiajun Wu; Juan Carlos Niebles; Justin Johnson; Keshigeyan Chandrasegaran; Kyle Sargent; Li Fei-Fei; Michael Jang; Michael Poli; Suchir Agarwal

arxiv: 2605.30341 · v1 · pith:HLIAJPP3new · submitted 2026-05-28 · 💻 cs.CV · cs.AI

Pith reviewed 2026-06-29 07:34 UTC · model grok-4.3

keywords image dataset · visual generation · permissive license · generative modeling · flow matching · large-scale corpus · vision-language captions

The pith

GPIC introduces a 28-trillion-pixel image corpus with permissive licenses for visual generative modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GPIC as a large collection of diverse internet images totaling approximately 28 trillion pixels, each captioned by a state-of-the-art vision-language model. It supplies 100 million training examples plus validation and test splits, with all images under licenses permitting both research and commercial use. The corpus is safety-filtered, deduplicated, and hosted centrally on Hugging Face, accompanied by a benchmarking protocol and a pixel-space flow matching baseline. This setup targets the need for scalable, accessible datasets in visual generative modeling studies.

Core claim

GPIC is a Giant Permissive Image Corpus of approximately 28 trillion pixels comprising diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples, with all images permissively licensed for both research and commercial use, safety-filtered, deduplicated, and centrally hosted.

What carries the argument

The Giant Permissive Image Corpus (GPIC), a large-scale dataset of captioned images with permissive licensing that carries the argument for accessible training data.

If this is right

Enables training of visual generative models without licensing barriers for both research and commercial applications.
Supplies a standardized benchmarking protocol to compare generative modeling approaches on this corpus.
Includes a reference baseline using pixel-space flow matching for direct performance comparisons.
Provides a centrally hosted, deduplicated, and safety-filtered resource to reduce setup costs for large-scale experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Access to such a permissively licensed corpus could accelerate experimentation by removing data acquisition hurdles common in visual generation research.
The scale of 28 trillion pixels may support training regimes that reveal scaling behaviors not visible in smaller datasets.
Central hosting on a public platform could encourage community contributions of improved models or evaluations on the same data.

Load-bearing premise

Captions produced by the state-of-the-art vision-language model are accurate and detailed enough to support effective training of generative models.

What would settle it

Training multiple generative models on GPIC and measuring whether their output quality and diversity fall substantially below equivalent models trained on human-captioned datasets of similar scale.

Figures

Figures reproduced from arXiv: 2605.30341 by Jiajun Wu, Juan Carlos Niebles, Justin Johnson, Keshigeyan Chandrasegaran, Kyle Sargent, Li Fei-Fei, Michael Jang, Michael Poli, Suchir Agarwal.

**Figure 2.** GPIC dataset statistics. The figure shows GPIC’s image height and width distributions, license composition, caption statistics, release format, dataset splits, and benchmark scales. GPIC images have an average height of 479 pixels and an average width of 587 pixels. GPIC is centrally hosted on Hugging Face as 8,000 shards totaling 12.9 TB and released under the MIT license. GPIC-Lite (10M) and GPIC-Nano (1M…
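As a rough consistency check on the headline figure, the average dimensions quoted in this caption can be multiplied out. This is only a back-of-the-envelope sketch: it assumes the mean pixel count per image is close to the product of the mean height and the mean width, which need not hold exactly, and it counts only the 100M training images.

```python
# Back-of-the-envelope check of the "approximately 28 trillion pixels" figure.
# Assumption (not from the paper): mean pixels per image ~ mean height * mean width.
AVG_HEIGHT = 479                 # pixels, from the Figure 2 caption
AVG_WIDTH = 587                  # pixels, from the Figure 2 caption
NUM_TRAIN_IMAGES = 100_000_000   # 100M training examples

pixels_per_image = AVG_HEIGHT * AVG_WIDTH
total_pixels = pixels_per_image * NUM_TRAIN_IMAGES
total_trillions = total_pixels / 1e12

print(f"{pixels_per_image:,} pixels per image on average")
print(f"{total_trillions:.1f} trillion pixels in the training split")
```

Under that assumption the estimate lands close to the 28 trillion quoted in the abstract; the validation and test splits would add only about 1% more.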

**Figure 3.** Our dataset construction pipeline. We develop a four-stage pipeline to create GPIC. We source permissive images from Flickr and Wikimedia (Stage 1), filter low-quality and harmful images (Stage 2), deduplicate images using similarity scores derived from SSCD [23] copy detection features (Stage 3), and caption each image as one of tag, short, medium, or long (Stage 4). Qwen3-VL-4B-Instruct [19] is used for filterin…
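Stage 3 can be pictured as threshold-based near-duplicate removal over embedding vectors. The sketch below is a minimal illustration, not the authors' implementation: it uses random NumPy vectors as stand-ins for SSCD descriptors, a brute-force comparison that would not scale to 100M images, and a hypothetical helper named `greedy_dedup`. The 0.95 default echoes the θ value mentioned in the Figure 6 caption.

```python
import numpy as np

def greedy_dedup(features: np.ndarray, threshold: float = 0.95) -> list[int]:
    """Return indices of images kept after greedy near-duplicate removal.

    features: (n, d) array of descriptors (stand-ins for SSCD features).
    An image is dropped if its cosine similarity to an already-kept image
    is greater than or equal to `threshold`.
    """
    # L2-normalise so that a dot product equals cosine similarity.
    normed = features / np.linalg.norm(features, axis=1, keepdims=True)
    kept: list[int] = []
    for i in range(len(normed)):
        if kept:
            sims = normed[kept] @ normed[i]
            if sims.max() >= threshold:
                continue  # near-duplicate of something already kept
        kept.append(i)
    return kept

rng = np.random.default_rng(0)
base = rng.normal(size=(5, 64))
# Append a slightly perturbed copy of image 0 to act as a near-duplicate.
near_dup = base[0] + 0.01 * rng.normal(size=64)
feats = np.vstack([base, near_dup])

kept_indices = greedy_dedup(feats, threshold=0.95)
print(kept_indices)
```

Greedy removal keeps the first image of each near-duplicate group, so the result depends on input order; a production pipeline would more likely use approximate nearest-neighbor search than a full pairwise comparison.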

**Figure 4.** Example images that are filtered due to low resolution and poor visual quality. We apply a sequence of image-level filters to remove images unsuitable for training or benchmarking. First, we remove images with extreme resolutions or aspect ratios. Together, these filters remove approximately 0.01% of the source pool. We also discard images whose longest side is smaller than 256 pixels. Next, we apply VLM-…
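The size-based part of these filters is simple enough to sketch. In the example below, only the 256-pixel longest-side rule comes from the caption; the aspect-ratio cutoff `MAX_ASPECT_RATIO` and the function name `passes_size_filters` are hypothetical, since the excerpt does not give the actual thresholds.

```python
MIN_LONGEST_SIDE = 256   # from the caption: discard if the longest side is smaller
MAX_ASPECT_RATIO = 4.0   # hypothetical cutoff; the caption gives no value

def passes_size_filters(width: int, height: int) -> bool:
    """Return True if an image survives the resolution and aspect-ratio checks."""
    if width <= 0 or height <= 0:
        return False
    longest, shortest = max(width, height), min(width, height)
    if longest < MIN_LONGEST_SIDE:
        return False
    if longest / shortest > MAX_ASPECT_RATIO:
        return False
    return True

print(passes_size_filters(587, 479))    # an image of average GPIC dimensions
print(passes_size_filters(200, 150))    # longest side below 256 pixels
print(passes_size_filters(4000, 300))   # extreme aspect ratio
```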

**Figure 5.** Qualitative examples of similar image pairs across SSCD similarity ranges. Each group shows nearest-neighbor image pairs within the indicated SSCD similarity interval. At lower thresholds, similar pairs often contain visually related but distinct images, including changes in pose, viewpoint, or object identity. At higher thresholds, pairs increasingly correspond to near-duplicates, but visible differences …

**Figure 6.** Image collision models. SSCD-based duplicate removals follow a power-law trend across subset sizes and similarity thresholds. Extrapolating to the 110M-image source pool shows that θ = 0.95 is estimated to remove 9.62 × 10^6 images, leaving approximately 1.01 × 10^8 images. … choice strongly affects how many images are removed. We therefore build predictive collision models on smaller subsets before running …
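A power-law trend of this kind is usually fitted as a straight line in log-log space and then extrapolated. The sketch below shows that procedure on synthetic numbers; the data points and generating parameters are made up for illustration and do not reproduce the paper's measurements, and `fit_power_law` is a hypothetical helper.

```python
import numpy as np

def fit_power_law(sizes, removed):
    """Fit removed ~ a * sizes**b by least squares in log-log space."""
    # np.polyfit returns coefficients from highest degree down: slope, intercept.
    b, log_a = np.polyfit(np.log(sizes), np.log(removed), deg=1)
    return float(np.exp(log_a)), float(b)

# Synthetic measurements (NOT from the paper): duplicates removed at small subset sizes.
subset_sizes = np.array([1e5, 1e6, 1e7])
true_a, true_b = 1e-4, 1.4            # made-up generating parameters
removed_counts = true_a * subset_sizes ** true_b

a, b = fit_power_law(subset_sizes, removed_counts)
pool_size = 110e6                      # source pool size quoted in the caption
predicted_removed = a * pool_size ** b
print(f"a={a:.3g}, b={b:.3f}, predicted removals at 110M: {predicted_removed:.3g}")
```

Fitting on small subsets first is what lets a threshold be chosen before paying for a full-pool deduplication run.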

**Figure 7.** Captioning model selection. We evaluate Qwen3-VL-Instruct models on the GPIC captioning microbenchmark across five caption-quality criteria and throughput. Throughput in images per second (1×H100) is shown in parentheses below each model. Qwen3-VL-4B-Instruct provides the best quality-throughput tradeoff: it matches or approaches the best quality scores across short, medium, and long captions while maintai…

**Figure 8.** GPIC shard statistics. We show the per-shard distribution of image counts and caption-type percentages for GPIC-Full. GPIC-Full is shuffled into 8,000 approximately balanced shards, each containing ≈ 12,500 images and preserving the target caption mixture of 1% tag, 45% short, 45% medium, and 9% long captions. Benchmark scales. We divide the GPIC train set into three nested tiers: GPIC-Nano with 1M images, …
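The shard figures in this caption are mutually consistent, which a few lines of arithmetic can confirm. The sketch below treats the per-shard count as exactly 12,500 even though the caption says it is approximate, so the per-type counts are nominal targets rather than observed values.

```python
NUM_SHARDS = 8000
IMAGES_PER_SHARD = 12_500   # approximate per-shard count from the caption
CAPTION_MIX_PERCENT = {"tag": 1, "short": 45, "medium": 45, "long": 9}

total_images = NUM_SHARDS * IMAGES_PER_SHARD
# Nominal number of captions of each type in one shard (integer arithmetic).
per_shard_counts = {
    kind: IMAGES_PER_SHARD * pct // 100 for kind, pct in CAPTION_MIX_PERCENT.items()
}

print(total_images)
print(per_shard_counts)
```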

**Figure 9.** Comparison of FID and FD-DINOv2 on ImageNet-1K. ImageNet-1K FID is saturated: several models achieve lower FID than the distance between 50K held-out real ImageNet-1K images and the ImageNet-1K training set. By contrast, FD-DINOv2 remains unsaturated: all evaluated models have higher FD-DINOv2 than the corresponding held-out real-image distance, including models trained with DINOv2 features. Dotted lines i…
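Both metrics compared in this caption are Fréchet distances between Gaussians fitted to image features; they differ in the feature extractor (Inception features for FID, DINOv2 features for FD-DINOv2). The sketch below computes that distance with NumPy on random vectors standing in for real features, and also mimics the held-out real-image reference that the caption uses as a saturation floor. The function name and the synthetic data are illustrative only.

```python
import numpy as np

def frechet_distance(feats_a: np.ndarray, feats_b: np.ndarray) -> float:
    """Fréchet distance between Gaussians fitted to two (n, d) feature sets."""
    mu_a, mu_b = feats_a.mean(axis=0), feats_b.mean(axis=0)
    cov_a = np.cov(feats_a, rowvar=False)
    cov_b = np.cov(feats_b, rowvar=False)
    # Tr(sqrt(cov_a @ cov_b)) equals the sum of square roots of the eigenvalues
    # of the product, which are real and non-negative for PSD inputs.
    eigvals = np.linalg.eigvals(cov_a @ cov_b)
    trace_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()
    diff = mu_a - mu_b
    return float(diff @ diff + np.trace(cov_a) + np.trace(cov_b) - 2.0 * trace_sqrt)

rng = np.random.default_rng(0)
real = rng.normal(size=(2000, 8))
held_out = rng.normal(size=(2000, 8))           # same distribution as `real`
shifted = rng.normal(loc=0.5, size=(2000, 8))   # a deliberately different distribution

fd_same = frechet_distance(real, held_out)
fd_shifted = frechet_distance(real, shifted)
print(fd_same, fd_shifted)
```

The held-out distance is small but not zero because of finite sampling, which is why a model scoring below it suggests the metric has saturated.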

**Figure 10.** Pretraining loss for the JiT-T2I reference baseline [34] on GPIC-Full. We show training loss vs. iterations. The model is trained for one epoch on GPIC-Full (100M text-image pairs). Experiment Setup. We train JiT-T2I on GPIC-Full for one epoch at 256 × 256 resolution. The global batch size is 256. We use AdamW with learning rate 10^-4, betas 0.9 and 0.95, and no weight decay. We use a constant learning-r…
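The quoted setup fixes the length of the training run, which is worth making explicit. The sketch below collects the hyperparameters from the caption into a plain dictionary (the key names are illustrative, not taken from the authors' code) and derives the number of optimizer steps in one epoch.

```python
# Hyperparameters quoted in the Figure 10 caption; the key names are illustrative.
train_config = {
    "resolution": 256,
    "global_batch_size": 256,
    "optimizer": "AdamW",
    "learning_rate": 1e-4,
    "betas": (0.9, 0.95),
    "weight_decay": 0.0,
    "epochs": 1,
}
NUM_TRAIN_PAIRS = 100_000_000   # GPIC-Full text-image pairs

steps_per_epoch = NUM_TRAIN_PAIRS // train_config["global_batch_size"]
pixels_per_step = train_config["global_batch_size"] * train_config["resolution"] ** 2

print(steps_per_epoch)
print(pixels_per_step)
```

One epoch at batch size 256 therefore corresponds to roughly 390K iterations, which gives a scale for the iteration axis of the loss curve.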

**Figure 11.** JiT-T2I samples after training on GPIC-Full for one epoch. We show generated images for prompts in the held-out Test-50K subset. Each group contains a real test image, the corresponding text prompt, and JiT-T2I generations sampled with classifier-free guidance scales CFG = 1.75, 4.00, and 6.25, respectively. The examples span diverse object-centric and scene-level prompts, including animals, vehicles, nat…
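For readers unfamiliar with the guidance scales listed here, classifier-free guidance is commonly implemented by extrapolating from an unconditional prediction toward a text-conditioned one. The sketch below shows one widespread formulation with random arrays in place of model outputs; whether JiT-T2I uses exactly this convention is not stated in the excerpt, and `apply_cfg` is a hypothetical helper.

```python
import numpy as np

def apply_cfg(pred_cond: np.ndarray, pred_uncond: np.ndarray, scale: float) -> np.ndarray:
    """Combine predictions as uncond + scale * (cond - uncond)."""
    return pred_uncond + scale * (pred_cond - pred_uncond)

rng = np.random.default_rng(0)
cond = rng.normal(size=(3, 4, 4))     # stand-in for a text-conditioned model output
uncond = rng.normal(size=(3, 4, 4))   # stand-in for the unconditional output

for scale in (1.75, 4.00, 6.25):      # the scales listed in the caption
    guided = apply_cfg(cond, uncond, scale)
    # Mean absolute deviation from the purely conditional prediction.
    print(scale, float(np.abs(guided - cond).mean()))
```

In this convention a scale of 1 returns the conditional prediction unchanged, and larger scales push the output further away from the unconditional one.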

Original abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GPIC releases a 100M-image permissive dataset with VLM captions and a baseline, but supplies no checks on caption quality or effectiveness.

GPIC is a dataset release that gives 100 million training images under permissive licenses for research and commercial use, all captioned by a state-of-the-art VLM, with safety filtering, deduplication, and central hosting on Hugging Face. It also includes standard splits, a benchmarking protocol, and a pixel-space flow matching baseline.

The practical advance is the licensing. Previous large image collections often carried restrictions that blocked commercial work or created legal uncertainty. By prioritizing permissive licenses at this scale and making everything centrally available with an evaluation toolkit, the paper removes a real barrier for groups training visual generative models.

The soft spot is the captions. The work assumes the VLM outputs are accurate and detailed enough to support effective generative training, but it reports no human evaluations, no caption-to-image fidelity metrics, and no ablations comparing these captions to alternatives on downstream generation quality. The baseline exists, yet it does not test whether the captions themselves limit performance. That assumption remains unverified.

This paper is for researchers in visual generative modeling who need large, legally clear data for training or standardized benchmarks. Anyone planning scale experiments or comparisons will get direct use from the release.

It deserves peer review because the dataset itself is new at this scale with a concrete licensing contribution, even though the supporting analysis is thin. I would send it out rather than desk reject.

Referee Report

2 major / 1 minor

Summary. The paper introduces GPIC, a Giant Permissive Image Corpus comprising ~100M training images (plus 200K validation and 1M test examples) sourced from the internet and captioned by a state-of-the-art vision-language model, totaling approximately 28 trillion pixels. All images are asserted to be permissively licensed for research and commercial use; the corpus is safety-filtered, deduplicated, and centrally hosted on Hugging Face. A benchmarking protocol for generative modeling is provided along with a reference baseline using pixel-space flow matching. The dataset, benchmark, and models are released publicly.

Significance. If the licensing, filtering, and caption-quality claims are substantiated, GPIC would offer a valuable large-scale, openly accessible resource for scalable visual generative modeling research, addressing limitations of existing restricted datasets. The explicit public release of the full dataset, evaluation toolkit, and code on Hugging Face and the project site is a clear strength that supports reproducibility and community use.

major comments (2)

[Abstract] Abstract: The central utility claim—that GPIC enables effective training of generative models—rests on the unverified assumption that captions from the state-of-the-art VLM are sufficiently accurate and detailed; no human evaluation, image-caption fidelity metrics (e.g., CLIPScore or human preference studies), or downstream ablation comparing VLM captions to human captions is supplied.
[Abstract] Abstract and dataset description: No quantitative details or verification steps are given for licensing checks across the full 100M images, the effectiveness of the safety filter, or the deduplication procedure; these omissions are load-bearing for the permissiveness and safety assertions that distinguish GPIC.
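The image-caption fidelity metric named in the first comment can be sketched from precomputed embeddings. The example assumes the commonly used reference-free CLIPScore form, 2.5 × max(cos(image, caption), 0), and uses random vectors in place of real CLIP embeddings; `clip_score` is an illustrative helper, not a function from any library.

```python
import numpy as np

def clip_score(image_emb: np.ndarray, text_emb: np.ndarray, weight: float = 2.5) -> float:
    """Reference-free CLIPScore-style value: weight * max(cosine similarity, 0)."""
    cos = float(image_emb @ text_emb / (np.linalg.norm(image_emb) * np.linalg.norm(text_emb)))
    return weight * max(cos, 0.0)

rng = np.random.default_rng(0)
image_vec = rng.normal(size=512)
matching_caption = image_vec + 0.5 * rng.normal(size=512)   # correlated stand-in
unrelated_caption = rng.normal(size=512)                    # uncorrelated stand-in

score_match = clip_score(image_vec, matching_caption)
score_unrelated = clip_score(image_vec, unrelated_caption)
print(score_match, score_unrelated)
```

Averaging such a score over a sampled subset of image-caption pairs is one inexpensive way to report caption fidelity without a human study.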

minor comments (1)

[Abstract] The abstract states 'approximately 28 trillion pixels' without providing a per-split breakdown or total pixel count verification method.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Referee: [Abstract] Abstract: The central utility claim—that GPIC enables effective training of generative models—rests on the unverified assumption that captions from the state-of-the-art VLM are sufficiently accurate and detailed; no human evaluation, image-caption fidelity metrics (e.g., CLIPScore or human preference studies), or downstream ablation comparing VLM captions to human captions is supplied.

Authors: We agree that the manuscript would be strengthened by additional evidence on caption quality. The current version relies on a state-of-the-art VLM without including CLIPScore, human studies, or ablations against human captions. In revision we will add CLIPScore computed on a representative subset of images, qualitative caption examples, and an explicit discussion of this limitation. A full-scale human preference study or exhaustive ablation is not feasible at this corpus size, but the provided baseline training results demonstrate practical utility of the captions as-is.

revision: partial

Referee: [Abstract] Abstract and dataset description: No quantitative details or verification steps are given for licensing checks across the full 100M images, the effectiveness of the safety filter, or the deduplication procedure; these omissions are load-bearing for the permissiveness and safety assertions that distinguish GPIC.

Authors: We acknowledge that the abstract and high-level dataset description omit quantitative verification statistics. The full manuscript describes the overall pipeline, but we will expand the relevant section with concrete numbers: the fraction of images removed by the safety filter, the deduplication rate and method (e.g., perceptual hash or embedding similarity threshold), and the licensing verification approach (source-level permissive license filtering with automated metadata checks). These additions will directly support the permissiveness and safety claims.

revision: yes

Circularity Check

0 steps flagged

No circularity: direct dataset release with no derivations or predictions

full rationale

The paper is a dataset introduction paper that releases GPIC (100M captioned images, splits, safety filtering, and a hosted baseline). It contains no equations, no claimed predictions, no fitted parameters, and no derivation chain that could reduce to self-referential inputs. The central claim is the existence and permissiveness of the corpus itself, which is externally verifiable by download and inspection rather than by any internal reduction. Self-citations, if present, are not load-bearing for any result. This is the standard non-circular outcome for a data release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset curation and release paper containing no mathematical derivations, fitted parameters, background axioms, or postulated entities.

Reference graph

Works this paper leans on

95 extracted references · 21 canonical work pages · 14 internal anchors

[1] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10684–10695, 2022.

[3] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in neural information processing systems, 35:36479–36494, 2022.

[4] Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, et al. Qwen-image-vae-2.0 technical report. arXiv preprint arXiv:2605.13565.

[5] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.

[6] Google. Nano banana 2: Google’s latest ai image generation model. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/, February 2026. Accessed: 2026-05-24.

[7] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. CoRR, abs/1809.11096, 2018. URL http://arxiv.org/abs/1809.11096.

[8] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. CoRR, abs/1711.00937, 2017. URL http://arxiv.org/abs/1711.00937.

[9] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2021. URL https://arxiv.org/abs/2012.09841.

[10] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748.

[11] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders, 2025.

[12] Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion, 2025. URL https://arxiv.org/abs/2410.19324.

[13] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models, 2025. URL https://arxiv.org/abs/2501.01423.

[14] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction, 2024. URL https://arxiv.org/abs/2404.02905.

[15] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan... Datacomp: In search of the next generation of multimodal datasets. 2023.

[16] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: the new data in multimedia research. Communications of the ACM, 59(2):64–73, January 2016. ISSN 1557-7317. doi: 10.1145/2812802. URL http://dx.doi.org/10.1145/2812802.

[17] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: an open large-scale dataset for training next generation image-text models. ... 2022.

[18] Romain Beaumont. img2dataset: Easily turn large sets of image urls to an image dataset. https://github.com/rom1504/img2dataset, 2021.

[19] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-VL technical report. 2025. doi: 10.48550/arxiv.2511.21631.

[20] George Stein, Jesse Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L Caterini, Eric Taylor, and Gabriel Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Advances in Neural Information Processing Systems, volume 36, 2023.

[21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009.

[22] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128... doi: 10.1007/s11263-020-01316-z.

[23] Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14532–14542, 2022.

[24] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th symposium on operating systems principles, pages 611–626, 2023.

[25] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in neural information processing systems, 37:62557–62583, 2024.

[26] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018. URL https://arxiv.org/abs/1706.08500.

[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015. URL https://arxiv.org/abs/1409.4842.

[28] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Neural Information Processing Systems, 2019. URL https://api.semanticscholar.org/CorpusID:118648975.

[29] Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lučić, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. arXiv, abs/1806.00035, 2018. URL https://api.semanticscholar.org/CorpusID:44104089.

[30] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, 2020. URL https://api.semanticscholar.org/CorpusID:211259260.

[31] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie... 2025.

[32] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343.

[33] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025.

[34] Zehong Ma, Ruihan Xu, and Shiliang Zhang. Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss. arXiv preprint arXiv:2602.02493, 2026.

[35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

[36] Alex Clark. Pillow (pil fork) documentation, 2015. URL https://buildmedia.readthedocs.org/media/pdf/pillow/latest/pillow.pdf.

[37] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022.

Appendix B excerpts: evaluation

B.1 Construction of Imagenet-256 and GPIC-256

- Center crop along the longer edge to form a square image.
- Bicubic downsampling to 256×256 from the Pillow library [36]. We note that popular Python image libraries use different bicubic interpolation kernels. Our choice of Pillow is consistent with prior work [20, 37].

B.2 Additional Oracle Reference Metrics

| GPIC Subset | FD↓ | Precision↑ | Recall↑ | Density↑ | Coverage↑ |
| --- | --- | --- | --- | --- | --- |
| Full | 0.07 | 0.757 | 0.762 | 0.973 | 0.966 |
| Lite | 0.07 | 0.762 | 0.768... | | |

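The two preprocessing steps quoted above (center crop along the longer edge, then bicubic downsampling to 256×256 with Pillow) can be sketched as follows. This is a minimal illustration under the stated description, not the authors' evaluation code; the function name `center_crop_and_resize` and the synthetic test image are assumptions.

```python
from PIL import Image

def center_crop_and_resize(img: Image.Image, size: int = 256) -> Image.Image:
    """Center-crop the longer edge to a square, then bicubic-resize to size x size."""
    width, height = img.size
    side = min(width, height)
    left = (width - side) // 2
    top = (height - side) // 2
    square = img.crop((left, top, left + side, top + side))
    return square.resize((size, size), resample=Image.BICUBIC)

# Synthetic image with the average GPIC dimensions (587 wide, 479 tall).
example = Image.new("RGB", (587, 479), color=(120, 60, 200))
processed = center_crop_and_resize(example)
print(processed.size)
```

Because bicubic kernels differ between image libraries, swapping Pillow for another resizer could change the computed metrics slightly, which is presumably why the text pins the library.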
[40]

MAIN SUBJECTS: - Include the 1–3 most important people, animals, objects, or landmarks
[41]

red car",

ATTRIBUTES: - Include visible attributes (color, texture, condition) (example: "red car", "snowy road", "bare trees")
[42]

forest",

SETTING: - Include 1–2 coarse environment tags (example: "forest", "street", "kitchen")
[43]

running",

ACTION (RARE): - Only include if extremely obvious and short (e.g., "running", "sitting"). - Prefer nouns over verbs. COMPRESSION RULE: - Do NOT try to describe everything. - Include only key visual elements. - Missing details are acceptable. COUNTING: - Avoid exact numbers unless extremely obvious. - Prefer plural forms (e.g., "trees"). TEXT IN IMAGE: - ...
[44]

Output ONLY the tag list
[45]

No sentences, no explanations
[46]

No punctuation except commas
[47]

snowy road, forest, bare trees, winter, cloudy

All lowercase. EXAMPLES (style reference only): - "snowy road, forest, bare trees, winter, cloudy" - "white cat, sunlight, cozy" - "city street, night, race car, neon lights" USER MESSAGE Write a keyword-style caption (tag-style) for the image shown. Figure F.1: Prompt used to generate tag-styled captions for GPIC images. 21 VLM Captioning: Short MAIN INS...
[48]

MAIN SUBJECTS: - Mention the 1–3 most important people/animals/objects/landmarks
[49]

red tie",

SIMPLE DETAILS: - Add 1–3 simple visible details that help identify them (example: "red tie", "blue bottle", "white car")
[50]

street",

SETTING: - Add ONE short setting word (example: "street", "park", "kitchen")
[51]

walking",

MAIN ACTION (ONLY if clearly visible): - Use a simple verb (example: "walking", "sitting", "holding hands"). - If the action is not clearly visible, DO NOT include action(s). COUNTING (STRICT): - Use an exact number ONLY if it is very easy and unambiguous to count. - If an exact number is unclear, do NOT guess. - You may use "several" or "a group of" only...
[52]

Output 1–2 sentences (max 2)
[53]

Start immediately with the main subjects (no meta phrases)
[55]

photo",

Do NOT mention the words "photo", "image" or "picture"
[56]

Two cyclists ride on a paved road

Use neutral, literal language. LENGTH (STRICT): - Aim for ~12–25 words total. - Keep sentences short and easy to read. EXAMPLES (style reference only): - "Two cyclists ride on a paved road." - "A white cat lies on a bed near a window." - "A bowl of noodles sits on a table with chopsticks." USER MESSAGE Write a short caption (1–2 sentences) for the image s...
- Start immediately with the main visible entity/entities (no meta phrases). Example starts: "Two cyclists ...", "Close-up of ...", "Passengers ...", "A street ..."
- Include the following (when clearly visible):
  - main objects/entities (people/animals/vehicles/objects/structures)
  - key visible attributes (color/material/clothing/object type)
  - scene context (indoor/outdoor + setting such as street/room/park/store/stadium)
  - grounded spatial layout (foreground/background/left/right/next to/in front of)
- Count entities ONLY when clearly countable. If an exact number is unclear, do NOT guess. You may use "several" or "a group of" only when clearly correct
- Describe actions/poses ONLY when directly supported visually. Otherwise describe a static configuration
- Do NOT write any text from the image. Exception: include visible text ONLY if it is large, clearly readable, and necessary to identify the main subject or scene.

EDGE CASES:
- If the image is blank OR the main content is not visible/understandable (for example, all black/white, too blurry, too dark, overexposed, or corrupted), output exactly: "NOT VISIBLE...

- Output TWO sentences by default
- Use THREE sentences ONLY when absolutely required to identify the scene clearly
- Aim for ~25–60 words total
- Do NOT mention the words "image", "photo", or "picture"
- Use neutral, literal language
- Be informative but do NOT attempt exhaustive object listing.

EXAMPLES (style reference only):
- "Two cyclists ride on a paved road with dashed lane markings. An orange barrier lines the left side, with several people standing behind it on the sidewalk. Trees and buildings appear in the background."
- "Close-up of a white cat lying on a bed near a window. Soft daylight falls across the blanket, and a curtain is visible along the edge of the frame...
OBJECTS / ENTITIES (nodes)
1.1) Identify the main visible entities in the scene (people, animals, vehicles, objects, structures).
1.2) Cover important secondary elements, but do NOT attempt to list every small background object.
1.3) Prefer describing entities in a grounded order such as foreground → background when possible.
1.4) If multiple similar enti...

ATTRIBUTES (visible-only)
2.1) Describe visible attributes only when clearly observable: color, size, shape, material, texture, patterns.
2.2) Do NOT guess precise brands, logos, or fine details unless the text/marking is clearly readable

POSE + ACTIONS (confidence-gated)
3.1) Describe poses (standing, sitting, leaning, arms extended, head direction) and actions (riding, walking, holding) ONLY when directly supported by clearly visible body position and/or physical contact with an object.
3.2) If a pose or action is not clearly verifiable, do NOT infer it. Instead describe what the body lo...

RELATIONS / LAYOUT
4.1) Describe spatial layout using grounded relationships such as: left, right, top, bottom.
4.2) Do NOT overstate alignment or formation (e.g., do not say “side-by-side” unless clearly true).

TEXT (OCR) REQUIREMENT:
- If any text is visible anywhere (signs, labels, screens, posters, documents, packaging, subtitles, watermarks, UI elements, logos with words, etc.), you MUST try to transcribe it
- Reproduce visible text exactly as written, preserving casing, punctuation, numbers, symbols, and spelling
- Only transcribe text that is clearly legible
- If text is present but not fully readable, do NOT guess; simply say that text is present
- When including OCR text, place it naturally into the caption (prefer Sentences 3–5), so the caption remains coherent and readable.

EDGE CASES:
- If the image is blank OR the main content is not visible/understandable (for example, all black/white, too blurry, too dark, overexposed, or corrupted), output exactly: "NOT VISIBLE."

OUTPUT RULES (STRICT):
- Produce exactly 5–7 sentences
- Use this sentence structure (STRICT):
  - Sentences 1–3: main subjects + key attributes + main actions + core setting
  - Sentences 4–6: layout + secondary elements + background context (include OCR here when possible)
  - Sentence 7 (optional): extra fine details that help reconstruction
- The sentences must be information-dense rather than brief
- Do NOT use bullet points, lists, headings, or JSON
- Do NOT include disclaimers or meta commentary
Showing first 80 references.

[1] Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025.

[2] Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

[3] Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022.

[4] Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, et al. Qwen-Image-VAE-2.0 technical report. arXiv preprint arXiv:2605.13565.

[5] Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators.

[6] Google. Nano banana 2: Google’s latest ai image generation model. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/, February 2026. Accessed: 2026-05-24.

[7] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. CoRR, abs/1809.11096, 2018. URL http://arxiv.org/abs/1809.11096.

[8] Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. CoRR, abs/1711.00937, 2017. URL http://arxiv.org/abs/1711.00937.

[9] Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2021. URL https://arxiv.org/abs/2012.09841.

[10] William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748.

[11] Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders, 2025.

[12] Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion, 2025. URL https://arxiv.org/abs/2410.19324.

[13] Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models, 2025. URL https://arxiv.org/abs/2501.01423.

[14] Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction, 2024. URL https://arxiv.org/abs/2404.02905.

[15] Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan... Datacomp: In search of the next generation of multimodal datasets, 2023.

[16] Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. Yfcc100m: the new data in multimedia research. Communications of the ACM, 59(2):64–73, January 2016. ISSN 1557-7317. doi: 10.1145/2812802. URL http://dx.doi.org/10.1145/2812802.

[17] Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: an open large-scale dataset for training next generation image-text models. ... 2022.

[18] Romain Beaumont. img2dataset: Easily turn large sets of image urls to an image dataset. https://github.com/rom1504/img2dataset, 2021.

[19] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-VL technical report. doi:10.48550/arxiv.2511.21631, 2025.

[20] George Stein, Jesse Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L Caterini, Eric Taylor, and Gabriel Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Advances in Neural Information Processing Systems, volume 36, 2023.

[21] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.

[22] Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale. International Journal of Computer Vision, 128... doi:10.1007/s11263-020-01316-z.

[23] Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14532–14542, 2022.

[24] Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023.

[25] Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024.

[26] Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. Gans trained by a two time-scale update rule converge to a local nash equilibrium, 2018. URL https://arxiv.org/abs/1706.08500.

[27] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015. URL https://arxiv.org/abs/1409.4842.

[28] Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Neural Information Processing Systems, 2019. URL https://api.semanticscholar.org/CorpusID:118648975.

[29] Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lučić, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. arXiv, abs/1806.00035, 2018. URL https://api.semanticscholar.org/CorpusID:44104089.

[30] Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, 2020. URL https://api.semanticscholar.org/CorpusID:211259260.

[31] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie... arXiv, 2025.

[32] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343.

[33] Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025.

[34] Zehong Ma, Ruihan Xu, and Shiliang Zhang. Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss. arXiv preprint arXiv:2602.02493, 2026.

[35] An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388.

[36] Alex Clark. Pillow (pil fork) documentation, 2015. URL https://buildmedia.readthedocs.org/media/pdf/pillow/latest/pillow.pdf.

[37] Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022.

Appendix Contents

A Additional Image-Text examples from GPIC
B Evaluation
B.1 Construction of Imagenet-256 and GPIC-256 ...

Center crop along the longer edge to form a square image.

Bicubic downsampling to 256×256 from the Pillow library [36]. We note that popular Python image libraries use different bicubic interpolation kernels. Our choice of Pillow is consistent with prior work [20, 37].

B.2 Additional Oracle Reference Metrics

GPIC Subset   FD↓    Precision↑   Recall↑   Density↑   Coverage↑
Full          0.07   0.757        0.762     0.973      0.966
Lite          0.07   0.762        0.768...
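The two resizing steps listed above (center crop along the longer edge, then bicubic downsampling to 256×256 with Pillow) can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released preprocessing code: the function names `center_square_box` and `center_crop_and_resize`, the conversion to RGB, and flooring the crop offset when the margin is odd are all choices made here for the example.

```python
from PIL import Image

# Pillow >= 9.1 groups filters under Image.Resampling; older releases keep them on Image.
_BICUBIC = getattr(Image, "Resampling", Image).BICUBIC


def center_square_box(width: int, height: int) -> tuple:
    """Return the (left, top, right, bottom) box of the centered square crop."""
    side = min(width, height)
    left = (width - side) // 2   # assumption: floor the offset when the margin is odd
    top = (height - side) // 2
    return (left, top, left + side, top + side)


def center_crop_and_resize(img: Image.Image, size: int = 256) -> Image.Image:
    """Crop the longer edge to a centered square, then resize with Pillow's bicubic filter."""
    box = center_square_box(*img.size)
    # assumption: convert to RGB so every output has the same mode
    return img.convert("RGB").crop(box).resize((size, size), resample=_BICUBIC)
```

Because bicubic kernels differ between Python image libraries, swapping Pillow for another library in this sketch would likely change pixel values slightly, which is the consistency concern the text raises. Inputs smaller than the target size would be upsampled by this function, a case the text does not discuss.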

MAIN SUBJECTS:
- Include the 1–3 most important people, animals, objects, or landmarks

ATTRIBUTES:
- Include visible attributes (color, texture, condition) (example: "red car", "snowy road", "bare trees")

SETTING:
- Include 1–2 coarse environment tags (example: "forest", "street", "kitchen")
