
arxiv: 2605.30341 · v1 · pith:HLIAJPP3new · submitted 2026-05-28 · 💻 cs.CV · cs.AI

GPIC: A Giant Permissive Image Corpus for Visual Generation

Pith reviewed 2026-06-29 07:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords image dataset · visual generation · permissive license · generative modeling · flow matching · large-scale corpus · vision-language captions

The pith

GPIC introduces a 28-trillion-pixel image corpus with permissive licenses for visual generative modeling.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents GPIC as a large collection of diverse internet images totaling approximately 28 trillion pixels, each captioned by a state-of-the-art vision-language model. It supplies 100 million training examples plus validation and test splits, with all images under licenses permitting both research and commercial use. The corpus is safety-filtered, deduplicated, and hosted centrally on Hugging Face, accompanied by a benchmarking protocol and a pixel-space flow matching baseline. This setup targets the need for scalable, accessible datasets in visual generative modeling studies.

Core claim

GPIC is a Giant Permissive Image Corpus of approximately 28 trillion pixels comprising diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples, with all images permissively licensed for both research and commercial use, safety-filtered, deduplicated, and centrally hosted.

What carries the argument

The Giant Permissive Image Corpus (GPIC) itself: a large-scale dataset of captioned, permissively licensed images, whose release is what supports the case for accessible training data.

If this is right

  • Enables training of visual generative models without licensing barriers for both research and commercial applications.
  • Supplies a standardized benchmarking protocol to compare generative modeling approaches on this corpus.
  • Includes a reference baseline using pixel-space flow matching for direct performance comparisons.
  • Provides a centrally hosted, deduplicated, and safety-filtered resource to reduce setup costs for large-scale experiments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Access to such a permissively licensed corpus could accelerate experimentation by removing data acquisition hurdles common in visual generation research.
  • The scale of 28 trillion pixels may support training regimes that reveal scaling behaviors not visible in smaller datasets.
  • Central hosting on a public platform could encourage community contributions of improved models or evaluations on the same data.

Load-bearing premise

Captions produced by the state-of-the-art vision-language model are accurate and detailed enough to support effective training of generative models.

What would settle it

Training multiple generative models on GPIC and measuring whether their output quality and diversity fall substantially below equivalent models trained on human-captioned datasets of similar scale.

Figures

Figures reproduced from arXiv: 2605.30341 by Jiajun Wu, Juan Carlos Niebles, Justin Johnson, Keshigeyan Chandrasegaran, Kyle Sargent, Li Fei-Fei, Michael Jang, Michael Poli, Suchir Agarwal.

Figure 1: Example image-caption pairs from GPIC. Additional samples are shown in Figure …
Figure 2: GPIC dataset statistics. The figure shows GPIC’s image height and width distributions, license composition, caption statistics, release format, dataset splits, and benchmark scales. GPIC images have an average height of 479 pixels and an average width of 587 pixels. GPIC is centrally hosted on Hugging Face as 8,000 shards totaling 12.9TB and released under the MIT license. GPIC-Lite (10M) and GPIC-Nano (1M…
Figure 3: Our dataset construction pipeline. We develop a four-stage pipeline to create GPIC. We source permissive images from Flickr and Wikimedia (Stage 1), filter low-quality and harmful images (Stage 2), deduplicate images using similarity scores derived from SSCD [23] copy detection features (Stage 3), and caption into one of tag, short, medium, or long (Stage 4). Qwen-3-VL-4B-Instruct [19] is used for filterin…
Figure 4: Example images that are filtered due to low resolution and poor visual quality. We apply a sequence of image-level filters to remove images unsuitable for training or benchmarking. First, we remove images with extreme resolutions or aspect ratios. Together, these filters remove approximately 0.01% of the source pool. We also discard images whose longest side is smaller than 256 pixels. Next, we apply VLM-…
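The size and aspect-ratio filters described in this caption can be sketched as a single predicate. This is a minimal illustration, not the authors' code: the 256-pixel minimum for the longest side comes from the caption, while the aspect-ratio cutoff MAX_ASPECT_RATIO and the function name keep_image are hypothetical, since the excerpt does not state the actual thresholds.

```python
MIN_LONGEST_SIDE = 256   # stated in the caption
MAX_ASPECT_RATIO = 4.0   # hypothetical value; the real cutoff is not given in this excerpt


def keep_image(width: int, height: int,
               min_longest_side: int = MIN_LONGEST_SIDE,
               max_aspect_ratio: float = MAX_ASPECT_RATIO) -> bool:
    """Return True if an image passes the size and aspect-ratio checks."""
    if width <= 0 or height <= 0:
        return False
    # Discard images whose longest side is below the minimum.
    if max(width, height) < min_longest_side:
        return False
    # Discard images with an extreme aspect ratio (long side / short side).
    aspect = max(width, height) / min(width, height)
    return aspect <= max_aspect_ratio


if __name__ == "__main__":
    for size in [(587, 479), (200, 255), (4000, 300)]:
        print(size, keep_image(*size))
```

The VLM-based quality and safety filtering mentioned at the end of the caption is not modeled here.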
Figure 5: Qualitative examples of similar image pairs across SSCD similarity ranges. Each group shows nearest-neighbor image pairs within the indicated SSCD similarity interval. At lower thresholds, similar pairs often contain visually related but distinct images, including changes in pose, viewpoint, or object identity. At higher thresholds, pairs increasingly correspond to near-duplicates, but visible differences …
Figure 6: Image collision models. SSCD-based duplicate removals follow a power-law trend across subset sizes and similarity thresholds. Extrapolating to the 110M-image source pool shows that θ = 0.95 is estimated to remove 9.62 × 10^6 images, leaving approximately 1.01 × 10^8 images. … choice strongly affects how many images are removed. We therefore build predictive collision models on smaller subsets before running …
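One way to build such a collision model is a least-squares line in log-log space, which corresponds to a power law removed ≈ a · N^b. The sketch below assumes that functional form and uses synthetic subset measurements generated from made-up coefficients; none of the numbers are from the paper, and the function names are illustrative.

```python
import math


def fit_power_law(sizes, removed):
    """Fit removed = a * size**b by ordinary least squares on log-log values."""
    xs = [math.log(n) for n in sizes]
    ys = [math.log(r) for r in removed]
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = math.exp(mean_y - b * mean_x)
    return a, b


def predict_removed(a, b, size):
    """Extrapolate the fitted power law to a new pool size."""
    return a * size ** b


# Synthetic placeholder measurements (NOT the paper's data).
subset_sizes = [1e5, 1e6, 1e7]
removed_counts = [2.0e-3 * n ** 1.2 for n in subset_sizes]

a, b = fit_power_law(subset_sizes, removed_counts)
estimate = predict_removed(a, b, 110e6)  # 110M-image source pool from the caption

if __name__ == "__main__":
    print(f"a={a:.4g}, b={b:.4g}, estimated removals at 110M: {estimate:.3g}")
```

In practice one such fit would be made per similarity threshold θ, so that the threshold can be chosen before running deduplication on the full pool.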
Figure 7: Captioning model selection. We evaluate Qwen3-VL-Instruct models on the GPIC captioning microbenchmark across five caption-quality criteria and throughput. Throughput in images per second (1×H100) is shown in parentheses below each model. Qwen3-VL-4B-Instruct provides the best quality-throughput tradeoff: it matches or approaches the best quality scores across short, medium, and long captions while maintai…
Figure 8: GPIC shard statistics. We show the per-shard distribution of image counts and caption-type percentages for GPIC-Full. GPIC-Full is shuffled into 8000 approximately balanced shards, each containing ≈ 12,500 images and preserving the target caption mixture of 1% tag, 45% short, 45% medium, and 9% long captions. Benchmark scales. We divide the GPIC train set into three nested tiers: GPIC-Nano with 1M images, …
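The shard figures quoted in the captions are easy to cross-check with integer arithmetic. The sketch below assumes the 8,000 shards hold the 100M training images and that 12.9TB means decimal terabytes covering those same shards; both are assumptions, since the excerpt does not say whether validation and test data are included.

```python
TRAIN_IMAGES = 100_000_000   # GPIC-Full training examples
NUM_SHARDS = 8_000
TOTAL_BYTES = 12.9e12        # 12.9 TB, assuming decimal terabytes

# Target caption mixture from the Figure 8 caption, in percent.
CAPTION_MIX_PERCENT = {"tag": 1, "short": 45, "medium": 45, "long": 9}

images_per_shard = TRAIN_IMAGES // NUM_SHARDS
bytes_per_shard = TOTAL_BYTES / NUM_SHARDS

# Expected number of captions of each type in one balanced shard.
captions_per_shard = {
    kind: images_per_shard * pct // 100
    for kind, pct in CAPTION_MIX_PERCENT.items()
}

if __name__ == "__main__":
    print("images per shard:", images_per_shard)
    print(f"approx. shard size: {bytes_per_shard / 1e9:.2f} GB")
    print("captions per shard:", captions_per_shard)
```

The per-shard image count matches the ≈ 12,500 stated in the caption under these assumptions.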
Figure 9: Comparison of FID and FD-DINOv2 on ImageNet-1K. ImageNet-1K FID is saturated: several models achieve lower FID than the distance between 50K held-out real ImageNet-1K images and the ImageNet-1K training set. By contrast, FD-DINOv2 remains unsaturated: all evaluated models have higher FD-DINOv2 than the corresponding held-out real-image distance, including models trained with DINOv2 features. Dotted lines i…
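FID and FD-DINOv2 share the same formula and differ only in the feature extractor: both fit a Gaussian to real and generated features and report the squared Fréchet distance ||μ1 − μ2||² + Tr(Σ1 + Σ2 − 2(Σ1Σ2)^{1/2}). The sketch below is a deliberately simplified version that assumes diagonal covariances, where the trace term reduces to a sum of (√σ1² − √σ2²)² per dimension. Real evaluations use full covariance matrices and a matrix square root, so this is an illustration of the quantity, not a replacement for the paper's evaluation toolkit.

```python
import math


def mean_and_variance(features):
    """Per-dimension mean and unbiased variance of a list of feature vectors."""
    n = len(features)
    dim = len(features[0])
    mu = [sum(f[j] for f in features) / n for j in range(dim)]
    var = [sum((f[j] - mu[j]) ** 2 for f in features) / (n - 1) for j in range(dim)]
    return mu, var


def frechet_distance_diag(mu1, var1, mu2, var2):
    """Squared Frechet distance between two Gaussians with diagonal covariance."""
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    # For diagonal covariances, Tr(S1 + S2 - 2 (S1 S2)^{1/2}) = sum (sqrt(s1) - sqrt(s2))^2.
    cov_term = sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(var1, var2))
    return mean_term + cov_term


if __name__ == "__main__":
    real = [[0.0, 1.0], [1.0, 3.0], [2.0, 2.0]]   # toy "real" features
    fake = [[0.5, 1.0], [1.5, 2.0], [2.5, 4.0]]   # toy "generated" features
    print(frechet_distance_diag(*mean_and_variance(real), *mean_and_variance(fake)))
```

The held-out real-image distance used as the saturation reference in the caption would be computed the same way, with held-out real features in place of generated ones.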
Figure 10: Pretraining loss for the JiT-T2I reference baseline [34] on GPIC-Full. We show training loss vs. iterations. The model is trained for one epoch on GPIC-Full (100M text-image pairs). Experiment Setup. We train JiT-T2I on GPIC-Full for one epoch at 256 × 256 resolution. The global batch size is 256. We use AdamW with learning rate 10^-4, betas 0.9 and 0.95, and no weight decay. We use a constant learning-r…
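Two things in this setup can be made concrete. First, one epoch over 100M pairs at global batch size 256 fixes the number of optimizer steps. Second, a flow matching objective needs a noisy input and a regression target for each training example. The sketch below uses the common linear interpolant between noise (t = 0) and data (t = 1) with a velocity target; this is a generic formulation chosen for illustration, and the excerpt does not confirm that JiT-T2I uses this exact time convention or prediction target.

```python
TRAIN_PAIRS = 100_000_000   # GPIC-Full, one epoch
GLOBAL_BATCH = 256          # from the caption

steps_per_epoch = TRAIN_PAIRS // GLOBAL_BATCH


def flow_matching_pair(data, noise, t):
    """Point on the straight path from noise to data, and the path's velocity."""
    x_t = [(1.0 - t) * n + t * d for d, n in zip(data, noise)]
    velocity = [d - n for d, n in zip(data, noise)]
    return x_t, velocity


def mse(prediction, target):
    """Mean squared error, the usual regression loss for the velocity target."""
    return sum((p - q) ** 2 for p, q in zip(prediction, target)) / len(prediction)


if __name__ == "__main__":
    data = [0.5, -1.0, 0.25]    # toy "pixels"
    noise = [1.0, 0.0, -0.5]    # toy Gaussian sample, fixed for reproducibility
    x_t, v = flow_matching_pair(data, noise, t=0.5)
    print("optimizer steps in one epoch:", steps_per_epoch)
    print("x_t:", x_t, "target velocity:", v)
    print("loss of a zero prediction:", mse([0.0] * len(v), v))
```

In pixel-space training, data would be the flattened 256 × 256 image rather than a latent code.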
Figure 11: JiT-T2I samples after training on GPIC-Full for one epoch. We show generated images for prompts in the held-out Test-50K subset. Each group contains a real test image, the corresponding text prompt, and JiT-T2I generations sampled with classifier-free guidance scales CFG = 1.75, 4.00, and 6.25, respectively. The examples span diverse object-centric and scene-level prompts, including animals, vehicles, nat…
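Classifier-free guidance combines a text-conditioned and an unconditional model output at sampling time. The sketch below uses the widely used convention in which a scale of 1 returns the conditional prediction unchanged and larger scales extrapolate away from the unconditional one. Whether the paper's CFG values follow this convention or the alternative (1 + w) parameterization is not stated in the excerpt, so treat the mapping as an assumption; apply_cfg is an illustrative name.

```python
def apply_cfg(cond, uncond, scale):
    """Guided prediction: uncond + scale * (cond - uncond), elementwise."""
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


if __name__ == "__main__":
    cond = [1.0, 0.5]     # toy conditional model output
    uncond = [0.0, 0.5]   # toy unconditional model output
    for scale in (1.75, 4.00, 6.25):   # scales listed in the caption
        print(scale, apply_cfg(cond, uncond, scale))
```

Dimensions where the two predictions agree are unaffected by the scale, which is why guidance mainly amplifies prompt-specific content.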
Original abstract

Studying scalable methods for visual generative modeling requires large, accessible, and stable datasets. We introduce GPIC, a Giant Permissive Image Corpus of approximately 28 trillion pixels. GPIC comprises diverse internet images captioned by a state-of-the-art vision-language model, including 100M training, 200K validation, and 1M test examples. Moreover, all GPIC images are permissively licensed for both research and commercial use. GPIC is safety-filtered, deduplicated, and centrally hosted on Hugging Face. We provide a benchmarking protocol for generative modeling on GPIC. Finally, we provide a reference baseline for pixel-space flow matching on GPIC. Our dataset, benchmark, and models are available at https://huggingface.co/datasets/stanford-vision-lab/gpic. Evaluation toolkit and code are available at https://gpic.stanford.edu

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces GPIC, a Giant Permissive Image Corpus comprising ~100M training images (plus 200K validation and 1M test examples) sourced from the internet and captioned by a state-of-the-art vision-language model, totaling approximately 28 trillion pixels. All images are asserted to be permissively licensed for research and commercial use; the corpus is safety-filtered, deduplicated, and centrally hosted on Hugging Face. A benchmarking protocol for generative modeling is provided along with a reference baseline using pixel-space flow matching. The dataset, benchmark, and models are released publicly.

Significance. If the licensing, filtering, and caption-quality claims are substantiated, GPIC would offer a valuable large-scale, openly accessible resource for scalable visual generative modeling research, addressing limitations of existing restricted datasets. The explicit public release of the full dataset, evaluation toolkit, and code on Hugging Face and the project site is a clear strength that supports reproducibility and community use.

major comments (2)
  1. [Abstract] The central utility claim, that GPIC enables effective training of generative models, rests on the unverified assumption that captions from the state-of-the-art VLM are sufficiently accurate and detailed; no human evaluation, image-caption fidelity metrics (e.g., CLIPScore or human preference studies), or downstream ablation comparing VLM captions to human captions is supplied.
  2. [Abstract and dataset description] No quantitative details or verification steps are given for licensing checks across the full 100M images, the effectiveness of the safety filter, or the deduplication procedure; these omissions are load-bearing for the permissiveness and safety assertions that distinguish GPIC.
minor comments (1)
  1. [Abstract] The abstract states 'approximately 28 trillion pixels' without providing a per-split breakdown or total pixel count verification method.
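A rough consistency check on the pixel total is possible from numbers already on this page: the mean image height and width in the Figure 2 caption and the split sizes in the abstract. The sketch below multiplies the two means, which is only an approximation because the mean of height × width generally differs from the product of the means, and it assumes the quoted averages apply to every split. It is a sanity check, not the authors' counting method.

```python
MEAN_HEIGHT = 479   # pixels, from the Figure 2 caption
MEAN_WIDTH = 587    # pixels, from the Figure 2 caption

SPLITS = {"train": 100_000_000, "validation": 200_000, "test": 1_000_000}


def estimate_pixels(num_images: int) -> int:
    """Approximate pixel count as images * mean height * mean width."""
    return num_images * MEAN_HEIGHT * MEAN_WIDTH


per_split = {name: estimate_pixels(count) for name, count in SPLITS.items()}
total_pixels = sum(per_split.values())

if __name__ == "__main__":
    for name, pixels in per_split.items():
        print(f"{name}: {pixels / 1e12:.2f} trillion pixels")
    print(f"total: {total_pixels / 1e12:.2f} trillion pixels")
```

Under these assumptions the estimate lands in the high-28-trillion range, which is compatible with the abstract's "approximately 28 trillion pixels" but does not substitute for an exact per-split count.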

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We address each major comment below and indicate planned revisions to strengthen the manuscript.

Point-by-point responses
  1. Referee: [Abstract] The central utility claim, that GPIC enables effective training of generative models, rests on the unverified assumption that captions from the state-of-the-art VLM are sufficiently accurate and detailed; no human evaluation, image-caption fidelity metrics (e.g., CLIPScore or human preference studies), or downstream ablation comparing VLM captions to human captions is supplied.

    Authors: We agree that the manuscript would be strengthened by additional evidence on caption quality. The current version relies on a state-of-the-art VLM without including CLIPScore, human studies, or ablations against human captions. In revision we will add CLIPScore computed on a representative subset of images, qualitative caption examples, and an explicit discussion of this limitation. A full-scale human preference study or exhaustive ablation is not feasible at this corpus size, but the provided baseline training results demonstrate practical utility of the captions as-is. revision: partial

  2. Referee: [Abstract and dataset description] No quantitative details or verification steps are given for licensing checks across the full 100M images, the effectiveness of the safety filter, or the deduplication procedure; these omissions are load-bearing for the permissiveness and safety assertions that distinguish GPIC.

    Authors: We acknowledge that the abstract and high-level dataset description omit quantitative verification statistics. The full manuscript describes the overall pipeline, but we will expand the relevant section with concrete numbers: the fraction of images removed by the safety filter, the deduplication rate and method (e.g., perceptual hash or embedding similarity threshold), and the licensing verification approach (source-level permissive license filtering with automated metadata checks). These additions will directly support the permissiveness and safety claims. revision: yes
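The "embedding similarity threshold" option mentioned in this response can be illustrated with a greedy pass over image descriptors: keep an image only if its cosine similarity to every already-kept image is below a threshold θ. The figure captions on this page indicate the paper uses SSCD copy-detection features and discusses θ = 0.95; the toy vectors and the quadratic-time loop below are assumptions for illustration only, and a corpus of this size would need approximate nearest-neighbor search instead.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def greedy_dedup(embeddings, threshold=0.95):
    """Return indices of images kept after removing near-duplicates.

    An image is dropped if it is at least `threshold`-similar to any
    image that was kept before it (order-dependent, O(n^2) comparisons).
    """
    kept = []
    for i, emb in enumerate(embeddings):
        if all(cosine_similarity(emb, embeddings[j]) < threshold for j in kept):
            kept.append(i)
    return kept


if __name__ == "__main__":
    toy = [[1.0, 0.0], [0.999, 0.01], [0.0, 1.0]]  # second vector nearly duplicates the first
    kept = greedy_dedup(toy, threshold=0.95)
    print("kept indices:", kept)
    print("deduplication rate:", 1 - len(kept) / len(toy))
```

The deduplication rate printed at the end is the kind of quantitative statistic the referee asks for.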

Circularity Check

0 steps flagged

No circularity: direct dataset release with no derivations or predictions

Full rationale

The paper is a dataset introduction paper that releases GPIC (100M captioned images, splits, safety filtering, and a hosted baseline). It contains no equations, no claimed predictions, no fitted parameters, and no derivation chain that could reduce to self-referential inputs. The central claim is the existence and permissiveness of the corpus itself, which is externally verifiable by download and inspection rather than by any internal reduction. Self-citations, if present, are not load-bearing for any result. This is the standard non-circular outcome for a data release.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a dataset curation and release paper containing no mathematical derivations, fitted parameters, background axioms, or postulated entities.

pith-pipeline@v0.9.1-grok · 5705 in / 1118 out tokens · 45211 ms · 2026-06-29T07:34:19.372847+00:00 · methodology


Reference graph

Works this paper leans on

95 extracted references · 21 canonical work pages · 14 internal anchors

  1. [1]

    Wan: Open and Advanced Large-Scale Video Generative Models

    Team Wan, Ang Wang, Baole Ai, Bin Wen, Chaojie Mao, Chen-Wei Xie, Di Chen, Feiwu Yu, Haiming Zhao, Jianxiao Yang, et al. Wan: Open and advanced large-scale video generative models. arXiv preprint arXiv:2503.20314, 2025. 1

  2. [2]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022

  3. [3]

    Photorealistic text-to-image diffusion models with deep language understanding

    Chitwan Saharia, William Chan, Saurabh Saxena, Lala Li, Jay Whang, Emily L Denton, Kamyar Ghasemipour, Raphael Gontijo Lopes, Burcu Karagol Ayan, Tim Salimans, et al. Photorealistic text-to-image diffusion models with deep language understanding. Advances in Neural Information Processing Systems, 35:36479–36494, 2022. 1

  4. [4]

    Qwen-Image-VAE-2.0 Technical Report

    Zekai Zhang, Deqing Li, Kuan Cao, Yujia Wu, Chenfei Wu, Yu Wu, Liang Peng, Hao Meng, Jiahao Li, Jie Zhang, et al. Qwen-image-vae-2.0 technical report. arXiv preprint arXiv:2605.13565,

  5. [5]

    Video generation models as world simulators

    Tim Brooks, Bill Peebles, Connor Holmes, Will DePue, Yufei Guo, Li Jing, David Schnurr, Joe Taylor, Troy Luhman, Eric Luhman, Clarence Ng, Ricky Wang, and Aditya Ramesh. Video generation models as world simulators. 2024. URL https://openai.com/research/video-generation-models-as-world-simulators. 1

  6. [6]

    Nano banana 2: Google’s latest ai image generation model

    Google. Nano banana 2: Google’s latest ai image generation model. https://blog.google/innovation-and-ai/technology/ai/nano-banana-2/, February 2026. Accessed: 2026-05-24. 1

  7. [7]

    Large Scale GAN Training for High Fidelity Natural Image Synthesis

    Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high fidelity natural image synthesis. CoRR, abs/1809.11096, 2018. URL http://arxiv.org/abs/1809.11096. 1

  8. [8]

    Neural Discrete Representation Learning

    Aäron van den Oord, Oriol Vinyals, and Koray Kavukcuoglu. Neural discrete representation learning. CoRR, abs/1711.00937, 2017. URL http://arxiv.org/abs/1711.00937. 1

  9. [9]

    Taming transformers for high-resolution image synthesis, 2021

    Patrick Esser, Robin Rombach, and Björn Ommer. Taming transformers for high-resolution image synthesis, 2021. URL https://arxiv.org/abs/2012.09841. 1

  10. [10]

    Scalable Diffusion Models with Transformers

    William Peebles and Saining Xie. Scalable diffusion models with transformers, 2023. URL https://arxiv.org/abs/2212.09748. 1

  11. [11]

    Diffusion transformers with representation autoencoders, 2025

    Boyang Zheng, Nanye Ma, Shengbang Tong, and Saining Xie. Diffusion transformers with representation autoencoders, 2025. 1

  12. [12]

    Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion, 2025

    Emiel Hoogeboom, Thomas Mensink, Jonathan Heek, Kay Lamerigts, Ruiqi Gao, and Tim Salimans. Simpler diffusion (sid2): 1.5 fid on imagenet512 with pixel-space diffusion, 2025. URL https://arxiv.org/abs/2410.19324. 8

  13. [13]

    Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models

    Jingfeng Yao, Bin Yang, and Xinggang Wang. Reconstruction vs. generation: Taming optimization dilemma in latent diffusion models, 2025. URL https://arxiv.org/abs/2501.01423

  14. [14]

    Visual Autoregressive Modeling: Scalable Image Generation via Next-Scale Prediction

    Keyu Tian, Yi Jiang, Zehuan Yuan, Bingyue Peng, and Liwei Wang. Visual autoregressive modeling: Scalable image generation via next-scale prediction, 2024. URL https://arxiv.org/abs/2404.02905. 1

  15. [15]

    Datacomp: In search of the next generation of multimodal datasets

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, Eyal Orgad, Rahim Entezari, Giannis Daras, Sarah Pratt, Vivek Ramanujan, Yonatan Bitton, Kalyani Marathe, Stephen Mussmann, Richard Vencu, Mehdi Cherti, Ranjay Krishna, Pang Wei Koh, Olga Saukh, Alexan...

  16. [16]

    YFCC100M: the new data in multimedia research

    Bart Thomee, David A. Shamma, Gerald Friedland, Benjamin Elizalde, Karl Ni, Douglas Poland, Damian Borth, and Li-Jia Li. YFCC100M: the new data in multimedia research. Communications of the ACM, 59(2):64–73, January 2016. ISSN 1557-7317. doi: 10.1145/2812802. URL http://dx.doi.org/10.1145/2812802. 3, 10

  17. [17]

    Laion-5b: an open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, Patrick Schramowski, Srivatsa Kundurthy, Katherine Crowson, Ludwig Schmidt, Robert Kaczmarczyk, and Jenia Jitsev. Laion-5b: an open large-scale dataset for training next generation image-text models....

  18. [18]

    img2dataset: Easily turn large sets of image urls to an image dataset

    Romain Beaumont. img2dataset: Easily turn large sets of image urls to an image dataset. https://github.com/rom1504/img2dataset, 2021. 3

  19. [19]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  20. [20]

    Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models

    George Stein, Jesse Cresswell, Rasa Hosseinzadeh, Yi Sui, Brendan Ross, Valentin Villecroze, Zhaoyan Liu, Anthony L Caterini, Eric Taylor, and Gabriel Loaiza-Ganem. Exposing flaws of generative model evaluation metrics and their unfair treatment of diffusion models. In Advances in Neural Information Processing Systems, volume 36, 2023. 3, 7, 16, 17

  21. [21]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009. 3

  22. [22]

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. The open images dataset v4: Unified image classification, object detection, and visual relationship detection at scale.International Journal of Computer Vision, 128...

  23. [23]

    A self-supervised descriptor for image copy detection

    Ed Pizzi, Sreya Dutta Roy, Sugosh Nagavara Ravindra, Priya Goyal, and Matthijs Douze. A self-supervised descriptor for image copy detection. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14532–14542, 2022. 4

  24. [24]

    Efficient memory management for large language model serving with pagedattention

    Woosuk Kwon, Zhuohan Li, Siyuan Zhuang, Ying Sheng, Lianmin Zheng, Cody Hao Yu, Joseph Gonzalez, Hao Zhang, and Ion Stoica. Efficient memory management for large language model serving with pagedattention. In Proceedings of the 29th Symposium on Operating Systems Principles, pages 611–626, 2023. 6

  25. [25]

    Sglang: Efficient execution of structured language model programs

    Lianmin Zheng, Liangsheng Yin, Zhiqiang Xie, Chuyue Sun, Jeff Huang, Cody H Yu, Shiyi Cao, Christos Kozyrakis, Ion Stoica, Joseph E Gonzalez, et al. Sglang: Efficient execution of structured language model programs. Advances in Neural Information Processing Systems, 37:62557–62583, 2024. 6

  26. [26]

    GANs Trained by a Two Time-Scale Update Rule Converge to a Local Nash Equilibrium

    Martin Heusel, Hubert Ramsauer, Thomas Unterthiner, Bernhard Nessler, and Sepp Hochreiter. GANs trained by a two time-scale update rule converge to a local Nash equilibrium, 2018. URL https://arxiv.org/abs/1706.08500. 7

  27. [27]

    Going Deeper with Convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015. URL https://arxiv.org/abs/1409.4842. 7

  28. [28]

    Improved precision and recall metric for assessing generative models

    Tuomas Kynkäänniemi, Tero Karras, Samuli Laine, Jaakko Lehtinen, and Timo Aila. Improved precision and recall metric for assessing generative models. In Neural Information Processing Systems, 2019. URL https://api.semanticscholar.org/CorpusID:118648975. 7

  29. [29]

    Mehdi S. M. Sajjadi, Olivier Bachem, Mario Lučić, Olivier Bousquet, and Sylvain Gelly. Assessing generative models via precision and recall. arXiv, abs/1806.00035, 2018. URL https://api.semanticscholar.org/CorpusID:44104089

  30. [30]

    Reliable fidelity and diversity metrics for generative models

    Muhammad Ferjad Naeem, Seong Joon Oh, Youngjung Uh, Yunjey Choi, and Jaejun Yoo. Reliable fidelity and diversity metrics for generative models. In International Conference on Machine Learning, 2020. URL https://api.semanticscholar.org/CorpusID:211259260. 7

  31. [31]

    Oriane Siméoni, Huy V . V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie...

  32. [32]

    Sigmoid Loss for Language Image Pre-Training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training, 2023. URL https://arxiv.org/abs/2303.15343. 8

  33. [33]

    Back to Basics: Let Denoising Generative Models Denoise

    Tianhong Li and Kaiming He. Back to basics: Let denoising generative models denoise. arXiv preprint arXiv:2511.13720, 2025. 9

  34. [34]

    PixelGen: Improving Pixel Diffusion with Perceptual Supervision

    Zehong Ma, Ruihan Xu, and Shiliang Zhang. Pixelgen: Pixel diffusion beats latent diffusion with perceptual loss. arXiv preprint arXiv:2602.02493, 2026. 9

  35. [35]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report, 2025. URL https://arxiv.org/abs/2505.09388. 9

  36. [36]

    Pillow (pil fork) documentation, 2015

    Alex Clark. Pillow (pil fork) documentation, 2015. URL https://buildmedia.readthedocs.org/media/pdf/pillow/latest/pillow.pdf. 16

  37. [37]

    On aliased resizing and surprising subtleties in GAN evaluation

    Gaurav Parmar, Richard Zhang, and Jun-Yan Zhu. On aliased resizing and surprising subtleties in GAN evaluation. In Proceedings of the Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2022. 16

  38. [38]

    Center crop along the longer edge to form a square image

  39. [39]

    Olympic Games

    Bicubic downsampling to 256×256 from the Pillow library [36]. We note that popular Python image libraries use different bicubic interpolation kernels. Our choice of Pillow is consistent with prior work [20, 37]. B.2 Additional Oracle Reference Metrics GPIC Subset FD↓Precision↑Recall↑Density↑Coverage↑ Full 0.07 0.757 0.762 0.973 0.966 Lite 0.07 0.762 0.768...

  40. [40]

    MAIN SUBJECTS: - Include the 1–3 most important people, animals, objects, or landmarks

  41. [41]

    red car",

    ATTRIBUTES: - Include visible attributes (color, texture, condition) (example: "red car", "snowy road", "bare trees")

  42. [42]

    forest",

    SETTING: - Include 1–2 coarse environment tags (example: "forest", "street", "kitchen")

  43. [43]

    running",

    ACTION (RARE): - Only include if extremely obvious and short (e.g., "running", "sitting"). - Prefer nouns over verbs. COMPRESSION RULE: - Do NOT try to describe everything. - Include only key visual elements. - Missing details are acceptable. COUNTING: - Avoid exact numbers unless extremely obvious. - Prefer plural forms (e.g., "trees"). TEXT IN IMAGE: - ...

  44. [44]

    Output ONLY the tag list

  45. [45]

    No sentences, no explanations

  46. [46]

    No punctuation except commas

  47. [47]

    snowy road, forest, bare trees, winter, cloudy

    All lowercase. EXAMPLES (style reference only): - "snowy road, forest, bare trees, winter, cloudy" - "white cat, sunlight, cozy" - "city street, night, race car, neon lights" USER MESSAGE Write a keyword-style caption (tag-style) for the image shown. Figure F.1: Prompt used to generate tag-styled captions for GPIC images. 21 VLM Captioning: Short MAIN INS...

  48. [48]

    MAIN SUBJECTS: - Mention the 1–3 most important people/animals/objects/landmarks

  49. [49]

    red tie",

    SIMPLE DETAILS: - Add 1–3 simple visible details that help identify them (example: "red tie", "blue bottle", "white car")

  50. [50]

    street",

    SETTING: - Add ONE short setting word (example: "street", "park", "kitchen")

  51. [51]

    walking",

    MAIN ACTION (ONLY if clearly visible): - Use a simple verb (example: "walking", "sitting", "holding hands"). - If the action is not clearly visible, DO NOT include action(s). COUNTING (STRICT): - Use an exact number ONLY if it is very easy and unambiguous to count. - If an exact number is unclear, do NOT guess. - You may use "several" or "a group of" only...

  52. [52]

    Output 1–2 sentences (max 2)

  53. [53]

    Start immediately with the main subjects (no meta phrases)

  54. [55]

    photo",

    Do NOT mention the words "photo", "image" or "picture"

  55. [56]

    Two cyclists ride on a paved road

    Use neutral, literal language. LENGTH (STRICT): - Aim for ~12–25 words total. - Keep sentences short and easy to read. EXAMPLES (style reference only): - "Two cyclists ride on a paved road." - "A white cat lies on a bed near a window." - "A bowl of noodles sits on a table with chopsticks." USER MESSAGE Write a short caption (1–2 sentences) for the image s...

  56. [57]

    Two cyclists

    Start immediately with the main visible entity/entities (no meta phrases). Example starts: "Two cyclists ...", "Close-up of ...", "Passengers ...", "A street ..."

  57. [58]

    Include the following (when clearly visible): - main objects/entities (people/animals/vehicles/objects/structures) - key visible attributes (color/material/clothing/object type) - scene context (indoor/outdoor + setting such as street/room/park/store/stadium) - grounded spatial layout (foreground/background/left/right/next to/in front of)

  58. [59]

    several" or

    Count entities ONLY when clearly countable. If an exact number is unclear, do NOT guess. You may use "several" or "a group of" only when clearly correct

  59. [60]

    Otherwise describe a static configuration

    Describe actions/poses ONLY when directly supported visually. Otherwise describe a static configuration

  60. [61]

    NOT VISIBLE

    Do NOT write any text from the image. Exception: include visible text ONLY if it is large, clearly readable, and necessary to identify the main subject or scene. EDGE CASES: - If the image is blank OR the main content is not visible/understandable (for example, all black/white, too blurry, too dark, overexposed, or corrupted), output exactly: "NOT VISIBLE...

  61. [62]

    Output TWO sentences by default

  62. [63]

    Use THREE sentences ONLY when absolutely required to identify the scene clearly

  63. [64]

    Aim for ~25–60 words total

  64. [67]

    image",

    Do NOT mention the words "image", "photo", or "picture"

  65. [68]

    Use neutral, literal language

  66. [69]

    - "Close-up of a white cat lying on a bed near a window. Soft daylight falls across the blanket, and a curtain is visible along the edge of the frame

    Be informative but do NOT attempt exhaustive object listing. EXAMPLES (style reference only): - "Two cyclists ride on a paved road with dashed lane markings. An orange barrier lines the left side, with several people standing behind it on the sidewalk. Trees and buildings appear in the background." - "Close-up of a white cat lying on a bed near a window. ...

  67. [70]

    1.2) Cover important secondary elements, but do NOT attempt to list every small background object

    OBJECTS / ENTITIES (nodes) 1.1) Identify the main visible entities in the scene (people, animals, vehicles, objects, structures). 1.2) Cover important secondary elements, but do NOT attempt to list every small background object. 1.3) Prefer describing entities in a grounded order such as foreground → background when possible. 1.4) If multiple similar enti...

  68. [71]

    2.2) Do NOT guess precise brands, logos, or fine details unless the text/marking is clearly readable

    ATTRIBUTES (visible-only) 2.1) Describe visible attributes only when clearly observable: color, size, shape, material, texture, patterns. 2.2) Do NOT guess precise brands, logos, or fine details unless the text/marking is clearly readable

  69. [72]

    3.2) If a pose or action is not clearly verifiable, do NOT infer it

    POSE + ACTIONS (confidence-gated) 3.1) Describe poses (standing, sitting, leaning, arms extended, head direction) and actions (riding, walking, holding) ONLY when directly supported by clearly visible body position and/or physical contact with an object. 3.2) If a pose or action is not clearly verifiable, do NOT infer it. Instead describe what the body lo...

  70. [73]

    side-by-side

    RELATIONS / LAYOUT 4.1) Describe spatial layout using grounded relationships such as: left, right, top, bottom. 4.2) Do NOT overstate alignment or formation (e.g., do not say “side-by-side” unless clearly true). TEXT (OCR) REQUIREMENT:

  71. [74]

    If any text is visible anywhere (signs, labels, screens, posters, documents, packaging, subtitles, watermarks, UI elements, logos with words, etc.), you MUST try to transcribe it

  72. [75]

    Reproduce visible text exactly as written, preserving casing, punctuation, numbers, symbols, and spelling

  73. [76]

    Only transcribe text that is clearly legible

  74. [77]

    If text is present but not fully readable, do NOT guess; simply say that text is present

  75. [78]

    NOT VISIBLE

    When including OCR text, place it naturally into the caption (prefer Sentences 3–5), so the caption remains coherent and readable. EDGE CASES: - If the image is blank OR the main content is not visible/understandable (for example, all black/white, too blurry, too dark, overexposed, or corrupted), output exactly: "NOT VISIBLE." OUTPUT RULES (STRICT):

  76. [79]

    Produce exactly 5–7 sentences

  77. [80]

    Use this sentence structure (STRICT): - Sentences 1–3: main subjects + key attributes + main actions + core setting - Sentences 4–6: layout + secondary elements + background context (include OCR here when possible) - Sentence 7 (optional): extra fine details that help reconstruction

  78. [81]

    The sentences must be information-dense rather than brief

  79. [82]

    Do NOT use bullet points, lists, headings, or JSON

  80. [83]

    Do NOT include disclaimers or meta commentary

Showing first 80 references.