pith. machine review for the scientific record.

arxiv: 2604.25370 · v2 · submitted 2026-04-28 · 💻 cs.CV · cs.AI

Recognition: unknown

GPT-Image-2 in the Wild: A Twitter Dataset of Self-Reported AI-Generated Images from the First Week of Deployment

Ethan Traister, Jenny Wu, Kewen Xie, Kidus Zewde, Simiao Ren, Tommy Duong, Xingyu Shen, Yuchen Zhou, Zikang Zhang

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 03:22 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords GPT-image-2 · AI-generated images · Twitter dataset · C2PA provenance · image curation pipeline · social media AI content · zero-shot classification · content credentials

The pith

A curated set of 10,217 GPT-image-2 pictures from Twitter shows that the platform's upload pipeline erases cryptographic proof of AI origin.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper assembles and publicly releases the first large collection of GPT-image-2 images posted on Twitter in the days after the model's release. The authors applied multilingual text searches, automated checks for the platform's 'Made with AI' label, and model-name matching to turn 27,662 raw posts into 10,217 verified examples. They then describe the pictures through subject categories, readable text, detected faces, and topic clusters. A central finding is that every image lost its C2PA content credentials during upload, so cryptographic checks cannot confirm AI origin on social platforms.

Core claim

Through a pipeline of multilingual keyword filters, browser-driven badge verification, and model-name matching applied to posts from the first six days after release, the authors produced a dataset of 10,217 confirmed GPT-image-2 images together with four descriptive analyses and the observation that Twitter's CDN removes C2PA provenance data on every upload.

What carries the argument

The multi-stage curation pipeline that combines multilingual text heuristics, automated 'Made with AI' badge detection, and model-name variant matching to isolate genuine GPT-image-2 images.
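The staged logic of the text stage can be sketched in miniature. This is a hypothetical illustration: the keyword cues and name variants below are invented for the example (the paper's actual rule set lives in its released curation code), and the badge-verification stage, which uses browser automation, is out of scope here.

```python
import re

# Hypothetical name variants and keyword cues -- the paper's real rule
# set is multilingual (EN/JA/ZH) and far more extensive.
NAME_VARIANTS = re.compile(r"gpt[\s\-_]?image[\s\-_]?2", re.IGNORECASE)
POSITIVE_CUES = re.compile(
    r"made with ai|generated (?:by|with)|画像生成|AI生成", re.IGNORECASE
)
NEGATIVE_CUES = re.compile(r"midjourney|stable diffusion|dall[\s\-]?e", re.IGNORECASE)

def classify_post(text: str) -> str:
    """Text-stage heuristic: route a post to confirmed / uncertain / rejected.

    Posts naming a competing generator are rejected outright; posts naming
    the model plus a generation cue are confirmed; posts naming only the
    model stay uncertain for the downstream badge check (not shown).
    """
    if NEGATIVE_CUES.search(text):
        return "rejected"
    if NAME_VARIANTS.search(text):
        return "confirmed" if POSITIVE_CUES.search(text) else "uncertain"
    return "rejected"
```

The three-way split mirrors Figure 2's description: only the "uncertain" class is handed to the Playwright/Chromium badge check.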

If this is right

  • Public researchers now have a large, timestamped sample of real-world GPT-image-2 use for studying generation patterns.
  • C2PA-based verification cannot work for images that have passed through Twitter's upload process.
  • Dataset analyses show that most posted GPT-image-2 pictures contain readable text and many contain human faces.
  • Semantic clustering of the images yields 137 distinct visual topics that future studies can track over time.
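The C2PA stripping claim is directly checkable. A crude presence test is sketched below, assuming only that C2PA manifests ride in JUMBF boxes (JPEG APP11 segments) whose labels contain byte strings such as `c2pa` or `jumb`; this detects presence, not validity, and real verification requires the official C2PA SDK.

```python
def has_c2pa_markers(data: bytes) -> bool:
    """Heuristic presence check: scan raw image bytes for JUMBF/C2PA labels.

    C2PA manifests are embedded in JUMBF boxes (JPEG APP11 segments, or a
    'jumb' box in other containers) whose labels contain these strings.
    This does NOT validate signatures -- it only indicates whether any
    credential payload survived.
    """
    return any(marker in data for marker in (b"c2pa", b"jumb", b"jumd"))

def survived_upload(original: bytes, downloaded: bytes) -> str:
    """Compare pre- and post-upload copies of one image, as the paper's
    stripping observation implicitly does across the whole dataset."""
    before, after = has_c2pa_markers(original), has_c2pa_markers(downloaded)
    if before and not after:
        return "stripped"
    if before and after:
        return "preserved"
    return "never present"
```

Under this heuristic, the paper's finding is that every downloaded image lands in the "stripped" or "never present" bucket.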

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The stripping of provenance markers implies that platforms may need new, non-cryptographic signals if they want to label AI content at scale.
  • The high rate of text and faces in the images suggests the generator is frequently used for text-heavy or portrait-style outputs rather than abstract scenes.
  • Releasing the curation code alongside the images allows other groups to repeat the collection process for later model releases.

Load-bearing premise

The combination of text rules, badge checks, and name matching selects only real GPT-image-2 images and excludes almost all others.

What would settle it

A manual audit of several hundred randomly sampled images from the released dataset that finds a substantial fraction are not GPT-image-2 outputs, or the successful recovery of C2PA data from any image in the collection.

Figures

Figures reproduced from arXiv: 2604.25370 by Ethan Traister, Jenny Wu, Kewen Xie, Kidus Zewde, Simiao Ren, Tommy Duong, Xingyu Shen, Yuchen Zhou, Zikang Zhang.

Figure 1
Figure 1. A sample of 30 images from the GPT-Image-2 Twitter Dataset, illustrating the breadth of content: anime illustrations, photorealistic portraits, text-heavy infographics, fantasy scenes, product mockups, food, nature, and architecture. view at source ↗
Figure 2
Figure 2. End-to-end collection and curation pipeline (left to right). Stage 1 removes non-photo media and failed downloads; Stage 2 applies multilingual text heuristics yielding three classes. Uncertain tweets are further checked via Playwright/Chromium browser automation for Twitter's "Made with AI" badge, yielding 4,750 additional badge-confirmed images. view at source ↗
Figure 3
Figure 3. Daily count of confirmed GPT-image-2 images in the collection window. Peak activity occurs on April 22 (Day 2), reflecting global uptake following the US-timezone release on April 21. view at source ↗
Figure 4
Figure 4. Subject matter distribution across 10,217 images classified via CLIP zero-shot. Text-graphic content dominates; fantasy/surreal and photorealistic portraits follow. view at source ↗
Figure 5
Figure 5. Representative examples from the eight subject-matter categories classified by CLIP zero-shot. view at source ↗
Figure 6
Figure 6. Text presence across the 10,217 confirmed images. Over four-fifths (82.0%) contain machine-readable text, reflecting GPT-image-2's strong text-rendering capability. view at source ↗
Figure 7
Figure 7. Example images with high text density (≥20 OCR-detected regions), illustrating GPT-image-2's multilingual text rendering across poster, infographic, and typographic compositions. view at source ↗
Figure 8
Figure 8. Face count distribution (left), gender split (centre), and estimated age histogram (right) across 10,217 confirmed images. 59.2% of images contain at least one detected face (22,583 total faces). view at source ↗
Figure 9
Figure 9. Example images with multiple detected faces, spanning photorealistic portraits, animated characters, and group compositions. view at source ↗
Figure 10
Figure 10. UMAP projection of CLIP ViT-L/14 embeddings coloured by HDBSCAN cluster assignment (left) and by tweet language (right). 137 clusters emerge from 10,217 images; 33.2% are classified as noise, reflecting genuine visual heterogeneity. view at source ↗
Figure 11
Figure 11. Representative images from the four largest CLIP semantic clusters. Each cluster captures a visually coherent aesthetic family: anime group scenes (C7), typographic posters (C10), illustrated characters (C42), and photorealistic portraits (C37). view at source ↗
Figure 12
Figure 12. Aspect ratio distribution (left) and native-resolution breakdown (right) for all 10,217 confirmed images. Portrait is the plurality at 53.5%. "Other" resolutions reflect Twitter CDN resampling on upload. view at source ↗
read the original abstract

The release of GPT-image-2 by OpenAI marks a watershed moment in AI-generated imagery: the boundary between photographic reality and synthetic content has never been more difficult to discern. We introduce the GPT-Image-2 Twitter Dataset, the first published dataset of GPT-image-2 generated images, sourced from publicly available Twitter/X posts in the immediate aftermath of the model's April 21, 2026 release. Leveraging the Twitter API v2 and a multi-stage curation pipeline spanning multilingual text heuristics (English, Japanese, and Chinese), browser-automated Twitter "Made with AI" badge verification, and model name variant matching, we curate 10,217 confirmed GPT-image-2 images from 27,662 collected records over a six-day window. We characterize the dataset across four analyses: CLIP-based zero-shot subject taxonomy, OCR text legibility (82.0% of images contain detectable text), face detection (59.2% of images, 22,583 total faces), and semantic clustering (137 CLIP ViT-L/14 clusters). A key negative result is that C2PA content credentials are systematically stripped by Twitter's CDN on upload, rendering cryptographic provenance verification infeasible for social-media-sourced AI images. The dataset and all curation code are released publicly.
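The abstract's CLIP zero-shot taxonomy rests on one mechanism: assign each image the category whose text-prompt embedding is most cosine-similar to its image embedding. A toy sketch of that mechanism, with hand-made vectors standing in for real CLIP ViT-L/14 features and hypothetical category names:

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def zero_shot_label(image_emb, label_embs):
    """Pick the label whose text embedding is most cosine-similar to the
    image embedding -- the core of CLIP zero-shot classification. In the
    paper, both sides are CLIP ViT-L/14 features, with text prompts like
    'a photo of a {category}'; the vectors here are illustrative toys."""
    return max(label_embs, key=lambda name: cosine(image_emb, label_embs[name]))
```

A usage sketch: with `labels = {"photorealistic portrait": [1.0, 0.1, 0.0], "text graphic": [0.0, 1.0, 0.2]}`, an image embedding close to the first axis is assigned "photorealistic portrait".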

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents the GPT-Image-2 Twitter Dataset, the first public collection of self-reported GPT-Image-2 generated images from Twitter/X. Using the Twitter API v2 and a multi-stage curation pipeline (multilingual text heuristics in English/Japanese/Chinese, automated 'Made with AI' badge verification, and model-name variant matching), the authors collect 27,662 records over six days post-release (April 21, 2026) and curate 10,217 confirmed GPT-Image-2 images. They characterize the dataset via CLIP zero-shot subject taxonomy, OCR (82.0% contain detectable text), face detection (59.2% of images, 22,583 faces), and semantic clustering (137 CLIP ViT-L/14 clusters). A key negative finding is that C2PA content credentials are systematically stripped by Twitter's CDN. The dataset and curation code are released publicly.

Significance. If the curation pipeline can be shown to have high precision, the dataset would be a valuable, timely resource for studying early public adoption, content characteristics, and social-media diffusion of a frontier text-to-image model. The released code and data enable reproducibility, and the C2PA stripping observation is independently verifiable and has clear implications for provenance research. The descriptive analyses (OCR rates, face counts, clustering) provide concrete starting points for downstream work on AI-image detection or taxonomy. However, the absence of any validation metrics for the central 'confirmed' count substantially reduces the dataset's immediate utility as a GPT-Image-2-specific benchmark.

major comments (1)
  1. [Methods / curation pipeline] The multi-stage pipeline (text heuristics + badge verification + model-name matching) is presented as producing 10,217 'confirmed' GPT-Image-2 images, yet no precision, recall, false-positive rate, manual audit of a random sample, inter-annotator agreement, or comparison against a ground-truth set is reported. This directly undermines the headline claim and the dataset's claimed specificity, as false positives from other generative models or mislabeled posts cannot be quantified.
minor comments (2)
  1. [Data collection] The six-day collection window and exact API query parameters should be stated with timestamps and rate-limit handling details for reproducibility.
  2. [Data collection] Clarify whether the 27,662 collected records include duplicates or retweets and how they were deduplicated before curation.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed review. The primary concern about the lack of quantitative validation for the curation pipeline is a fair and substantive point that we address directly below. We have revised the manuscript to incorporate a manual audit and associated metrics, which we believe strengthens the paper while preserving its focus on the timely release of this dataset.

read point-by-point responses
  1. Referee: [Methods / curation pipeline] The multi-stage pipeline (text heuristics + badge verification + model-name matching) is presented as producing 10,217 'confirmed' GPT-Image-2 images, yet no precision, recall, false-positive rate, manual audit of a random sample, inter-annotator agreement, or comparison against a ground-truth set is reported. This directly undermines the headline claim and the dataset's claimed specificity, as false positives from other generative models or mislabeled posts cannot be quantified.

    Authors: We agree that the original manuscript did not report quantitative validation metrics for the multi-stage curation pipeline, which is a limitation. In the revised version, we have added a dedicated validation subsection describing a manual audit performed on a random sample of 500 images drawn from the final curated set. Two independent annotators reviewed each image and its associated post text for confirmation as GPT-Image-2 content, and the results—including observed precision and inter-annotator agreement—are now reported. We also explicitly discuss the inherent difficulty of estimating recall, as no exhaustive ground-truth set of all GPT-Image-2 posts on the platform exists. The pipeline's design, which layers independent signals (multilingual heuristics, automated badge verification, and model-name matching), is intended to prioritize specificity; the added audit provides empirical grounding for this claim and quantifies the risk of false positives from other models or mislabeled posts. revision: yes
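The audit described in this response reduces to two numbers: precision over the sampled images and inter-annotator agreement (Cohen's kappa). A minimal sketch, assuming boolean per-image labels from two annotators and one conservative precision convention (both annotators must confirm); the revised manuscript may define these differently.

```python
def audit_metrics(ann_a, ann_b):
    """Precision and Cohen's kappa for a two-annotator audit of a sample,
    where each label is True (confirmed GPT-image-2) or False.

    Precision here counts an image as a true positive only when both
    annotators confirm it -- one conservative convention among several.
    """
    n = len(ann_a)
    agree = sum(a == b for a, b in zip(ann_a, ann_b))
    p_o = agree / n                      # observed agreement
    pa, pb = sum(ann_a) / n, sum(ann_b) / n
    p_e = pa * pb + (1 - pa) * (1 - pb)  # chance agreement from marginals
    kappa = (p_o - p_e) / (1 - p_e) if p_e < 1 else 1.0
    precision = sum(a and b for a, b in zip(ann_a, ann_b)) / n
    return precision, kappa
```

For example, if annotator A confirms 9 of 10 sampled images and annotator B confirms 8 (agreeing on 9 of the 10 labels), precision under this convention is 0.8 and kappa is about 0.62.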

Circularity Check

0 steps flagged

No circularity: empirical data curation with standard descriptive analyses

full rationale

The paper reports collection of Twitter posts via API, application of a multi-stage heuristic pipeline (text matching, badge verification, name variants), and then applies off-the-shelf tools (CLIP zero-shot taxonomy, OCR, face detection, semantic clustering) to characterize the resulting set. No equations, fitted parameters, predictions, or derivations appear. The curation pipeline is presented as a methodological choice rather than a result derived from prior outputs or self-citations. Claims rest on the released dataset and code, which are externally verifiable. No load-bearing self-citation chains or self-definitional reductions exist.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution is empirical data collection rather than theoretical modeling, so the ledger contains only domain assumptions about external services and no free parameters or invented entities.

axioms (2)
  • domain assumption Twitter API v2 returns accurate post metadata and media links for the queried period
    The collection step relies on this API without reported validation against ground truth.
  • ad hoc to paper The combination of text heuristics, badge detection, and model-name matching identifies GPT-image-2 images with high precision
    This is the core filter described in the abstract and is not independently verified in the provided text.

pith-pipeline@v0.9.0 · 5569 in / 1544 out tokens · 48301 ms · 2026-05-08T03:22:54.348250+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

20 extracted references · 6 canonical work pages · 1 internal anchor

  1. [1]

DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models,

    Z. J. Wang, E. Montoya, D. Munechika, H. Yang, B. Hoover, and D. H. Chau, “DiffusionDB: A large-scale prompt gallery dataset for text-to-image generative models,” in Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pp. 893–911, 2023

  2. [2]

    GenImage: A million-scale benchmark for detecting AI-generated image,

M. Zhu, H. Chen, Q. Yan, Z. Huang, W. Lin, Y. Gu, S. Zhao, W. Wang, M. Ye, H. Fan, et al., “GenImage: A million-scale benchmark for detecting AI-generated image,” in Advances in Neural Information Processing Systems, vol. 36, 2023

  3. [3]

LAION-5b: An open large-scale dataset for training next generation image-text models,

    C. Schuhmann, R. Beaumont, R. Vencu, C. Gordon, R. Wightman, M. Cherti, T. Coombes, A. Katta, C. Mullis, M. Wortsman, P. Schramowski, S. Kundurthy, K. Crowson, L. Schmidt, and J. Jitsev, “LAION-5b: An open large-scale dataset for training next generation image-text models,” Advances in Neural Information Processing Systems, vol. 35, pp. 25278–25294, 2022

  4. [4]

    GPT4o-Receipt: A dataset and human study for AI-generated document forensics,

Y. Zhang, S. Ren, A. Raj, E. Wei, D. Ng, A. Shen, J. Xu, Y. Zhang, and E. Marotta, “GPT4o-Receipt: A dataset and human study for AI-generated document forensics,” arXiv preprint, 2026

  5. [5]

    CNN-generated images are surprisingly easy to spot... for now,

S.-Y. Wang, O. Wang, R. Zhang, A. Owens, and A. A. Efros, “CNN-generated images are surprisingly easy to spot... for now,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8695–8704, 2020

  6. [6]

    Are GAN generated images easy to detect? A critical analysis of the state-of-the-art,

D. Gragnaniello, D. Cozzolino, F. Marra, G. Poggi, and L. Verdoliva, “Are GAN generated images easy to detect? A critical analysis of the state-of-the-art,” in IEEE International Conference on Multimedia and Expo, 2021

  7. [7]

How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study,

    S. Ren, Y. Zhou, X. Shen, K. Zewde, T. Duong, G. Huang, E. Wei, and J. Xue, “How well are open sourced AI-generated image detection models out-of-the-box: A comprehensive benchmark study,” arXiv preprint arXiv:2602.07814, 2026

  8. [8]

Artificial fingerprinting for generative models: Rooting deepfake attribution in training data,

    N. Yu, V. Skripniuk, S. Abdelnabi, and M. Fritz, “Artificial fingerprinting for generative models: Rooting deepfake attribution in training data,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 14448–14457, 2021

  9. [9]

Can multi-modal (reasoning) LLMs work as deepfake detectors?,

    S. Ren, Y. Yao, K. Zewde, Z. Liang, N.-Y. Cheng, X. Zhan, Q. Liu, Y. Chen, and H. Xu, “Can multi-modal (reasoning) LLMs work as deepfake detectors?,” arXiv preprint arXiv:2503.20084, 2025

  10. [10]

    C2PA technical specification, version 2.1,

    Coalition for Content Provenance and Authenticity, “C2PA technical specification, version 2.1,” tech. rep., C2PA, 2024

  11. [11]

    Do deepfake detectors work in reality?,

S. Ren, D. Patil, K. Zewde, T. D. Ng, H. Xu, S. Jiang, R. Desai, N.-Y. Cheng, Y. Zhou, and R. Muthukrishnan, “Do deepfake detectors work in reality?,” in Proceedings of the 4th Workshop on Security Implications of Deepfakes and Cheapfakes, 2025

  12. [12]

Synthetic politics: Prevalence, spreaders, and emotional reception of AI-generated political images on X,

    Y. Luo, F. Pierri, K. Sharma, J. Flamino, B. K. Szymanski, and E. Ferrara, “Synthetic politics: Prevalence, spreaders, and emotional reception of AI-generated political images on X,” arXiv preprint arXiv:2502.11248, 2025

  13. [13]

    Examining the prevalence and dynamics of AI-generated media in art subreddits,

P.-Y. Sha, K.-C. Lee, and D. Murthy, “Examining the prevalence and dynamics of AI-generated media in art subreddits,” arXiv preprint arXiv:2410.07302, 2024

  14. [14]

    AMMeBa: A large-scale survey and dataset of media-based misinformation in-the-wild,

C. Horton et al., “AMMeBa: A large-scale survey and dataset of media-based misinformation in-the-wild,” arXiv preprint arXiv:2405.11697, 2024

  15. [15]

    Twitter API v2 documentation: Recent search endpoint,

    X Corp., “Twitter API v2 documentation: Recent search endpoint,” 2024. Accessed April 2026

  16. [16]

    GPT-image-2: Our most capable image generation model,

    OpenAI, “GPT-image-2: Our most capable image generation model,” 2026. Accessed April 2026

  17. [17]

    Labels for AI-generated media on X,

    X Corp., “Labels for AI-generated media on X,” 2024. Accessed April 2026

  18. [18]

Learning transferable visual models from natural language supervision,

    A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clark, G. Krueger, and I. Sutskever, “Learning transferable visual models from natural language supervision,” in Proceedings of the 38th International Conference on Machine Learning, ICML, pp. 8748–8763, 2021

  19. [19]

    UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction

L. McInnes, J. Healy, and J. Melville, “UMAP: Uniform manifold approximation and projection for dimension reduction,” arXiv preprint arXiv:1802.03426, 2018

  20. [20]

Density-based clustering based on hierarchical density estimates,

    R. J. G. B. Campello, D. Moulavi, and J. Sander, “Density-based clustering based on hierarchical density estimates,” in Advances in Knowledge Discovery and Data Mining (PAKDD), pp. 160–172, 2013