
arxiv: 2606.00872 · v1 · pith:AS72SQJEnew · submitted 2026-05-30 · 💻 cs.CV

Images as Tables: In-Context Learning with TabPFN for Low-Data Detection of AI-Generated Images

Pith reviewed 2026-06-28 18:40 UTC · model grok-4.3

classification 💻 cs.CV
keywords AI-generated image detection · TabPFN · in-context learning · DINOv3 · low-data regime · image forensics · tabular models · transfer detection

The pith

Encoding images as PCA-reduced DINOv3 rows lets TabPFN classify real versus AI-generated images more accurately than task-specific detectors when labels are scarce.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes AI-generated image detection as a low-data tabular classification task. Each image is passed through a frozen DINOv3 backbone, its CLS token is projected to a 500-dimensional row via PCA, and TabPFN performs real/fake prediction by in-context inference on a small labeled context set rather than by training a new classifier. On the GenImage benchmark this approach beats the recent LATTE detector by as much as 8.2 percent in low-data regimes and improves generalization when the test generator differs from those seen in training, while LATTE remains stronger once large pooled labeled sets from all generators become available. The result positions tabular foundation models as a lightweight adaptation mechanism that shifts detector updates from gradient fine-tuning to simple context-set replacement.

Core claim

DINOv3-PCA-TabPFN outperforms LATTE by up to 8.2 percent in the low-data regime and in cross-generator transfer settings because real/fake classification is performed by TabPFN's in-context tabular inference over fixed 500-dimensional visual feature rows instead of task-specific classifier training.

What carries the argument

The image-to-table conversion that reduces frozen DINOv3 CLS features to 500-dimensional PCA rows and feeds them to TabPFN for in-context real/fake prediction.
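The image-to-table step described above can be sketched as follows. This is a minimal illustration, not the authors' code: the backbone is replaced by synthetic 768-dimensional vectors (the frozen feature size the paper quotes), PCA is implemented with a plain SVD, and the names `fit_pca`, `to_rows`, and `reference_features` are hypothetical. Which images the paper fits PCA on is not stated here, so the reference set below is an assumption; note that more than 500 reference vectors are needed to obtain 500 meaningful components.

```python
import numpy as np

def fit_pca(features, n_components):
    """Fit PCA via SVD and return (mean, components), one principal direction per row."""
    mean = features.mean(axis=0)
    # Rows of vt are principal directions, sorted by decreasing variance.
    _, _, vt = np.linalg.svd(features - mean, full_matrices=False)
    return mean, vt[:n_components]

def to_rows(features, mean, components):
    """Project feature vectors onto the principal directions: one table row per image."""
    return (features - mean) @ components.T

rng = np.random.default_rng(0)
# Synthetic stand-ins for frozen CLS features (assumption: no real backbone is run here).
reference_features = rng.normal(size=(600, 768))
new_features = rng.normal(size=(10, 768))

mean, components = fit_pca(reference_features, n_components=500)
table = to_rows(new_features, mean, components)
print(table.shape)  # (10, 500)
```

Each row of `table` is what a tabular classifier would receive as input; the real/fake label column comes from the small labeled context set.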

If this is right

  • Detector adaptation depends only on swapping the labeled context set rather than on gradient-based retraining.
  • Accuracy gains appear specifically when the number of labeled samples per generator is small.
  • Cross-generator generalization improves without any fine-tuning on the target generator's data.
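The first consequence, adaptation by swapping the context set, can be illustrated with a toy example. The sketch below uses a k-nearest-neighbour vote as a stand-in for TabPFN (an assumption made so the example runs with NumPy alone; it is not the paper's classifier), and the data are synthetic. With the `tabpfn` package, the analogous call would be `TabPFNClassifier().fit(context_x, context_y).predict(query_x)`. The function and variable names are hypothetical.

```python
import numpy as np

def predict_in_context(context_x, context_y, query_x, k=5):
    """Label each query row by majority vote over its k nearest context rows."""
    # Squared Euclidean distances, shape (n_query, n_context).
    d2 = ((query_x[:, None, :] - context_x[None, :, :]) ** 2).sum(axis=-1)
    nearest = np.argsort(d2, axis=1)[:, :k]
    votes = context_y[nearest]  # (n_query, k) labels in {0, 1}
    return (votes.mean(axis=1) > 0.5).astype(int)

rng = np.random.default_rng(0)

def make_set(n, fake_shift):
    """Synthetic rows: real ~ N(0, I), fake ~ N(fake_shift, I); label 1 means fake."""
    real = rng.normal(size=(n, 8))
    fake = rng.normal(size=(n, 8)) + fake_shift
    return np.vstack([real, fake]), np.array([0] * n + [1] * n)

shift_a = np.zeros(8)
shift_a[0] = 6.0  # toy "generator A" artifact lives in feature 0
shift_b = np.zeros(8)
shift_b[1] = 6.0  # toy "generator B" artifact lives in feature 1

context_a = make_set(25, shift_a)   # old context: labeled examples from generator A
context_b = make_set(25, shift_b)   # new context: labeled examples from generator B
query_x, query_y = make_set(200, shift_b)

# No parameters are updated between the two calls; only the context set changes.
acc_old = (predict_in_context(*context_a, query_x) == query_y).mean()
acc_new = (predict_in_context(*context_b, query_x) == query_y).mean()
print(f"context A: {acc_old:.2f}, context B: {acc_new:.2f}")
```

The point of the sketch is the call pattern: the predictor has no trained state, so replacing 50 labeled rows is the whole adaptation step.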

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same frozen-backbone-plus-tabular-inference pipeline could be tested on other visual detection tasks that face rapid distribution shift.
  • Tabular foundation models may reduce the labeled-data requirement for many forensics problems that currently rely on repeated model retraining.

Load-bearing premise

The 500-dimensional PCA projection of DINOv3 features retains enough structure to let TabPFN separate real from generated images across different generators.

What would settle it

A new generator where DINOv3-PCA-TabPFN accuracy with a small context set falls below a retrained baseline given identical labels.

Figures

Figures reproduced from arXiv: 2606.00872 by Jan Philip Walter, Margret Keuper, Shashank Agnihotri.

Figure 1
In the pooled generator setting, LATTE reaches the highest accuracy (real/fake class detection) at the largest training size, but DINOv3-PCA-TabPFN is stronger at the smaller shared training sizes, which is the regime targeted by our in-context detector adaptation.

… fail after the generator architecture, training data, sampling procedure, or post-processing pipeline changes. This is the central difficulty…
Figure 2
Representation and classifier comparison in the Multi-Multi development setting. DINOv3 features with TabPFN give the strongest and most stable performance across accuracy, precision, recall, F1, and ROC-AUC, while DINOv2, frequency features, and the MLP baseline are weaker in the low-data regime. All compared encoders and settings are explained in Appendix Section C.
Figure 3
Multi-Single accuracy by training size and test generator. Training uses all fake generators, while each panel evaluates one test generator, highlighting which generators remain difficult even after broad training coverage.

3. Method. Given an image x, DINOv3 produces a frozen feature vector $h(x) \in \mathbb{R}^{768}$. Images are loaded as RGB images, resized to a shorter edge of 256 pixels, and center-cropped to 224 × 224. …
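The preprocessing geometry quoted above (shorter edge resized to 256 pixels, then a 224 × 224 center crop) can be sketched in plain Python. This is an illustration under assumptions: the rounding rule and the hypothetical helper names `resize_shorter_edge` and `center_crop_box` are not taken from the paper, and image libraries differ in how they round the resized dimensions.

```python
def resize_shorter_edge(width, height, target=256):
    """Scale so the shorter edge equals `target`, preserving the aspect ratio."""
    scale = target / min(width, height)
    return round(width * scale), round(height * scale)

def center_crop_box(width, height, crop=224):
    """Return (left, top, right, bottom) of a centered crop x crop window."""
    left = (width - crop) // 2
    top = (height - crop) // 2
    return left, top, left + crop, top + crop

# Landscape example: a 640 x 480 input image.
w, h = resize_shorter_edge(640, 480)
box = center_crop_box(w, h)
print((w, h), box)  # (341, 256) (58, 16, 282, 240)
```

The crop discards the left and right margins of a landscape image, so content near the borders never reaches the backbone.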
Figure 4
Pairwise generator transfer. Left and middle: TabPFN accuracy matrices at k = 25 and k = 625. Right: TabPFN minus LATTE at k = 625, where positive values indicate a TabPFN advantage.

… do not close the gap. A small MLP trained on the same DINOv3 features becomes competitive with more data, but TabPFN is stronger in the small-context regime. This is the key methodological result: the gain is not simply “use a…
Figure 5
Multi-Multi context-size scaling for TabPFN. Accuracy increases steadily with more labeled context examples, but the main paper focuses on the direct comparison to LATTE rather than this standalone scaling curve.
Figure 6
Additional Multi-Multi comparison across model families. DINOv3 with TabPFN is the most stable option across the tested sample sizes, while DINOv2, frequency features, and the MLP baseline are less reliable in the smallest regimes.
Figure 7
Frequency-domain feature comparison. Frequency features improve with training size, but remain weaker than DINOv3 features with TabPFN, supporting learned visual embeddings rather than hand-crafted frequency statistics alone.
Figure 8
Global PCA projection of DINOv3 CLS features colored by generator. The plot shows that the frozen representation contains generator-dependent structure before the TabPFN decision step.
Figure 9
Generator-wise PCA facets for real and generated images. The facets make visible that some generators are more separated from real images than others, which explains part of the Multi-Single difficulty pattern.
Figure 10
Global PCA projection of real versus generated images. The overlap indicates that a simple linear projection is not enough for perfect separation, but the visible structure supports lightweight downstream classification.
Figure 11
PCA grid by generator. BigGAN and GLIDE show clearer real/fake separation, while ADM and Midjourney overlap more strongly with real images.
Figure 12
TabPFN Multi-Single accuracy by test generator. This compact view shows that generator difficulty varies substantially even when training uses all generators.
Figure 13
Multi-Single accuracy by training size and test generator. This plot complements the main-paper Multi-Single figure with the full per-test-generator training-size curves.
Figure 14
Multi-Single comparison across TabPFN input encodings. The results show that generator difficulty persists across encodings, but DINOv3 features remain the most reliable low-data representation.
Figure 15
Multi-Single comparison across model families. The plot compares the proposed image-to-table pipeline against alternative classifiers and shows that difficult generators remain challenging across model choices.
Figure 16
Multi-Single comparison to LATTE. The detector is trained with broad generator coverage and evaluated on one test generator at a time.
Figure 17
Single-Multi comparison to LATTE. The detector is trained on one fake generator and evaluated on the full generator set, making this a stricter cross-generator transfer setting.
Figure 18
Single-Multi comparison across model families. This setting is stricter than Multi-Single because only one fake generator is observed during training and the detector is evaluated on the full generator set.
Figure 19
Single-Multi baseline comparison. The plot provides the baseline trends for transfer from one observed fake generator to the full generator set.
Figure 20
Grouped TabPFN result summary across generator-aware settings. This figure provides a compact visual overview of the main TabPFN trends discussed in the paper and appendix.
Original abstract

AI-generated image detection is a moving-target problem: detectors trained on one generator often fail when a new generator appears, and only a few labeled examples are available. We study a simple image-to-table formulation for this regime, where each image is encoded by a frozen DINOv3 backbone, its CLS feature is reduced to a 500-dimensional structured row with PCA, and TabPFN performs real/fake classification by in-context tabular inference rather than task-specific classifier training. This turns fake-image detection into low-data structured prediction over learned visual features, making detector adaptation depend on the labeled context set instead of gradient-based fine-tuning. On GenImage, LATTE, a recent state-of-the-art detector, remains stronger when many labeled samples from all generators are available, by 7.4% in the largest pooled setting, but DINOv3-PCA-TabPFN is stronger in the practically important low-data regime, outperforming LATTE by up to 8.2%, and in transfer settings where the detector must generalize from one generator to another. These results position tabular foundation models as a strong complementary adaptation mechanism for image forensics, shifting adaptation from detector retraining to lightweight in-context updates with a small labeled set of examples. Code URL: https://github.com/jpwalter30/Towards-Generalizable-Detection-of-AI-Generated-Images

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that a simple image-to-table pipeline—encoding images via a frozen DINOv3 backbone, reducing the CLS token to a 500-dimensional vector with PCA, and performing real/fake classification with TabPFN via in-context tabular inference—outperforms the LATTE baseline by up to 8.2% in low-data regimes and in cross-generator transfer settings on GenImage, while trailing LATTE by 7.4% in the largest pooled high-data setting. Adaptation is achieved by supplying a small labeled context set rather than gradient-based fine-tuning.

Significance. If the empirical gains hold under proper statistical controls, the work establishes tabular foundation models as a viable complementary mechanism for low-data, generalizable AI-image detection. It shifts the adaptation burden from model retraining to lightweight in-context updates and supplies reproducible code, which strengthens the contribution for the forensics community.

major comments (3)
  1. [Abstract / experimental results] Abstract and experimental results section: the reported deltas (8.2% low-data, 7.4% pooled) are presented without error bars, exact per-regime sample counts, number of random seeds, or cross-validation details, preventing verification of whether the gains are statistically reliable or sensitive to the particular low-data splits.
  2. [Methods (PCA step)] Methods (PCA reduction step): no cumulative explained variance ratio is reported for the 500-component PCA of DINOv3 CLS features, nor is there an ablation on component count. Because the central claim rests on the 500-dimensional projection retaining sufficient generator-discriminative structure for TabPFN in-context inference (especially in transfer), the absence of this diagnostic directly undermines confidence in the 8.2% and transfer gains.
  3. [Experiments / ablation studies] Experimental protocol: the manuscript supplies no analysis of which variance directions are discarded by the fixed 500-component cutoff and whether those directions contain subtle generator artifacts critical for cross-generator generalization, leaving the weakest assumption untested.
minor comments (2)
  1. [Abstract] The abstract states 'up to 8.2%' without clarifying whether this is the maximum across all low-data regimes or a specific setting; a table or figure reference would improve clarity.
  2. [Experiments] The code URL is provided, which is a strength, but the manuscript does not state the exact TabPFN context-set sizes used in each low-data experiment.
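The diagnostic requested in major comment 2, the cumulative explained variance ratio at 500 components, is cheap to compute. The sketch below is a hedged illustration on synthetic data: the feature matrix is random with decaying per-axis variance, not real DINOv3 output, and `cumulative_explained_variance` is a hypothetical helper name.

```python
import numpy as np

def cumulative_explained_variance(features):
    """Cumulative fraction of total variance captured by the top-k principal components."""
    centered = features - features.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2  # proportional to per-component variance
    return np.cumsum(variances) / variances.sum()

rng = np.random.default_rng(0)
# Synthetic stand-in for 768-dim CLS features (assumption: variance decays along the axes).
scales = 1.0 / np.sqrt(np.arange(1, 769))
features = rng.normal(size=(1000, 768)) * scales

curve = cumulative_explained_variance(features)
retained_at_500 = curve[499]  # fraction of variance kept by the first 500 components
print(f"variance retained by 500 components: {retained_at_500:.3f}")
```

Reporting `curve` at several cutoffs would also cover the component-count ablation the referee asks for, although retained variance alone does not show whether the discarded directions carry generator artifacts.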

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and commit to revisions that add the requested statistical and methodological details.

Point-by-point responses
  1. Referee: [Abstract / experimental results] Abstract and experimental results section: the reported deltas (8.2% low-data, 7.4% pooled) are presented without error bars, exact per-regime sample counts, number of random seeds, or cross-validation details, preventing verification of whether the gains are statistically reliable or sensitive to the particular low-data splits.

    Authors: We agree that the current manuscript lacks error bars, exact sample counts per regime, number of random seeds, and cross-validation details. In the revision we will report means and standard deviations over at least five random seeds, specify the precise number of labeled examples in each low-data regime, document the splitting procedure, and include these statistics in both the abstract and experimental results section. revision: yes

  2. Referee: [Methods (PCA step)] Methods (PCA reduction step): no cumulative explained variance ratio is reported for the 500-component PCA of DINOv3 CLS features, nor is there an ablation on component count. Because the central claim rests on the 500-dimensional projection retaining sufficient generator-discriminative structure for TabPFN in-context inference (especially in transfer), the absence of this diagnostic directly undermines confidence in the 8.2% and transfer gains.

    Authors: We acknowledge that the manuscript neither reports the cumulative explained variance ratio for the 500-component PCA nor provides an ablation on the number of components. In the revised version we will add the cumulative explained variance for the chosen 500 components and include an ablation study varying the component count (e.g., 100, 300, 500, and the full 768 dimensions) with corresponding performance metrics on the low-data and transfer settings. revision: yes

  3. Referee: [Experiments / ablation studies] Experimental protocol: the manuscript supplies no analysis of which variance directions are discarded by the fixed 500-component cutoff and whether those directions contain subtle generator artifacts critical for cross-generator generalization, leaving the weakest assumption untested.

    Authors: We agree that the manuscript does not analyze the variance directions discarded by the 500-component cutoff. In the revision we will add an analysis comparing the discriminative power (e.g., via feature importance or transfer performance) of the retained versus discarded components to assess whether critical generator artifacts are lost. revision: yes
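The seed-level reporting promised in the first response can be sketched with the standard library alone. The accuracies below are hypothetical placeholders, not results from the paper, and the final comparison is a crude sanity check rather than a formal significance test.

```python
from statistics import mean, stdev

def summarize(accuracies):
    """Mean and sample standard deviation of per-seed accuracies."""
    return mean(accuracies), stdev(accuracies)

# Hypothetical per-seed accuracies for two detectors at one context size (not paper results).
detector_a = [0.81, 0.83, 0.80, 0.84, 0.82]
detector_b = [0.76, 0.79, 0.74, 0.78, 0.73]

mean_a, std_a = summarize(detector_a)
mean_b, std_b = summarize(detector_b)
gap = mean_a - mean_b

# Crude check: is the mean gap larger than the combined per-seed spread?
gap_exceeds_spread = gap > (std_a + std_b)
print(f"A: {mean_a:.3f} ± {std_a:.3f}, B: {mean_b:.3f} ± {std_b:.3f}, gap: {gap:.3f}")
```

With five seeds per setting, a reported delta such as the 8.2% figure could be read against its spread instead of as a single point estimate.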

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential reductions

Full rationale

The manuscript contains no equations, derivations, fitted parameters presented as predictions, or self-citation chains that bear the central claim. The method (DINOv3 CLS features + 500-dim PCA + TabPFN in-context classification) is described procedurally and evaluated via direct empirical comparison to the external baseline LATTE on GenImage. All reported gains (e.g., +8.2% in low-data regime) are experimental outcomes, not reductions to the inputs by construction. The PCA step is a fixed preprocessing choice whose adequacy is an empirical question, not a definitional loop. This is the normal case of a self-contained empirical paper.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method depends on the pre-trained DINOv3 and TabPFN models (treated as fixed external artifacts) and on the choice of 500 PCA dimensions; no new entities are postulated.

free parameters (1)
  • PCA target dimension
    Set to 500 to produce tabular rows compatible with TabPFN; value chosen to balance information retention and model input size.
axioms (1)
  • Domain assumption: DINOv3 CLS features contain information sufficient to distinguish real from generated images after linear PCA reduction.
    Invoked by the decision to use the frozen backbone and PCA step without further fine-tuning.


