pith. machine review for the scientific record.

arxiv: 2605.13352 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: unknown

GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

Andreas Hellander, Ekta Vats, Li Ju, Mayank Nautiyal, Prashant Singh

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords Riemannian flow matching · vision-language models · uncertainty quantification · hyperspherical embeddings · aleatoric uncertainty · epistemic uncertainty · post-hoc adapter · conditional entropy

The pith

A single masked velocity field on paired hyperspherical embeddings yields valid joint and conditional Riemannian flows for uncertainty in frozen vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GeoFlowVLM, a post-hoc adapter that trains one masked velocity field via Riemannian flow matching on the product of two unit hyperspheres to capture the joint distribution of paired image and text embeddings from frozen dual-encoder models. In the population limit, this single model yields valid velocity fields for the joint distribution and for both cross-modal conditional distributions. From it the authors derive a conditional entropy that quantifies aleatoric cross-modal ambiguity, and a marginal-typicality score that isolates epistemic uncertainty via an exact chain-rule decomposition of the joint negative log-likelihood. Empirically, the entropy tracks retrieval recall monotonically while the typicality score supports calibrated selective classification. A reader would care because standard ℓ₂-normalized embeddings supply no built-in measure of ambiguity or of training-distribution support, restricting reliable use in retrieval and zero-shot tasks.

Core claim

The central claim is that a single masked velocity field, trained by Riemannian flow matching on the joint distribution of paired ℓ₂-normalized embeddings on S^{d-1} × S^{d-1}, recovers in the population limit valid Riemannian flow-matching velocity fields on the joint domain and on each conditional domain. The resulting conditional entropy then supplies a decision-theoretically interpretable aleatoric score via a Fano-type bound, while the marginal-typicality score follows from an exact decomposition of the joint negative log-likelihood that isolates the structurally discriminative cross-modal pointwise mutual information term.
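In symbols (ours, reconstructed from the abstract's description rather than copied from the paper), the chain-rule decomposition behind the typicality score is the standard exact identity on the joint negative log-likelihood of a paired embedding (z_I, z_T):

```latex
-\log p(z_I, z_T)
  \;=\; \underbrace{-\log p(z_I) \;-\; \log p(z_T)}_{\text{marginal typicality (epistemic score)}}
  \;-\; \underbrace{\log \frac{p(z_I, z_T)}{p(z_I)\, p(z_T)}}_{\text{cross-modal PMI (discriminative, discarded)}}
```

The identity holds for any joint density, so the modeling question is only whether the single masked field estimates all three terms consistently.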

What carries the argument

A single masked velocity field trained via Riemannian flow matching on the product hypersphere
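This machinery can be made concrete. The sketch below (plain NumPy; all names are ours, and the paper's field additionally conditions on the second modality, two flow times, and a mask) computes the geodesic interpolant on one unit hypersphere and the closed-form target velocity that Riemannian flow matching regresses the network onto:

```python
import numpy as np

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between unit vectors on S^{d-1}."""
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if omega < 1e-8:                         # nearly identical points
        return x0
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

def log_map(p, q):
    """Riemannian log map on S^{d-1}: tangent vector at p pointing toward q."""
    w = q - np.dot(p, q) * p                 # project q to the tangent space at p
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(p)
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)) * w / nw

def rfm_target(x0, x1, t):
    """Interpolant z_t and the closed-form RFM regression target u_t(z_t | x1)."""
    zt = slerp(x0, x1, t)
    return zt, log_map(zt, x1) / (1.0 - t)   # constant-speed geodesic velocity

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x)
x0 = unit(rng.normal(size=8))                # noise point on the sphere
x1 = unit(rng.normal(size=8))                # stands in for a frozen VLM embedding
zt, ut = rfm_target(x0, x1, t=0.3)
# The interpolant stays on the sphere and the target lies in its tangent space,
# so a network trained on ||v_theta(z_t, t) - u_t||^2 learns a valid tangent field.
print(np.linalg.norm(zt), np.dot(zt, ut))
```

The same construction applied per factor of S^{d-1} × S^{d-1}, with the mask selecting which factor flows, is the joint training target the consistency claim is about.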

If this is right

  • The conditional entropy provides a calibrated measure of aleatoric ambiguity that can be used to rank retrieval candidates by expected information gain.
  • The marginal-typicality score isolates epistemic uncertainty after subtracting the cross-modal pointwise mutual information term, enabling selective prediction that improves accuracy.
  • The cross-modal pointwise mutual information term alone is empirically uninformative for either uncertainty type and should be discarded in downstream decisions.
  • Both derived scores can be computed from one forward pass of the trained velocity field without retraining the original vision-language encoders.
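The selective-prediction bullets have a simple operational form: sort test examples by the epistemic score, reject the most uncertain fraction, and measure accuracy on what remains. A toy sketch with synthetic correctness labels standing in for a frozen VLM's zero-shot predictions (all names ours):

```python
import numpy as np

def selective_accuracy(correct, uncertainty, rejected_fracs):
    """Accuracy on the retained subset after rejecting the most-uncertain fraction.

    correct: boolean array, per-example correctness of the base model's prediction
    uncertainty: per-example epistemic score (e.g. marginal typicality); higher = reject first
    """
    order = np.argsort(uncertainty)              # most confident first
    correct = np.asarray(correct)[order]
    n = len(correct)
    accs = []
    for f in rejected_fracs:
        keep = max(1, int(round(n * (1.0 - f))))
        accs.append(correct[:keep].mean())
    return np.array(accs)

# Toy check: if the score is informative (errors tend to get high scores),
# accuracy should rise as more examples are rejected.
rng = np.random.default_rng(1)
correct = rng.random(1000) < 0.8
uncertainty = np.where(correct, rng.normal(0.0, 1.0, 1000),
                                rng.normal(1.5, 1.0, 1000))
curve = selective_accuracy(correct, uncertainty, rejected_fracs=[0.0, 0.3, 0.6])
print(curve)   # rising curve for an informative score
```

This is the quantity plotted in the paper's accuracy-vs-rejected-fraction figures; a flat curve would mean the typicality score carries no epistemic signal.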

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masked-field construction could be tested on other product manifolds arising from paired embeddings in different modalities.
  • If the consistency result extends to finite samples, the method might serve as a lightweight calibration layer for any frozen dual-encoder system without requiring architectural changes.
  • Selective accuracy gains observed on zero-shot tasks suggest the typicality score could be combined with existing outlier detectors to reduce false positives in deployed vision-language pipelines.

Load-bearing premise

That training a single masked velocity field on finite paired data will produce conditional and marginal distributions whose derived entropy and typicality scores remain calibrated on real retrieval and classification benchmarks.

What would settle it

Evidence that the conditional retrieval entropy fails to track Recall@1 monotonically across the three reported retrieval benchmarks in either direction, or that marginal-typicality-based selective accuracy does no better than rejecting examples at random on the four zero-shot classification benchmarks.

Figures

Figures reproduced from arXiv: 2605.13352 by Andreas Hellander, Ekta Vats, Li Ju, Mayank Nautiyal, Prashant Singh.

Figure 1. Aleatoric uncertainty calibration. R@1 across uncertainty bins ordered from low to high uncertainty for T2I (top) and I2T (bottom); curves averaged over 5 random-seed runs with shaded ±1 standard deviation. GeoFlowVLM shows a stronger and more consistent monotonic decrease than all baselines across both retrieval directions and all datasets.

Figure 2. Epistemic component analysis. Selective prediction accuracy vs. rejected fraction; each curve uses one term from Eq. (5) as a standalone epistemic score. No single component dominates uniformly across datasets.

Figure 3. Epistemic uncertainty: selective prediction. Accuracy vs. rejected fraction on four zero-shot classification benchmarks.

Figure 4. GeoFlowVLM architecture. Image and text CLIP embeddings z_I and z_T are projected into a shared H-dimensional latent space. Per-stream conditioning vectors c_I, c_T, derived from flow times t_I, t_T and direction dir, are shared across eight DualStreamBlocks (DSBs), each refining both streams via AdaLN-Zero, gated cross-modal interaction, and SwiGLU. Stream decoders produce the final velocity v_t^φ(z_t).

Figure 5. Riemannian vs. Euclidean: aleatoric calibration. Recall@1 per entropy bin for T2I (top) and I2T (bottom) across MS-COCO, Flickr, and CUB. GeoFlowVLM produces a consistent monotone decrease; the Euclidean baseline shows no calibration signal.

Figure 6. Riemannian vs. Euclidean: epistemic uncertainty. Selective prediction accuracy across four zero-shot datasets; GeoFlowVLM consistently outperforms the Euclidean baseline.

Figure 7. Effect of ODE steps N on epistemic uncertainty. Selective prediction curves (mean ± std over 10 runs) across four zero-shot datasets. All values of N ∈ {10, 20, 30, 40, 50} produce nearly identical curves, confirming that N = 10 is sufficient.

Figure 8. Effect of Hutchinson probe count n_probe on epistemic uncertainty. Selective prediction curves (mean ± std over 10 runs) at fixed T = 10. All probe counts produce nearly identical curves, confirming that n_probe = 1 is sufficient.

Figure 9. Guidance scale ablation. Spearman ρ between per-query posterior entropy and retrieval difficulty as a function of λ, for I2T (left) and T2I (right).

Figure 10. Aleatoric uncertainty calibration under SigLIP backbone. R@1 across uncertainty bins ordered from low to high uncertainty for T2I (top) and I2T (bottom); curves show the mean over 5 independent runs. GeoFlowVLM shows a stronger and more consistent monotonic decrease than all baselines across both retrieval directions and all datasets.

Figure 11. Epistemic uncertainty: selective accuracy under SigLIP backbone.
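Figures 7-9 ablate the two knobs of the NLL evaluation: the number of probability-flow ODE integration steps and the number of Hutchinson probes used to estimate the divergence (Jacobian trace) of the velocity field. The estimator itself is a few lines; the sketch below (names ours) uses finite differences in place of autodiff and a linear field with known divergence as a sanity check:

```python
import numpy as np

def hutchinson_divergence(f, z, n_probe=1, eps=1e-4, rng=None):
    """Estimate div f(z) = tr(J_f(z)) with Rademacher probes: E[v^T J v] = tr(J).

    Each probe costs two field evaluations here (central finite difference);
    with autodiff it would be a single Jacobian-vector product.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(n_probe):
        v = rng.choice([-1.0, 1.0], size=z.shape)            # Rademacher probe
        jvp = (f(z + eps * v) - f(z - eps * v)) / (2 * eps)  # approx J_f(z) @ v
        total += v @ jvp
    return total / n_probe

# Sanity check on a field with known divergence: f(z) = A z has div f = tr(A).
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
z = rng.normal(size=6)
est = hutchinson_divergence(lambda x: A @ x, z, n_probe=1000, rng=rng)
print(est, np.trace(A))   # agree up to probe noise
```

In the paper's setting this trace term is integrated along the probability-flow ODE to obtain the joint and marginal NLLs; the ablations report that 10 steps and a single probe already suffice.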
Original abstract

Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through ℓ₂ normalization typically expose neither aleatoric uncertainty (cross-modal ambiguity) nor epistemic uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models' embeddings. We propose GeoFlowVLM as a post-hoc adapter that learns the joint distribution of paired ℓ₂-normalised dual-encoder VLM embeddings on the product hypersphere S^{d-1} × S^{d-1} via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes GeoFlowVLM, a post-hoc adapter for frozen dual-encoder VLMs. It models the joint distribution of paired ℓ₂-normalized embeddings on the product hypersphere S^{d-1} × S^{d-1} via Riemannian flow matching using a single masked velocity field. A population-limit consistency result is asserted, under which the trained network yields valid RFM velocity fields for the joint distribution and both cross-modal conditionals. Two derived quantities are introduced: a conditional retrieval entropy (aleatoric uncertainty, with Fano-type decision-theoretic interpretation) and a marginal-typicality epistemic score obtained from an exact chain-rule decomposition of the joint negative log-likelihood. Experiments report that the entropy tracks Recall@1 monotonically across three retrieval benchmarks and that the typicality score produces calibrated selective accuracy on four zero-shot classification benchmarks.

Significance. If the consistency result holds, the work supplies a unified geometry-respecting mechanism for extracting both aleatoric and epistemic uncertainty from a single post-hoc model without retraining the base VLM. The chain-rule isolation of the structurally discriminative PMI term and the empirical calibration on standard retrieval and selective-prediction tasks are concrete strengths that could improve uncertainty-aware multimodal pipelines.

major comments (2)
  1. [Abstract] Abstract: the population-limit consistency result is stated without derivation, key algebraic steps, or a numbered theorem. It is therefore impossible to verify whether the masking operation allows the joint RFM objective to factor exactly into conditional RFM objectives on each hypersphere (including tangent-space projection) or whether extra cross-terms vanish only under unstated assumptions on the embedding measure.
  2. [Method] Method section (consistency claim): the assertion that a single masked velocity field simultaneously satisfies the joint and both conditional Riemannian flow-matching objectives rests on an unshown decomposition. Without explicit verification that conditioning on one modality recovers the exact conditional velocity field, the central theoretical justification for deriving both uncertainty scores from one network remains load-bearing and unverifiable from the provided text.
minor comments (2)
  1. [Notation] Notation: the hypersphere is denoted S^{d-1} in the abstract but occasionally appears with inconsistent dimension subscripts in the text; a single global definition would improve readability.
  2. [Experiments] Figures: the calibration plots for Recall@1 and selective accuracy would benefit from explicit confidence bands or bootstrap intervals to allow visual assessment of monotonicity strength.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the population-limit consistency result is presented without sufficient detail for independent verification. We will revise the manuscript to include an explicit numbered theorem with the full algebraic derivation.

point-by-point responses
  1. Referee: [Abstract] Abstract: the population-limit consistency result is stated without derivation, key algebraic steps, or a numbered theorem. It is therefore impossible to verify whether the masking operation allows the joint RFM objective to factor exactly into conditional RFM objectives on each hypersphere (including tangent-space projection) or whether extra cross-terms vanish only under unstated assumptions on the embedding measure.

    Authors: We agree that the consistency result requires an explicit derivation. In the revised manuscript we will introduce a numbered theorem in the Method section that states the population-limit claim and provides the key algebraic steps. The proof will show that the single masked velocity field on the product hypersphere decomposes exactly into the joint RFM objective and the two conditional RFM objectives, with cross-terms vanishing because the masking isolates each modality and the product measure on S^{d-1} × S^{d-1} induces the required conditional independence in the population limit. Tangent-space projections will be handled explicitly via the standard Riemannian exponential and logarithmic maps on the hypersphere. revision: yes

  2. Referee: [Method] Method section (consistency claim): the assertion that a single masked velocity field simultaneously satisfies the joint and both conditional Riemannian flow-matching objectives rests on an unshown decomposition. Without explicit verification that conditioning on one modality recovers the exact conditional velocity field, the central theoretical justification for deriving both uncertainty scores from one network remains load-bearing and unverifiable from the provided text.

    Authors: We acknowledge that the current text asserts the decomposition without displaying the steps. The revision will contain a self-contained proof that, under the population limit, conditioning the masked velocity field on one modality recovers the exact conditional velocity field on the corresponding hypersphere. The argument relies only on the definition of Riemannian flow matching, the geometry of the product manifold, and the fact that the mask zeros the velocity components belonging to the unobserved modality; no additional assumptions on the embedding measure are required beyond those already stated for standard RFM on hyperspheres. revision: yes
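Both responses lean on standard sphere operations that are easy to state concretely. The sketch below (our own NumPy naming, not the authors' code) shows the exponential and logarithmic maps invoked in response 1 and one plausible convention for the modality mask in response 2; in this convention a zeroed mask entry freezes the conditioning modality, and the paper's exact masking convention may differ.

```python
import numpy as np

def exp_map(p, v):
    """Exponential map on S^{d-1}: follow tangent vector v from base point p."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p
    return np.cos(nv) * p + np.sin(nv) * v / nv

def log_map(p, q):
    """Logarithmic map on S^{d-1}: tangent vector at p pointing toward q."""
    w = q - np.dot(p, q) * p                  # project q to the tangent space at p
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(p)
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)) * w / nw

def masked_product_velocity(v_img, v_txt, z_img, z_txt, mask):
    """Velocity on S^{d-1} x S^{d-1} under a modality mask (convention ours).

    mask = (1, 1): joint flow, both components move.
    mask = (0, 1): image side is the condition and stays frozen; only the
                   text component flows, giving the conditional flow for z_T | z_I.
    """
    proj = lambda z, v: v - np.dot(z, v) * z  # tangent-space projection per factor
    return mask[0] * proj(z_img, v_img), mask[1] * proj(z_txt, v_txt)

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x)
p, q = unit(rng.normal(size=6)), unit(rng.normal(size=6))
r = exp_map(p, log_map(p, q))                 # round trip recovers q (away from antipodes)
vi, vt = masked_product_velocity(rng.normal(size=6), rng.normal(size=6), p, q, mask=(0, 1))
print(np.allclose(r, q), np.allclose(vi, 0.0), abs(np.dot(q, vt)) < 1e-10)
```

The round trip exp_p(log_p(q)) = q is what makes geodesic interpolants and tangent projections well defined away from antipodal pairs, which is the regime the promised proof would need to handle explicitly.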

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper asserts a consistency result in the population limit for the single masked velocity field yielding valid joint and conditional Riemannian flow-matching velocities, but this is presented as a derived mathematical property rather than a tautological redefinition or fitted input renamed as prediction. The marginal-typicality epistemic score follows from an exact chain-rule decomposition of the joint negative log-likelihood, which is a standard information-theoretic identity isolating the PMI term; this does not reduce the score to the model fit by construction. The conditional retrieval entropy is justified via an external Fano-type bound. No self-citations are load-bearing for the central claims, no ansatz is smuggled, and no uniqueness theorem is imported from prior author work. The derivation remains independent of the fitted values themselves.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions about normalized embeddings and the applicability of flow matching to Riemannian manifolds; the learned velocity field parameters are the primary fitted component.

free parameters (1)
  • masked velocity field parameters
    Learned during adapter training to approximate the joint distribution on the product hypersphere.
axioms (2)
  • domain assumption Dual-encoder VLM embeddings are ℓ₂-normalized and therefore lie on the unit hypersphere
    Standard property of models such as CLIP that the method takes as given.
  • standard math Riemannian flow matching defines valid velocity fields on the hypersphere and its product manifold
    Invoked to justify training and the consistency result.

pith-pipeline@v0.9.0 · 5608 in / 1477 out tokens · 56458 ms · 2026-05-14T20:40:39.196016+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  2. [2] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  3. [3] Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio De Rezende, Yannis Kalantidis, and Diane Larlus. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415–8424, 2021.
  4. [4] Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, and Zeynep Akata. ProbVLM: probabilistic adapter for frozen vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1899–1910, 2023.
  5. [5] Aishwarya Venkataramanan, Paul Bodesheim, and Joachim Denzler. Probabilistic embeddings for frozen vision-language models: uncertainty quantification with Gaussian process latent variable models. In Proceedings of the Forty-First Conference on Uncertainty in Artificial Intelligence, pages 4309–4328, 2025.
  6. [6] Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, and Martin Trapp. Post-hoc probabilistic vision-language models. arXiv preprint arXiv:2412.06014, 2024.
  7. [7] Li Ju, Max Andersson, Stina Fredriksson, Edward Glöckner, Andreas Hellander, Ekta Vats, and Prashant Singh. Exploiting the asymmetric uncertainty structure of pre-trained VLMs on the unit hypersphere. In 39th Conference on Neural Information Processing Systems (NeurIPS 2025), 2025.
  8. [8] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, 2016.
  9. [9] Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, and Prashant Singh. Epistemic uncertainty quantification for pre-trained VLMs via Riemannian flow matching. In International Conference on Machine Learning, 2026. URL https://arxiv.org/abs/2601.21662.
  10. [10] Sanghyuk Chun. Improved probabilistic image-text representations. In The Twelfth International Conference on Learning Representations, 2024.
  11. [11] Sanghyuk Chun, Wonjae Kim, Song Park, and Sangdoo Yun. Probabilistic language-image pre-training. In The Thirteenth International Conference on Learning Representations, 2025.
  12. [12] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 5580–5590, 2017.
  13. [13] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021.
  14. [14] Emile Mathieu and Maximilian Nickel. Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33:2503–2515, 2020.
  15. [15] Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and Arnaud Doucet. Riemannian score-based generative modelling. Advances in Neural Information Processing Systems, 35:2406–2422, 2022.
  16. [16] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  17. [17] Ricky T. Q. Chen and Yaron Lipman. Flow matching on general geometries. In The Twelfth International Conference on Learning Representations, 2024.
  18. [18] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, pages 1–34, 2024.
  19. [19] Junho Lee, Kwanseok Kim, and Joonseok Lee. Geometry-aware image flow matching. In International Conference on Machine Learning, Proceedings of Machine Learning Research, 2026.
  20. [20] Meir Yossef Levi and Guy Gilboa. The double-ellipsoid geometry of CLIP. In Forty-second International Conference on Machine Learning, pages 33999–34019, 2025.
  21. [21] Roy Betser, Meir Yossef Levi, and Guy Gilboa. Whitened CLIP as a likelihood surrogate of images and captions. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 4069–4095. PMLR, 2025.
  22. [22] Andrew Draganov, Sharvaree Vadgama, Sebastian Damrich, Jan Niklas Böhm, Lucas Maes, Dmitry Kobak, and Erik Bekkers. On the importance of embedding norms in self-supervised learning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research. PMLR, 2025.
  23. [23] Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y. Zou. Mind the gap: understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
  24. [24] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, ICML. PMLR, 2020.
  25. [25] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
  26. [26] Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.
  27. [27] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
  28. [28] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  29. [29] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  30. [30] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  31. [31] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014.
  32. [32] Alex Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
  33. [33] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: a large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in Neural Information Processing Systems, 32, 2019.
  34. [34] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  35. [35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  36. [36] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
  37. [37] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  38. [38] Scott Reed, Zeynep Akata, Bernt Schiele, and Honglak Lee. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  39. [39] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007.
  40. [40] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  41. [41] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 32–42, 2021.
  42. [42] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
  43. [43] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.