pith. machine review for the scientific record.

arxiv: 2605.13352 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: unknown

GeoFlowVLM: Geometry-Aware Joint Uncertainty for Frozen Vision-Language Embedding

Andreas Hellander, Ekta Vats, Li Ju, Mayank Nautiyal, Prashant Singh

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:40 UTC · model grok-4.3

classification 💻 cs.LG
keywords Riemannian flow matching · vision-language models · uncertainty quantification · hyperspherical embeddings · aleatoric uncertainty · epistemic uncertainty · post-hoc adapter · conditional entropy

The pith

A single masked velocity field on paired hyperspherical embeddings yields valid joint and conditional Riemannian flows for uncertainty in frozen vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes GeoFlowVLM, a post-hoc adapter that trains one masked velocity field via Riemannian flow matching on the product of two unit hyperspheres to capture the joint distribution of paired image and text embeddings from frozen dual-encoder models. In the population limit, this single model yields valid velocity fields for the joint distribution and for both cross-modal conditional distributions. From it the authors derive a conditional entropy that quantifies aleatoric cross-modal ambiguity, and a marginal-typicality score that isolates epistemic uncertainty via an exact chain-rule decomposition of the joint negative log-likelihood. Empirically, the entropy tracks retrieval recall monotonically while the typicality score supports calibrated selective classification. A reader would care because standard ℓ₂-normalized embeddings supply no built-in measure of ambiguity or of training-distribution support, restricting reliable use in retrieval and zero-shot tasks.

Core claim

The central claim is that a single masked velocity field, trained by Riemannian flow matching on the joint distribution of paired ℓ₂-normalized embeddings on S^{d-1} × S^{d-1}, recovers in the population limit valid Riemannian flow-matching velocity fields on the joint domain and on each conditional domain. The resulting conditional entropy then supplies a decision-theoretically interpretable aleatoric score via a Fano-type bound, while the marginal-typicality score follows from an exact decomposition of the joint negative log-likelihood that isolates the structurally discriminative cross-modal pointwise mutual information term.
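In symbols (ours, reconstructed from the abstract's description rather than copied from the paper), the chain-rule decomposition behind the typicality score is the standard exact identity on the joint negative log-likelihood of a paired embedding (z_I, z_T):

```latex
-\log p(z_I, z_T)
  \;=\; \underbrace{-\log p(z_I) \;-\; \log p(z_T)}_{\text{marginal typicality (epistemic score)}}
  \;-\; \underbrace{\log \frac{p(z_I, z_T)}{p(z_I)\, p(z_T)}}_{\text{cross-modal PMI (discriminative, discarded)}}
```

The identity holds for any joint density, so the modeling question is only whether the single masked field estimates all three terms consistently.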

What carries the argument

A single masked velocity field trained via Riemannian flow matching on the product hypersphere
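This machinery can be made concrete. The sketch below (plain NumPy; all names are ours, and the paper's field additionally conditions on the second modality, two flow times, and a mask) computes the geodesic interpolant on one unit hypersphere and the closed-form target velocity that Riemannian flow matching regresses the network onto:

```python
import numpy as np

def slerp(x0, x1, t):
    """Geodesic (great-circle) interpolation between unit vectors on S^{d-1}."""
    omega = np.arccos(np.clip(np.dot(x0, x1), -1.0, 1.0))
    if omega < 1e-8:                         # nearly identical points
        return x0
    return (np.sin((1 - t) * omega) * x0 + np.sin(t * omega) * x1) / np.sin(omega)

def log_map(p, q):
    """Riemannian log map on S^{d-1}: tangent vector at p pointing toward q."""
    w = q - np.dot(p, q) * p                 # project q to the tangent space at p
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(p)
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)) * w / nw

def rfm_target(x0, x1, t):
    """Interpolant z_t and the closed-form RFM regression target u_t(z_t | x1)."""
    zt = slerp(x0, x1, t)
    return zt, log_map(zt, x1) / (1.0 - t)   # constant-speed geodesic velocity

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x)
x0 = unit(rng.normal(size=8))                # noise point on the sphere
x1 = unit(rng.normal(size=8))                # stands in for a frozen VLM embedding
zt, ut = rfm_target(x0, x1, t=0.3)
# The interpolant stays on the sphere and the target lies in its tangent space,
# so a network trained on ||v_theta(z_t, t) - u_t||^2 learns a valid tangent field.
print(np.linalg.norm(zt), np.dot(zt, ut))
```

The same construction applied per factor of S^{d-1} × S^{d-1}, with the mask selecting which factor flows, is the joint training target the consistency claim is about.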

If this is right

  • The conditional entropy provides a calibrated measure of aleatoric ambiguity that can be used to rank retrieval candidates by expected information gain.
  • The marginal-typicality score isolates epistemic uncertainty after subtracting the cross-modal pointwise mutual information term, enabling selective prediction that improves accuracy.
  • The cross-modal pointwise mutual information term alone is empirically uninformative for either uncertainty type and should be discarded in downstream decisions.
  • Both derived scores can be computed from one forward pass of the trained velocity field without retraining the original vision-language encoders.
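The selective-prediction bullets have a simple operational form: sort test examples by the epistemic score, reject the most uncertain fraction, and measure accuracy on what remains. A toy sketch with synthetic correctness labels standing in for a frozen VLM's zero-shot predictions (all names ours):

```python
import numpy as np

def selective_accuracy(correct, uncertainty, rejected_fracs):
    """Accuracy on the retained subset after rejecting the most-uncertain fraction.

    correct: boolean array, per-example correctness of the base model's prediction
    uncertainty: per-example epistemic score (e.g. marginal typicality); higher = reject first
    """
    order = np.argsort(uncertainty)              # most confident first
    correct = np.asarray(correct)[order]
    n = len(correct)
    accs = []
    for f in rejected_fracs:
        keep = max(1, int(round(n * (1.0 - f))))
        accs.append(correct[:keep].mean())
    return np.array(accs)

# Toy check: if the score is informative (errors tend to get high scores),
# accuracy should rise as more examples are rejected.
rng = np.random.default_rng(1)
correct = rng.random(1000) < 0.8
uncertainty = np.where(correct, rng.normal(0.0, 1.0, 1000),
                                rng.normal(1.5, 1.0, 1000))
curve = selective_accuracy(correct, uncertainty, rejected_fracs=[0.0, 0.3, 0.6])
print(curve)   # rising curve for an informative score
```

This is the quantity plotted in the paper's accuracy-vs-rejected-fraction figures; a flat curve would mean the typicality score carries no epistemic signal.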

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same masked-field construction could be tested on other product manifolds arising from paired embeddings in different modalities.
  • If the consistency result extends to finite samples, the method might serve as a lightweight calibration layer for any frozen dual-encoder system without requiring architectural changes.
  • Selective accuracy gains observed on zero-shot tasks suggest the typicality score could be combined with existing outlier detectors to reduce false positives in deployed vision-language pipelines.

Load-bearing premise

That training a single masked velocity field on finite paired data will produce conditional and marginal distributions whose derived entropy and typicality scores remain calibrated on real retrieval and classification benchmarks.

What would settle it

Evidence that the conditional retrieval entropy fails to track Recall@1 monotonically across the three reported retrieval benchmarks in either direction, or that marginal-typicality-based selective accuracy does no better than rejecting examples at random on the four zero-shot classification benchmarks.

Figures

Figures reproduced from arXiv: 2605.13352 by Andreas Hellander, Ekta Vats, Li Ju, Mayank Nautiyal, Prashant Singh.

Figure 1. Aleatoric uncertainty calibration. R@1 across uncertainty bins ordered from low to high uncertainty for T2I (top) and I2T (bottom); curves averaged over 5 random-seed runs with shaded ±1 standard deviation. GeoFlowVLM shows a stronger and more consistent monotonic decrease than all baselines across both retrieval directions and all datasets.

Figure 2. Epistemic component analysis. Selective prediction accuracy vs. rejected fraction; each curve uses one term from Eq. (5) as a standalone epistemic score. No single component dominates uniformly across datasets.

Figure 3. Epistemic uncertainty: selective prediction. Accuracy vs. rejected fraction on four zero-shot classification benchmarks.

Figure 4. GeoFlowVLM architecture. Image and text CLIP embeddings z_I and z_T are projected into a shared H-dimensional latent space. Per-stream conditioning vectors c_I, c_T, derived from flow times t_I, t_T and direction dir, are shared across eight DualStreamBlocks (DSBs), each refining both streams via AdaLN-Zero, gated cross-modal interaction, and SwiGLU. Stream decoders produce the final velocity v_t^φ(z_t).

Figure 5. Riemannian vs. Euclidean: aleatoric calibration. Recall@1 per entropy bin for T2I (top) and I2T (bottom) across MS-COCO, Flickr, and CUB. GeoFlowVLM produces a consistent monotone decrease; the Euclidean baseline shows no calibration signal.

Figure 6. Riemannian vs. Euclidean: epistemic uncertainty. Selective prediction accuracy across four zero-shot datasets; GeoFlowVLM consistently outperforms the Euclidean baseline.

Figure 7. Effect of ODE steps N on epistemic uncertainty. Selective prediction curves (mean ± std over 10 runs) across four zero-shot datasets. All values of N ∈ {10, 20, 30, 40, 50} produce nearly identical curves, confirming that N = 10 is sufficient.

Figure 8. Effect of Hutchinson probe count n_probe on epistemic uncertainty. Selective prediction curves (mean ± std over 10 runs) at fixed T = 10. All probe counts produce nearly identical curves, confirming that n_probe = 1 is sufficient.

Figure 9. Guidance scale ablation. Spearman ρ between per-query posterior entropy and retrieval difficulty as a function of λ, for I2T (left) and T2I (right).

Figure 10. Aleatoric uncertainty calibration under SigLIP backbone. R@1 across uncertainty bins ordered from low to high uncertainty for T2I (top) and I2T (bottom); curves show the mean over 5 independent runs. GeoFlowVLM shows a stronger and more consistent monotonic decrease than all baselines across both retrieval directions and all datasets.

Figure 11. Epistemic uncertainty: selective accuracy under SigLIP backbone.
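Figures 7-9 ablate the two knobs of the NLL evaluation: the number of probability-flow ODE integration steps and the number of Hutchinson probes used to estimate the divergence (Jacobian trace) of the velocity field. The estimator itself is a few lines; the sketch below (names ours) uses finite differences in place of autodiff and a linear field with known divergence as a sanity check:

```python
import numpy as np

def hutchinson_divergence(f, z, n_probe=1, eps=1e-4, rng=None):
    """Estimate div f(z) = tr(J_f(z)) with Rademacher probes: E[v^T J v] = tr(J).

    Each probe costs two field evaluations here (central finite difference);
    with autodiff it would be a single Jacobian-vector product.
    """
    if rng is None:
        rng = np.random.default_rng(0)
    total = 0.0
    for _ in range(n_probe):
        v = rng.choice([-1.0, 1.0], size=z.shape)            # Rademacher probe
        jvp = (f(z + eps * v) - f(z - eps * v)) / (2 * eps)  # approx J_f(z) @ v
        total += v @ jvp
    return total / n_probe

# Sanity check on a field with known divergence: f(z) = A z has div f = tr(A).
rng = np.random.default_rng(1)
A = rng.normal(size=(6, 6))
z = rng.normal(size=6)
est = hutchinson_divergence(lambda x: A @ x, z, n_probe=1000, rng=rng)
print(est, np.trace(A))   # agree up to probe noise
```

In the paper's setting this trace term is integrated along the probability-flow ODE to obtain the joint and marginal NLLs; the ablations report that 10 steps and a single probe already suffice.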
Original abstract

Standard dual-encoder vision-language models that map images and text to deterministic points on a shared unit hypersphere through ℓ₂ normalization typically expose neither aleatoric uncertainty (cross-modal ambiguity) nor epistemic uncertainty (lack of training-distribution support). Existing post-hoc methods either recover at most one of the two uncertainty components, or ignore the hyperspherical geometry of these models' embeddings. We propose GeoFlowVLM as a post-hoc adapter that learns the joint distribution of paired ℓ₂-normalised dual-encoder VLM embeddings on the product hypersphere S^{d-1} × S^{d-1} via Riemannian flow matching with a single masked velocity field. A consistency result shows that, in the population limit, the trained network exposes the joint flow and both cross-modal conditional flows as valid Riemannian flow-matching velocity fields on their respective domains. We derive two quantities from this single model: a conditional retrieval entropy that quantifies aleatoric ambiguity with a decision-theoretic interpretation via a Fano-type bound, and a marginal-typicality epistemic score justified by an exact chain-rule decomposition of the joint NLL. This decomposition isolates a cross-modal pointwise-mutual-information term that is structurally discriminative rather than epistemic, and is empirically the only consistently uninformative standalone component. Empirically, the entropy tracks Recall@1 with near-ideal monotonic calibration across three retrieval benchmarks in both directions, and the marginal-typicality sum yields consistently calibrated selective accuracy across four zero-shot classification benchmarks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes GeoFlowVLM, a post-hoc adapter for frozen dual-encoder VLMs. It models the joint distribution of paired ℓ₂-normalized embeddings on the product hypersphere S^{d-1} × S^{d-1} via Riemannian flow matching using a single masked velocity field. A population-limit consistency result is asserted, under which the trained network yields valid RFM velocity fields for the joint distribution and both cross-modal conditionals. Two derived quantities are introduced: a conditional retrieval entropy (aleatoric uncertainty, with Fano-type decision-theoretic interpretation) and a marginal-typicality epistemic score obtained from an exact chain-rule decomposition of the joint negative log-likelihood. Experiments report that the entropy tracks Recall@1 monotonically across three retrieval benchmarks and that the typicality score produces calibrated selective accuracy on four zero-shot classification benchmarks.

Significance. If the consistency result holds, the work supplies a unified geometry-respecting mechanism for extracting both aleatoric and epistemic uncertainty from a single post-hoc model without retraining the base VLM. The chain-rule isolation of the structurally discriminative PMI term and the empirical calibration on standard retrieval and selective-prediction tasks are concrete strengths that could improve uncertainty-aware multimodal pipelines.

major comments (2)
  1. [Abstract] Abstract: the population-limit consistency result is stated without derivation, key algebraic steps, or a numbered theorem. It is therefore impossible to verify whether the masking operation allows the joint RFM objective to factor exactly into conditional RFM objectives on each hypersphere (including tangent-space projection) or whether extra cross-terms vanish only under unstated assumptions on the embedding measure.
  2. [Method] Method section (consistency claim): the assertion that a single masked velocity field simultaneously satisfies the joint and both conditional Riemannian flow-matching objectives rests on an unshown decomposition. Without explicit verification that conditioning on one modality recovers the exact conditional velocity field, the central theoretical justification for deriving both uncertainty scores from one network remains load-bearing and unverifiable from the provided text.
minor comments (2)
  1. [Notation] Notation: the hypersphere is denoted S^{d-1} in the abstract but occasionally appears with inconsistent dimension subscripts in the text; a single global definition would improve readability.
  2. [Experiments] Figures: the calibration plots for Recall@1 and selective accuracy would benefit from explicit confidence bands or bootstrap intervals to allow visual assessment of monotonicity strength.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful and constructive review. The comments correctly identify that the population-limit consistency result is presented without sufficient detail for independent verification. We will revise the manuscript to include an explicit numbered theorem with the full algebraic derivation.

point-by-point responses
  1. Referee: [Abstract] Abstract: the population-limit consistency result is stated without derivation, key algebraic steps, or a numbered theorem. It is therefore impossible to verify whether the masking operation allows the joint RFM objective to factor exactly into conditional RFM objectives on each hypersphere (including tangent-space projection) or whether extra cross-terms vanish only under unstated assumptions on the embedding measure.

    Authors: We agree that the consistency result requires an explicit derivation. In the revised manuscript we will introduce a numbered theorem in the Method section that states the population-limit claim and provides the key algebraic steps. The proof will show that the single masked velocity field on the product hypersphere decomposes exactly into the joint RFM objective and the two conditional RFM objectives, with cross-terms vanishing because the masking isolates each modality and the product measure on S^{d-1} × S^{d-1} induces the required conditional independence in the population limit. Tangent-space projections will be handled explicitly via the standard Riemannian exponential and logarithmic maps on the hypersphere. revision: yes

  2. Referee: [Method] Method section (consistency claim): the assertion that a single masked velocity field simultaneously satisfies the joint and both conditional Riemannian flow-matching objectives rests on an unshown decomposition. Without explicit verification that conditioning on one modality recovers the exact conditional velocity field, the central theoretical justification for deriving both uncertainty scores from one network remains load-bearing and unverifiable from the provided text.

    Authors: We acknowledge that the current text asserts the decomposition without displaying the steps. The revision will contain a self-contained proof that, under the population limit, conditioning the masked velocity field on one modality recovers the exact conditional velocity field on the corresponding hypersphere. The argument relies only on the definition of Riemannian flow matching, the geometry of the product manifold, and the fact that the mask zeros the velocity components belonging to the unobserved modality; no additional assumptions on the embedding measure are required beyond those already stated for standard RFM on hyperspheres. revision: yes
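Both responses lean on standard sphere operations that are easy to state concretely. The sketch below (our own NumPy naming, not the authors' code) shows the exponential and logarithmic maps invoked in response 1 and one plausible convention for the modality mask in response 2; in this convention a zeroed mask entry freezes the conditioning modality, and the paper's exact masking convention may differ.

```python
import numpy as np

def exp_map(p, v):
    """Exponential map on S^{d-1}: follow tangent vector v from base point p."""
    nv = np.linalg.norm(v)
    if nv < 1e-12:
        return p
    return np.cos(nv) * p + np.sin(nv) * v / nv

def log_map(p, q):
    """Logarithmic map on S^{d-1}: tangent vector at p pointing toward q."""
    w = q - np.dot(p, q) * p                  # project q to the tangent space at p
    nw = np.linalg.norm(w)
    if nw < 1e-12:
        return np.zeros_like(p)
    return np.arccos(np.clip(np.dot(p, q), -1.0, 1.0)) * w / nw

def masked_product_velocity(v_img, v_txt, z_img, z_txt, mask):
    """Velocity on S^{d-1} x S^{d-1} under a modality mask (convention ours).

    mask = (1, 1): joint flow, both components move.
    mask = (0, 1): image side is the condition and stays frozen; only the
                   text component flows, giving the conditional flow for z_T | z_I.
    """
    proj = lambda z, v: v - np.dot(z, v) * z  # tangent-space projection per factor
    return mask[0] * proj(z_img, v_img), mask[1] * proj(z_txt, v_txt)

rng = np.random.default_rng(0)
unit = lambda x: x / np.linalg.norm(x)
p, q = unit(rng.normal(size=6)), unit(rng.normal(size=6))
r = exp_map(p, log_map(p, q))                 # round trip recovers q (away from antipodes)
vi, vt = masked_product_velocity(rng.normal(size=6), rng.normal(size=6), p, q, mask=(0, 1))
print(np.allclose(r, q), np.allclose(vi, 0.0), abs(np.dot(q, vt)) < 1e-10)
```

The round trip exp_p(log_p(q)) = q is what makes geodesic interpolants and tangent projections well defined away from antipodal pairs, which is the regime the promised proof would need to handle explicitly.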

Circularity Check

0 steps flagged

Derivation chain is self-contained with no circular reductions

full rationale

The paper asserts a consistency result in the population limit for the single masked velocity field yielding valid joint and conditional Riemannian flow-matching velocities, but this is presented as a derived mathematical property rather than a tautological redefinition or fitted input renamed as prediction. The marginal-typicality epistemic score follows from an exact chain-rule decomposition of the joint negative log-likelihood, which is a standard information-theoretic identity isolating the PMI term; this does not reduce the score to the model fit by construction. The conditional retrieval entropy is justified via an external Fano-type bound. No self-citations are load-bearing for the central claims, no ansatz is smuggled, and no uniqueness theorem is imported from prior author work. The derivation remains independent of the fitted values themselves.

Axiom & Free-Parameter Ledger

1 free parameters · 2 axioms · 0 invented entities

The approach rests on standard assumptions about normalized embeddings and the applicability of flow matching to Riemannian manifolds; the learned velocity field parameters are the primary fitted component.

free parameters (1)
  • masked velocity field parameters
    Learned during adapter training to approximate the joint distribution on the product hypersphere.
axioms (2)
  • domain assumption Dual-encoder VLM embeddings are ℓ₂-normalized and therefore lie on the unit hypersphere
    Standard property of models such as CLIP that the method takes as given.
  • standard math Riemannian flow matching defines valid velocity fields on the hypersphere and its product manifold
    Invoked to justify training and the consistency result.

pith-pipeline@v0.9.0 · 5608 in / 1477 out tokens · 56458 ms · 2026-05-14T20:40:39.196016+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

44 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
  2. [2] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 11975–11986, 2023.
  3. [3] Sanghyuk Chun, Seong Joon Oh, Rafael Sampaio De Rezende, Yannis Kalantidis, and Diane Larlus. Probabilistic embeddings for cross-modal retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8415–8424, 2021.
  4. [4] Uddeshya Upadhyay, Shyamgopal Karthik, Massimiliano Mancini, and Zeynep Akata. ProbVLM: probabilistic adapter for frozen vision-language models. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 1899–1910, 2023.
  5. [5] Aishwarya Venkataramanan, Paul Bodesheim, and Joachim Denzler. Probabilistic embeddings for frozen vision-language models: uncertainty quantification with Gaussian process latent variable models. In Proceedings of the Forty-First Conference on Uncertainty in Artificial Intelligence, pages 4309–4328, 2025.
  6. [6] Anton Baumann, Rui Li, Marcus Klasson, Santeri Mentu, Shyamgopal Karthik, Zeynep Akata, Arno Solin, and Martin Trapp. Post-hoc probabilistic vision-language models. arXiv preprint arXiv:2412.06014, 2024.
  7. [7] Li Ju, Max Andersson, Stina Fredriksson, Edward Glöckner, Andreas Hellander, Ekta Vats, and Prashant Singh. Exploiting the asymmetric uncertainty structure of pre-trained VLMs on the unit hypersphere. In 39th Conference on Neural Information Processing Systems (NeurIPS 2025), 2025.
  8. [8] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: representing model uncertainty in deep learning. In Proceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, 2016.
  9. [9] Li Ju, Mayank Nautiyal, Andreas Hellander, Ekta Vats, and Prashant Singh. Epistemic uncertainty quantification for pre-trained VLMs via Riemannian flow matching. In International Conference on Machine Learning, 2026. URL https://arxiv.org/abs/2601.21662.
  10. [10] Sanghyuk Chun. Improved probabilistic image-text representations. In The Twelfth International Conference on Learning Representations, 2024.
  11. [11] Sanghyuk Chun, Wonjae Kim, Song Park, and Sangdoo Yun. Probabilistic language-image pre-training. In The Thirteenth International Conference on Learning Representations, 2025.
  12. [12] Alex Kendall and Yarin Gal. What uncertainties do we need in Bayesian deep learning for computer vision? In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS'17, pages 5580–5590, 2017.
  13. [13] Eyke Hüllermeier and Willem Waegeman. Aleatoric and epistemic uncertainty in machine learning: an introduction to concepts and methods. Machine Learning, 110(3):457–506, 2021.
  14. [14] Emile Mathieu and Maximilian Nickel. Riemannian continuous normalizing flows. Advances in Neural Information Processing Systems, 33:2503–2515, 2020.
  15. [15] Valentin De Bortoli, Emile Mathieu, Michael Hutchinson, James Thornton, Yee Whye Teh, and Arnaud Doucet. Riemannian score-based generative modelling. Advances in Neural Information Processing Systems, 35:2406–2422, 2022.
  16. [16] Yaron Lipman, Ricky T. Q. Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In The Eleventh International Conference on Learning Representations (ICLR), 2023.
  17. [17] Ricky T. Q. Chen and Yaron Lipman. Flow matching on general geometries. In The Twelfth International Conference on Learning Representations, 2024.
  18. [18] Alexander Tong, Kilian Fatras, Nikolay Malkin, Guillaume Huguet, Yanlei Zhang, Jarrid Rector-Brooks, Guy Wolf, and Yoshua Bengio. Improving and generalizing flow-based generative models with minibatch optimal transport. Transactions on Machine Learning Research, pages 1–34, 2024.
  19. [19] Junho Lee, Kwanseok Kim, and Joonseok Lee. Geometry-aware image flow matching. In International Conference on Machine Learning, Proceedings of Machine Learning Research, 2026.
  20. [20] Meir Yossef Levi and Guy Gilboa. The double-ellipsoid geometry of CLIP. In Forty-second International Conference on Machine Learning, pages 33999–34019, 2025.
  21. [21] Roy Betser, Meir Yossef Levi, and Guy Gilboa. Whitened CLIP as a likelihood surrogate of images and captions. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research, pages 4069–4095. PMLR, 2025.
  22. [22] Andrew Draganov, Sharvaree Vadgama, Sebastian Damrich, Jan Niklas Böhm, Lucas Maes, Dmitry Kobak, and Erik Bekkers. On the importance of embedding norms in self-supervised learning. In Proceedings of the 42nd International Conference on Machine Learning, volume 267 of Proceedings of Machine Learning Research. PMLR, 2025.
  23. [23] Victor Weixin Liang, Yuhui Zhang, Yongchan Kwon, Serena Yeung, and James Y. Zou. Mind the gap: understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems, 35:17612–17625, 2022.
  24. [24] Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In Proceedings of the 37th International Conference on Machine Learning, ICML. PMLR, 2020.
  25. [25] Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. EVA-CLIP: improved training techniques for CLIP at scale. arXiv preprint arXiv:2303.15389, 2023.
  26. [26] Marco Cuturi. Sinkhorn distances: lightspeed computation of optimal transport. Advances in Neural Information Processing Systems, 26, 2013.
  27. [27] Thomas M. Cover and Joy A. Thomas. Elements of Information Theory (Wiley Series in Telecommunications and Signal Processing). Wiley-Interscience, 2006.
  28. [28] Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuhmann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scaling laws for contrastive language-image learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023.
  29. [29] Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR, 2021.
  30. [30] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. ImageNet: a large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009.
  31. [31] Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101: mining discriminative components with random forests. In European Conference on Computer Vision, pages 446–461. Springer, 2014.
  32. [32] Alex Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, University of Toronto, 2009.
  33. [33] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Christopher Wang, Dan Gutfreund, Josh Tenenbaum, and Boris Katz. ObjectNet: a large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in Neural Information Processing Systems, 32, 2019.
  34. [34] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual Captions: a cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018.
  35. [35] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. Microsoft COCO: common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.
  36. [36] Bryan A. Plummer, Liwei Wang, Chris M. Cervantes, Juan C. Caicedo, Julia Hockenmaier, and Svetlana Lazebnik. Flickr30k Entities: collecting region-to-phrase correspondences for richer image-to-sentence models. In Proceedings of the IEEE International Conference on Computer Vision, pages 2641–2649, 2015.
  37. [37] Catherine Wah, Steve Branson, Peter Welinder, Pietro Perona, and Serge Belongie. The Caltech-UCSD Birds-200-2011 dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
  38. [38] Scott Reed, Zeynep Akata, Bernt Schiele, and Honglak Lee. Learning deep representations of fine-grained visual descriptions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016.
  39. [39] Ali Rahimi and Benjamin Recht. Random features for large-scale kernel machines. Advances in Neural Information Processing Systems, 20, 2007.
  40. [40] William Peebles and Saining Xie. Scalable diffusion models with transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4195–4205, 2023.
  41. [41] Hugo Touvron, Matthieu Cord, Alexandre Sablayrolles, Gabriel Synnaeve, and Hervé Jégou. Going deeper with image transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 32–42, 2021.
  42. [42] Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020. URL https://arxiv.org/abs/2002.05202.
  43. [43] Jonathan Ho and Tim Salimans. Classifier-free diffusion guidance. In NeurIPS 2021 Workshop on Deep Generative Models and Downstream Applications, 2021.