pith. machine review for the scientific record.

arxiv: 2605.12872 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: unknown

SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-14 20:28 UTC · model grok-4.3

classification 💻 cs.LG
keywords multimodal learning · submodular optimization · data efficiency · modality alignment · zero-shot classification · mutual information · CLIP benchmark · low-data learning

The pith

SMA aligns images and text by optimizing submodular mutual information over sets of descriptions rather than individual pairs, enabling strong zero-shot performance with only tens of thousands of samples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard multimodal models maximize correlation between single image-text pairs, which overlooks geometric structure across modalities and demands enormous paired datasets. This paper reframes alignment as a combinatorial set problem and introduces the Submodular Modality Aligner (SMA) that applies submodular mutual information to multiple augmentations and descriptions of the same entity. The objective simultaneously increases inter-modality information and reduces cross-modal divergence, allowing the model to extract far more signal from limited data. On 14 zero-shot classification and retrieval tasks from the CLIP benchmark, SMA delivers consistent gains in the low-data regime. The result is multimodal generalization using orders of magnitude fewer samples than conventional pairwise approaches.
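The page does not reproduce the objective itself, but the submodular mutual information the method builds on has a standard form in the combinatorial literature it draws from: for a normalized, non-negative submodular function f evaluated on embedded items, a set A of augmentations of an image, and a set B of descriptions of the same entity,

$$
I_f(A;B) \;=\; f(A) + f(B) - f(A \cup B),
$$

which is non-negative by submodularity and grows as the two sets carry shared structure under f. Which f SMA actually instantiates (for example graph cut or facility location over cosine similarities) is not stated on this page and is treated as an assumption in the sketches below.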

Core claim

By treating multiple augmentations and descriptions of an entity as sets and optimizing a submodular mutual information objective, SMA jointly maximizes cross-modal mutual information while reducing modality gap, enabling data-efficient multimodal learning that achieves strong generalization on zero-shot tasks with only tens of thousands of paired samples.

What carries the argument

Submodular Modality Aligner (SMA) using Submodular Mutual Information (SMI) on sets of cross-modal descriptions to capture richer structure beyond pairwise correlations.
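To make the set-based objective concrete, here is a minimal sketch of an SMA-style loss, assuming a graph-cut instantiation of SMI over cosine similarities; the function names, the graph-cut choice, the modality-gap penalty, and the default weights are illustrative assumptions, not the authors' implementation.

```python
# Hypothetical sketch of a set-based alignment loss in the spirit of SMA.
# Not the authors' code: the graph-cut SMI instantiation, the modality-gap
# penalty, and all names below are assumptions made for illustration.
import torch
import torch.nn.functional as F

def graph_cut_smi(img_set: torch.Tensor, txt_set: torch.Tensor) -> torch.Tensor:
    """SMI-style score between an image set (k, d) and a text set (m, d).

    For the graph-cut family, I_f(A; B) reduces (up to constants) to the total
    cross-similarity between the two sets, so maximizing it pulls the whole
    augmentation set toward the whole description set, not a single pair.
    """
    img_set = F.normalize(img_set, dim=-1)
    txt_set = F.normalize(txt_set, dim=-1)
    return (img_set @ txt_set.T).sum()          # sum of pairwise cosine similarities

def sma_style_loss(img_set: torch.Tensor, txt_set: torch.Tensor,
                   lam: float = 1.0, mu: float = 0.5) -> torch.Tensor:
    """Maximize cross-modal SMI while shrinking the gap between set means."""
    smi = graph_cut_smi(img_set, txt_set)
    gap = (F.normalize(img_set, dim=-1).mean(0)
           - F.normalize(txt_set, dim=-1).mean(0)).norm()
    return -lam * smi + mu * gap                # to be minimized by the training loop
```

The difference from a pairwise contrastive loss is that the score aggregates over every augmentation-description pair in the entity's sets before any gradient step, which is where the claimed extra signal per sample would come from.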

If this is right

  • SMA achieves strong multimodal generalization using only tens of thousands of samples on CLIP benchmark tasks.
  • Consistent performance gains appear across 14 zero-shot classification and retrieval tasks in low-data regimes.
  • The approach makes multimodal foundation models practical in settings where aligned data is scarce or expensive.
  • Set-based combinatorial objectives extract more information from each sample than instance-level pairwise learning.
  • The method reduces reliance on massive paired datasets for modality alignment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same set-based SMI objective could be applied to other scarce-data multimodal problems such as video-text or audio-text alignment.
  • SMA's reduced data requirement may lower the cost of adapting foundation models to new domains or languages.
  • Combining SMA with parameter-efficient fine-tuning techniques could further shrink the data needed for competitive performance.
  • The combinatorial view suggests rethinking other contrastive objectives in vision-language models as set functions.

Load-bearing premise

The set-based submodular mutual information formulation captures richer cross-modal geometric structure without introducing new biases or needing heavy post-hoc tuning.

What would settle it

Training SMA and a standard pairwise baseline on the same 50,000-sample multimodal subset and finding no statistically significant gain for SMA on downstream zero-shot tasks would refute the claimed advantage.
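One way such a head-to-head could be scored, sketched under the assumption that per-task accuracies are available for both methods; the paired Wilcoxon test here is an illustrative choice, not a protocol from the paper.

```python
# Hypothetical scoring of the head-to-head described above: SMA vs. a standard
# pairwise baseline, both trained on the same 50,000-sample subset, compared
# across the 14 downstream tasks. The paired Wilcoxon test is an assumption.
from scipy.stats import wilcoxon

def sma_advantage_significant(sma_scores, baseline_scores, alpha: float = 0.05) -> bool:
    """True if SMA's per-task gains are statistically significant at level alpha."""
    _, p_value = wilcoxon(sma_scores, baseline_scores, alternative="greater")
    return p_value < alpha

# Example with made-up numbers; a non-significant result would count against the claim.
# sma_advantage_significant([0.61, 0.47, 0.72, ...], [0.60, 0.48, 0.71, ...])
```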

Figures

Figures reproduced from arXiv: 2605.12872 by Anay Majee, Rishabh Iyer, Truong Pham.

Figure 1
Figure 1. Architecture of SMA. Comparison between CLIP, SigLIP, SAIL, and our alignment structure. SAIL has frozen pretrained encoders and trains only a small projection layer, in contrast to the end-to-end training of CLIP and SigLIP. However, all three methods use instance-based alignment training and can only extract information from singleton positive pairs. Our SMA losses are trained on top of frozen encoders and … view at source ↗
Figure 2
Figure 2. Illustration of the loss formulation in the Submodular Modality Aligner (SMA). Given an image and text pair, we first (a) augment them separately, alleviating the need for large datasets. Then we train only the alignment layers using the combinatorial SMA loss formulation L_SMA, which jointly models (b) cross-modal alignments (correlations between image and text sets) and (c) minimizes divergence across modalities… view at source ↗
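The captions describe a setup in which the encoders stay frozen and only lightweight alignment layers are trained with the combinatorial loss. A minimal training-step sketch under that reading, with all module names, dimensions, and the loss hook assumed for illustration:

```python
# Hypothetical training skeleton matching the Figure 1 description: frozen
# pretrained encoders, trainable alignment heads, set-based SMA-style loss
# (see the sketch above). Module names and dimensions are assumptions.
import torch
import torch.nn as nn

class AlignmentHead(nn.Module):
    """Small trainable projection on top of a frozen encoder."""
    def __init__(self, in_dim: int, out_dim: int = 512):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.proj(x)

def train_step(image_encoder, text_encoder, img_head, txt_head,
               optimizer, img_augs, txt_descs, loss_fn) -> float:
    with torch.no_grad():                      # encoders stay frozen
        img_feat = image_encoder(img_augs)     # (k, d_img) augmentation set of one entity
        txt_feat = text_encoder(txt_descs)     # (m, d_txt) description set of that entity
    loss = loss_fn(img_head(img_feat), txt_head(txt_feat))
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

Freezing the encoders and training only the heads is what the caption contrasts with CLIP and SigLIP's end-to-end training; only the projection parameters see gradients here.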
Original abstract

Despite the recent success of Multimodal Foundation Models (FMs), their reliance on massive paired datasets limits their applicability in low-data and rare-scenario settings where aligned data is scarce and expensive. A key bottleneck is the adoption of an instance-level formulation, which learns alignment by maximizing correlation between individual image-text pairs while neglecting the underlying geometric structure across modalities resulting in a modality gap across input modalities. In this paper, we propose a combinatorial paradigm for multimodal alignment that moves beyond pairwise learning and introduce the Submodular Modality Aligner (SMA), which treats multiple augmentations and descriptions of an entity as a set, leveraging multiple descriptions of the data to capture richer cross-modal structure. We instantiate SMA using a principled objective based on Submodular Mutual Information (SMI), which jointly maximizes inter-modality mutual information while reducing cross-modal divergence. This formulation enables the model to effectively utilize multiple positive associations and extract significantly more information from limited data. We evaluate SMA on 14 zero-shot classification and retrieval tasks from the CLIP benchmark and demonstrate consistent gains in the low-data regime. Notably, SMA achieves strong multimodal generalization using only tens of thousands of samples. This is orders of magnitude fewer than standard approaches. Our results highlight the importance of set-based formulations and submodular objectives for data-efficient multimodal learning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the Submodular Modality Aligner (SMA), a combinatorial approach to multimodal alignment that replaces instance-level pairwise learning with a set-based formulation. Multiple augmentations and descriptions of each entity are treated as a set, and alignment is performed via a Submodular Mutual Information (SMI) objective that jointly maximizes inter-modality mutual information while reducing cross-modal divergence. The authors evaluate SMA on 14 zero-shot classification and retrieval tasks from the CLIP benchmark and claim consistent gains, including strong generalization using only tens of thousands of samples—orders of magnitude fewer than standard approaches.

Significance. If the empirical results hold after detailed verification, the work would be significant for data-efficient multimodal learning. Grounding the objective in established submodular theory rather than ad-hoc fitting is a strength, and demonstrating that set-based SMI can extract substantially more signal from limited paired data could reduce reliance on massive datasets in low-resource settings.

major comments (3)
  1. Experiments section: The manuscript claims consistent gains across 14 tasks and orders-of-magnitude data reduction, yet the abstract (and by extension the results) supplies no quantitative numbers, error bars, baseline comparisons, or ablation studies isolating the SMI objective from multi-positive contrastive effects. Without these, the central claim that the combinatorial paradigm drives the reported gains cannot be verified.
  2. Method section (§3): The set construction process for augmentations and descriptions is not specified (how multiple descriptions are sampled or chosen, whether sets are fixed across runs). This leaves open the possibility that performance gains arise from the multi-positive formulation itself rather than the submodular objective, undermining the claim that SMI captures richer cross-modal geometry.
  3. Method section (§3, SMI formulation): The objective is described as principled, yet the presence of SMI balancing parameters (listed as free parameters) is not addressed; it is unclear whether these are fixed, cross-validated, or tuned per task, which directly affects the data-efficiency and reproducibility claims.
minor comments (2)
  1. Abstract: The phrase 'orders of magnitude fewer' should be accompanied by explicit sample counts for both SMA and the standard approaches it is compared against.
  2. Notation: Ensure SMI and related submodular terms are defined at first use and that any equations for the objective are numbered and cross-referenced in the text.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and will incorporate revisions to strengthen the manuscript's clarity, reproducibility, and empirical support.

Point-by-point responses
  1. Referee: Experiments section: The manuscript claims consistent gains across 14 tasks and orders-of-magnitude data reduction, yet the abstract (and by extension the results) supplies no quantitative numbers, error bars, baseline comparisons, or ablation studies isolating the SMI objective from multi-positive contrastive effects. Without these, the central claim that the combinatorial paradigm drives the reported gains cannot be verified.

    Authors: We agree that the abstract lacks specific quantitative results and that the results section would benefit from more explicit isolation of the SMI contribution. The full manuscript contains tables reporting performance on all 14 tasks with comparisons to CLIP baselines, but we acknowledge the absence of error bars and dedicated ablations. In revision, we will (1) update the abstract with key quantitative metrics (e.g., average zero-shot accuracy gains and data reduction factors), (2) add standard error bars to all tables and figures, (3) include explicit baseline comparisons, and (4) add an ablation subsection comparing SMA to a multi-positive contrastive loss without the submodular term. These changes will directly address verifiability of the combinatorial contribution. revision: yes

  2. Referee: Method section (§3): The set construction process for augmentations and descriptions is not specified (how multiple descriptions are sampled or chosen, whether sets are fixed across runs). This leaves open the possibility that performance gains arise from the multi-positive formulation itself rather than the submodular objective, undermining the claim that SMI captures richer cross-modal geometry.

    Authors: We accept that the set construction details were insufficiently specified. In the revised §3 we will explicitly describe the procedure: for each entity we sample a fixed number of augmentations (k=4) per image using standard CLIP augmentations and select up to m=3 descriptions from the available captions, with the resulting sets held fixed across all training runs and random seeds. We will also add an ablation that compares performance with fixed sets versus randomly re-sampled sets at each epoch, thereby isolating the benefit of the SMI objective from generic multi-positive effects. revision: yes

  3. Referee: Method section (§3, SMI formulation): The objective is described as principled, yet the presence of SMI balancing parameters (listed as free parameters) is not addressed; it is unclear whether these are fixed, cross-validated, or tuned per task, which directly affects the data-efficiency and reproducibility claims.

    Authors: We thank the referee for highlighting this omission. The balancing parameters in the SMI objective are fixed at λ=1.0 and μ=0.5 for all experiments; these values were selected once via a small held-out validation split from the training data and never tuned per downstream task. In the revision we will state these exact values in §3, describe the one-time validation procedure, and add a sensitivity plot in the appendix showing that performance remains stable for modest perturbations around these defaults. This clarification preserves the data-efficiency claim while improving reproducibility. revision: yes
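The second and third responses above pin down concrete settings (k=4 augmentations, up to m=3 captions, sets fixed across runs; λ=1.0 and μ=0.5, matching the default weights used in the loss sketch earlier on this page). A minimal sketch of that set construction, with the augmentation pipeline and caption-selection rule assumed for illustration; none of this is verified against the paper itself.

```python
# Hypothetical set construction per entity, following the k=4 / m=3 numbers
# quoted in the rebuttal; the transform stack and caption-selection rule are
# illustrative assumptions, not the authors' exact pipeline.
import random
import torch
from torchvision import transforms

clip_style_aug = transforms.Compose([
    transforms.RandomResizedCrop(224, scale=(0.8, 1.0)),
    transforms.RandomHorizontalFlip(),
    transforms.ToTensor(),
])

def build_entity_sets(image, captions, k: int = 4, m: int = 3, seed: int = 0):
    """Return one fixed (augmentation set, description set) pair for an entity."""
    torch.manual_seed(seed)                    # make augmentation draws repeatable
    rng = random.Random(seed)                  # deterministic caption choice across runs
    img_set = [clip_style_aug(image) for _ in range(k)]
    txt_set = rng.sample(captions, k=min(m, len(captions)))
    return img_set, txt_set
```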

Circularity Check

0 steps flagged

No circularity detected; SMI-based objective is independently grounded and empirically validated

Full rationale

The paper proposes SMA as a new set-based combinatorial paradigm instantiated via a Submodular Mutual Information (SMI) objective drawn from established submodular optimization literature. The central derivation moves from instance-level pairwise alignment to set-level mutual information maximization without any step that, by construction, reduces the claimed gains to a fitted parameter, a self-definition, or a self-citation chain. Evaluation on 14 zero-shot tasks reports empirical improvements in the low-data regime; these results are presented as outcomes of the method rather than tautological restatements of inputs. No load-bearing equation or uniqueness claim collapses to prior author work in a manner that would force the result. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The central claim rests on submodular optimization theory as background and the assumption that SMI jointly maximizes inter-modality information while reducing divergence; no new entities are postulated.

free parameters (1)
  • SMI balancing parameters
    Parameters controlling the trade-off between mutual information and divergence reduction in the submodular objective are likely chosen or tuned.
axioms (1)
  • domain assumption Submodular mutual information can jointly maximize inter-modality mutual information while reducing cross-modal divergence when applied to sets of multimodal descriptions.
    Directly invoked as the principled objective for SMA in the abstract.

pith-pipeline@v0.9.0 · 5540 in / 1175 out tokens · 35149 ms · 2026-05-14T20:28:21.138946+00:00 · methodology



    Chenliang Zhou, Fangcheng Zhong, and Cengiz Oztireli. Clip-pae: Projection-augmentation embedding to extract relevant features for a disentangled, interpretable, and controllable text- guided face manipulation, 2025. A Appendix A.1 Modularity Gap connection to Submodularity Consider the Submodular functionf(X) =−( P x∈X x)2, we have a version of SMI funct...