pith. machine review for the scientific record.

arxiv: 2603.29258 · v2 · submitted 2026-03-31 · 💻 cs.CV · cs.AI

Recognition: 1 theorem link

· Lean Theorem

Omni-NegCLIP: Enhancing CLIP with Front-Layer Contrastive Fine-Tuning for Comprehensive Negation Understanding

Jingqi Xu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 23:44 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords negation understanding · CLIP · contrastive learning · fine-tuning · transformer layers · vision-language models · presence-based negation · absence-based negation

The pith

Fine-tuning only the front layers of CLIP's text encoder with two negation-specific contrastive objectives raises negation task accuracy by up to 52.65 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard CLIP models misinterpret common negation phrases when matching images to text, such as failing to separate a scene from its negated description. The paper introduces Omni-NegCLIP, which adds a presence-based loss that distances images from negated captions of objects that are present and an absence-based loss that keeps images close to both original and negated captions of absent objects while keeping the two texts distinct. These losses are applied exclusively to the front transformer layers of the text encoder because those layers show stronger learning of negated text. The resulting model improves negation performance substantially while preserving or enhancing standard image-text retrieval. A reader would care because negation appears frequently in natural language, and better handling it would make vision-language systems more dependable for everyday descriptions.

Core claim

Omni-NegCLIP modifies the original InfoNCE contrastive loss by introducing a presence-based objective that pulls image embeddings toward original captions and away from presence-negated captions, together with an absence-based objective that aligns images with both original and absence-negated captions while preserving semantic distance between the text embeddings; these objectives are used to fine-tune only the front transformer layers of the CLIP text encoder at each step.
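Read literally, the two objectives reduce to contrastive terms over normalized embeddings. The following PyTorch sketch is illustrative only — the function names, temperature `tau`, and repulsion weight `lam` are assumptions, not the paper's actual loss equations:

```python
import torch
import torch.nn.functional as F

def presence_loss(img, cap, neg_cap, tau=0.07):
    """Illustrative presence-based term: pull the image toward its original
    caption and away from the presence-negated caption (object IS present).
    Names, tau, and the exact form are assumptions, not the paper's equation."""
    img, cap, neg_cap = (F.normalize(x, dim=-1) for x in (img, cap, neg_cap))
    pos = ((img * cap).sum(-1) / tau).exp()       # similarity to original caption
    neg = ((img * neg_cap).sum(-1) / tau).exp()   # similarity to negated caption
    # binary softmax: minimizing prefers the original caption over the negation
    return -torch.log(pos / (pos + neg)).mean()

def absence_loss(img, cap, neg_cap, tau=0.07, lam=0.5):
    """Illustrative absence-based term: keep the image close to BOTH the
    original caption and the caption negating an absent object, while a
    repulsion term keeps the two text embeddings apart (anti-collapse)."""
    img, cap, neg_cap = (F.normalize(x, dim=-1) for x in (img, cap, neg_cap))
    align = -((img * cap).sum(-1) + (img * neg_cap).sum(-1)) / tau
    repel = (cap * neg_cap).sum(-1) / tau
    return (align + lam * repel).mean()
```

Both terms would be added to the standard InfoNCE loss during fine-tuning; the repulsion term in `absence_loss` corresponds to the "semantic distinction" the abstract claims to preserve.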

What carries the argument

Presence-based and absence-based contrastive objectives applied only to the front transformer layers of the CLIP text encoder.

If this is right

  • Presence-based negation accuracy rises by up to 52.65 percent and absence-based negation accuracy rises by up to 12.50 percent.
  • Image-text retrieval performance can increase by up to 19.62 percent on standard benchmarks.
  • The model handles both types of negation more comprehensively than earlier approaches.
  • Targeted updates to early layers are sufficient to add negation capability without full-model retraining.
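The "targeted updates to early layers" idea is mechanically simple: freeze every parameter, then re-enable gradients for the first k transformer blocks only. A minimal sketch with a toy encoder — the attribute name `layers` and the cutoff `front_k` are illustrative, since real CLIP implementations name their blocks differently:

```python
import torch
from torch import nn

def freeze_all_but_front(text_encoder: nn.Module, front_k: int = 3) -> None:
    """Freeze every parameter, then re-enable gradients only for the first
    `front_k` transformer blocks. `layers` is an assumed attribute name."""
    for p in text_encoder.parameters():
        p.requires_grad = False
    for block in list(text_encoder.layers)[:front_k]:
        for p in block.parameters():
            p.requires_grad = True

class ToyTextEncoder(nn.Module):
    """Minimal stand-in: linear 'blocks' in place of transformer layers."""
    def __init__(self, n_layers: int = 6, dim: int = 8):
        super().__init__()
        self.layers = nn.ModuleList(nn.Linear(dim, dim) for _ in range(n_layers))

enc = ToyTextEncoder()
freeze_all_but_front(enc, front_k=2)
trainable = [i for i, blk in enumerate(enc.layers)
             if all(p.requires_grad for p in blk.parameters())]
print(trainable)  # → [0, 1]
```

Passing only the still-trainable parameters to the optimizer then realizes the front-layer-only update at each training step.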

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Other vision-language models may show comparable layer-specific patterns for negation, enabling similar lightweight fixes.
  • The separation of presence-based from absence-based negation supplies a useful template for constructing finer-grained evaluation sets.
  • The same front-layer strategy could be tested on related phenomena such as uncertainty or quantification expressions.

Load-bearing premise

The front transformer layers of the text encoder have a stronger general ability to learn negated text than later layers.

What would settle it

An experiment that applies the same contrastive objectives to later layers instead of front layers and finds equal or higher negation accuracy would show the layer choice is not required.

Figures

Figures reproduced from arXiv: 2603.29258 by Jingqi Xu.

Figure 1. (a) An example of presence-based negation from CC …
Figure 2. (a) Illustration of our designed presence-based contrastive objective. (b) Illustration of our designed absence-based contrastive …
Figure 3. (a) presence-based negation learning capability across …
read the original abstract

Vision-Language Models (VLMs) have demonstrated strong capabilities across a wide range of multimodal tasks. However, recent studies have shown that VLMs, such as CLIP, perform poorly in understanding negation expressions, which are common in natural language. In this work, we propose Omni-NegCLIP, a fine-tuned CLIP model that improves CLIP's understanding of two types of negation, namely presence-based negation and absence-based negation, which correspond to negated expressions of objects that are actually present in an image and those that may plausibly exist in an image but are in fact absent, respectively, by modifying CLIP's original InfoNCE contrastive loss. Specifically, we design a presence-based contrastive objective that pulls image embeddings closer to their original caption embeddings while pushing them away from the corresponding presence-based negated caption embeddings, and an absence-based contrastive objective that aligns image embeddings with both original and absence-based negated caption embeddings while maintaining a semantic distinction between the two text embeddings. Based on our observation that the front transformer layers of CLIP text encoder have stronger learning ability for negated text than the later layers, we fine-tune the front transformer layers of the CLIP text encoder at each training step using the combined contrastive objective. Experimental results show that, compared with pretrained CLIP, Omni-NegCLIP improves performance on presence-based negation and absence-based negation tasks by up to 52.65% and 12.50%, respectively, without sacrificing general capability in image-text retrieval and even improving it by up to 19.62%. Compared with prior works, Omni-NegCLIP demonstrates a more comprehensive ability to understand multiple types of negation tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper proposes Omni-NegCLIP, a fine-tuned CLIP variant that augments the original InfoNCE loss with two custom contrastive objectives—one that pulls image embeddings toward original captions while repelling presence-based negated captions, and another that aligns images with both original and absence-based negated captions while preserving distinction between the text embeddings. Based on an observation that front transformer layers of the text encoder exhibit stronger negation learning, only those layers are updated during fine-tuning. The work reports gains of up to 52.65% on presence-based negation tasks and 12.50% on absence-based negation tasks relative to pretrained CLIP, together with up to 19.62% improvement on image-text retrieval, and claims more comprehensive negation handling than prior methods.

Significance. If the empirical gains prove robust across standard benchmarks, controlled baselines, and statistical tests, the work would meaningfully advance negation handling in vision-language models, a documented weakness that limits reliable use in captioning, retrieval, and visual question answering. The selective front-layer update and dual-objective design offer a lightweight alternative to full-model retraining, provided the layer-selection rationale generalizes beyond the authors' training distribution.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (method): the decision to fine-tune only front transformer layers rests on the claim that these layers 'have stronger learning ability for negated text than the later layers,' yet no supporting measurement—layer-wise probing accuracy, gradient norms, or ablation on the same negation tasks—is described. Without this evidence the selective-update strategy is under-motivated; the reported gains may be driven entirely by the modified losses rather than the layer choice.
  2. [Experiments] Experimental section (implied by abstract claims): the manuscript reports large absolute improvements (52.65 % presence-based, 12.50 % absence-based, 19.62 % retrieval) but supplies no information on the evaluation datasets, baseline implementations, number of runs, variance, or controls for data leakage between training and test captions. These omissions render the central performance claims unverifiable and prevent assessment of whether the method generalizes or overfits to the authors' chosen splits.
  3. [§4] §4 (objectives): the absence-based contrastive term is described as aligning the image with both original and negated captions 'while maintaining a semantic distinction'; the precise formulation (temperature, weighting, or negative sampling strategy) is not shown, leaving open whether the objective can collapse to trivial solutions or inadvertently weaken the distinction it claims to enforce.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it named the concrete negation benchmarks and retrieval datasets used for the reported percentages.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript to improve clarity, motivation, and verifiability while preserving the core contributions.

read point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (method): the decision to fine-tune only front transformer layers rests on the claim that these layers 'have stronger learning ability for negated text than the later layers,' yet no supporting measurement—layer-wise probing accuracy, gradient norms, or ablation on the same negation tasks—is described. Without this evidence the selective-update strategy is under-motivated; the reported gains may be driven entirely by the modified losses rather than the layer choice.

    Authors: We agree that the layer-selection rationale requires explicit empirical support in the manuscript. Our preliminary internal analysis (layer-wise probing on negation classification) indicated stronger negation sensitivity in front layers, which motivated the selective update to retain general capabilities in later layers. In the revision we will add (i) layer-wise probing accuracies on the same negation tasks, (ii) gradient-norm comparisons, and (iii) an ablation of front-layer vs. full fine-tuning to isolate the contribution of the layer choice from the modified losses. revision: yes

  2. Referee: [Experiments] Experimental section (implied by abstract claims): the manuscript reports large absolute improvements (52.65 % presence-based, 12.50 % absence-based, 19.62 % retrieval) but supplies no information on the evaluation datasets, baseline implementations, number of runs, variance, or controls for data leakage between training and test captions. These omissions render the central performance claims unverifiable and prevent assessment of whether the method generalizes or overfits to the authors' chosen splits.

    Authors: We will expand the experimental section to explicitly list the evaluation datasets (derived from standard caption corpora with controlled negation augmentations), confirm that baselines are re-implemented from the cited prior works using identical hyperparameters, report results averaged over three independent runs with standard deviations, and include a statement verifying that training and test caption splits have no overlap. These additions will make the reported gains fully verifiable and allow assessment of generalization. revision: yes

  3. Referee: [§4] §4 (objectives): the absence-based contrastive term is described as aligning the image with both original and negated captions 'while maintaining a semantic distinction'; the precise formulation (temperature, weighting, or negative sampling strategy) is not shown, leaving open whether the objective can collapse to trivial solutions or inadvertently weaken the distinction it claims to enforce.

    Authors: The absence-based objective is defined in Equation (4) of §4 with explicit temperature scaling, a weighting hyper-parameter λ between the original and negated text terms, and a contrastive term that repels the original and negated text embeddings from each other. This repulsion term is designed to prevent collapse to a trivial solution. In the revision we will (i) restate the full loss equation with all hyper-parameters, (ii) add a short analysis showing that the repulsion term maintains the claimed semantic distinction, and (iii) include an ablation on λ to demonstrate stability. revision: partial
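For concreteness, a hedged form consistent with the rebuttal's description (temperature τ, weight λ between the original and negated text terms, and a repulsion term with weight μ between the two text embeddings) might read as follows; this is a reconstruction, not the paper's actual Equation (4):

```latex
\mathcal{L}_{\text{abs}}
  = -\log\frac{\exp(\operatorname{sim}(v, t)/\tau)}{Z}
    \;-\; \lambda \,\log\frac{\exp(\operatorname{sim}(v, t^{-})/\tau)}{Z}
    \;+\; \mu \,\operatorname{sim}(t, t^{-})/\tau,
\qquad
Z = \sum_{t' \in \mathcal{B}} \exp(\operatorname{sim}(v, t')/\tau)
```

Here v is the image embedding, t the original caption embedding, t⁻ the absence-negated caption embedding, and 𝓑 the in-batch text candidates. The positive μ term is what would prevent t and t⁻ from collapsing onto each other, which is the referee's stated concern.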

Circularity Check

0 steps flagged

No significant circularity; empirical fine-tuning with custom losses is self-contained

full rationale

The paper presents a standard fine-tuning procedure: two explicitly designed contrastive objectives (presence-based and absence-based) are added to InfoNCE, and only front transformer layers are updated based on a stated observational claim. No equations appear that reduce reported gains to quantities fitted from the evaluation data, no self-citations form a load-bearing chain, and the central claims rest on external comparisons to pretrained CLIP rather than internal redefinitions. The method therefore remains non-circular by the enumerated criteria.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The approach rests on one key empirical observation about layer capabilities and standard contrastive training assumptions; no new physical entities or heavily fitted constants are introduced beyond typical fine-tuning choices.

free parameters (2)
  • front layer count
    Number of early transformer layers selected for fine-tuning based on authors' observation of negation learning strength.
  • objective weighting
    Relative weight between presence-based and absence-based contrastive terms during combined training.
axioms (1)
  • domain assumption: Front transformer layers of the CLIP text encoder possess stronger learning ability for negated text than later layers.
    Invoked to justify selective fine-tuning of only front layers at each training step.
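The ledger's two free parameters map directly onto a training step: the number of unfrozen front layers, and the weights combining the objectives. A trivial hedged sketch — the weight names and defaults are invented for illustration:

```python
def combined_loss(l_infonce, l_presence, l_absence, w_pres=1.0, w_abs=1.0):
    """Combine the base InfoNCE loss with the two negation objectives.
    w_pres / w_abs realize the 'objective weighting' free parameter;
    the defaults here are illustrative, not the paper's."""
    return l_infonce + w_pres * l_presence + w_abs * l_absence

print(combined_loss(1.0, 2.0, 3.0))  # → 6.0
```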

pith-pipeline@v0.9.0 · 5607 in / 1280 out tokens · 56683 ms · 2026-05-13T23:44:25.768554+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

36 extracted references · 36 canonical work pages · 4 internal anchors

  1. [1]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736.

  2. [2]

    OpenFlamingo: An Open-Source Framework for Training Large Autoregressive Vision-Language Models

    Anas Awadalla, Irena Gao, Josh Gardner, Jack Hessel, Yusuf Hanafy, Wanrong Zhu, Kalyani Marathe, Yonatan Bitton, Samir Gadre, Shiori Sagawa, et al. OpenFlamingo: an open-source framework for training large autoregressive vision-language models. arXiv preprint arXiv:2308.01390, 2023.

  3. [3]

    Perception prioritized training of diffusion models

    Jooyoung Choi, Jungbeom Lee, Chaehun Shin, Sungwon Kim, Hyunwoo Kim, and Sungroh Yoon. Perception prioritized training of diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11472–11481, 2022.

  4. [4]

    Seeing syntax: Uncovering syntactic learning limitations in vision-language models

    Sri Harsha Dumpala, David Arps, Sageev Oore, Laura Kallmeyer, and Hassan Sajjad. Seeing syntax: uncovering syntactic learning limitations in vision-language models. arXiv preprint arXiv:2412.08111, 2024.

  5. [5]

    Extending CLIP for category-to-image retrieval in e-commerce

    Mariya Hendriksen, Maurits Bleeker, Svitlana Vakulenko, Nanne Van Noord, Ernst Kuiper, and Maarten De Rijke. Extending CLIP for category-to-image retrieval in e-commerce. In European Conference on Information Retrieval, pages 289–.

  6. [6]

    Assessing brittleness of image-text retrieval benchmarks from vision-language models perspective

    Mariya Hendriksen, Shuo Zhang, Ridho Reinanda, Mohamed Yahya, Edgar Meij, and Maarten de Rijke. Assessing brittleness of image-text retrieval benchmarks from vision-language models perspective. arXiv preprint arXiv:2407.15239, 2024.

  7. [7]

    Decoupled global-local alignment for improving compositional understanding

    Xiaoxing Hu, Kaicheng Yang, Jun Wang, Haoran Xu, Ziyong Feng, and Yupei Wang. Decoupled global-local alignment for improving compositional understanding. In Proceedings of the 33rd ACM International Conference on Multimedia, pages 3251–3260, 2025.

  8. [8]

    LLM2CLIP: Powerful language model unlocks richer visual representation

    Weiquan Huang, Aoqi Wu, Yifan Yang, Xufang Luo, Yuqing Yang, Liang Hu, Qi Dai, Chunyu Wang, Xiyang Dai, Dongdong Chen, et al. LLM2CLIP: powerful language model unlocks richer visual representation. arXiv preprint arXiv:2411.04997, 2024.

  9. [9]

    Scaling up visual and vision-language representation learning with noisy text supervision

    Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representation learning with noisy text supervision. In International Conference on Machine Learning, pages 4904–4916. PMLR.

  10. [10]

    ComCLIP: Training-free compositional image and text matching

    Kenan Jiang, Xuehai He, Ruize Xu, and Xin Wang. ComCLIP: training-free compositional image and text matching. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 6639–6659, 2024.

  11. [11]

    The hard positive truth about vision-language compositionality

    Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, and Ranjay Krishna. The hard positive truth about vision-language compositionality. In European Conference on Computer Vision, pages 37–54. Springer, 2024.

  12. [12]

    The negations of conjunctions, conditionals, and disjunctions

    Sangeet Khemlani, Isabel Orenes, and Philip N Johnson-Laird. The negations of conjunctions, conditionals, and disjunctions. Acta Psychologica, 151:1–7, 2014.

  13. [13]

    Flux

    Black Forest Labs. Flux. https://github.com/black-forest-labs/flux, 2024.

  14. [14]

    LLaVA-NeXT-Interleave: Tackling Multi-image, Video, and 3D in Large Multimodal Models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun Ma, and Chunyuan Li. LLaVA-NeXT-Interleave: tackling multi-image, video, and 3D in large multimodal models. arXiv preprint arXiv:2407.07895, 2024.

  15. [15]

    Enhancing vision-language compositional understanding with multimodal synthetic data

    Haoxin Li and Boyang Li. Enhancing vision-language compositional understanding with multimodal synthetic data. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 24849–24861, 2025.

  16. [16]

    Align before fuse: Vision and language representation learning with momentum distillation

    Junnan Li, Ramprasaath Selvaraju, Akhilesh Gotmare, Shafiq Joty, Caiming Xiong, and Steven Chu Hong Hoi. Align before fuse: vision and language representation learning with momentum distillation. Advances in Neural Information Processing Systems, 34:9694–9705, 2021.

  17. [17]

    BLIP: Bootstrapping language-image pre-training for unified vision-language understanding and generation

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. BLIP: bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, pages 12888–12900. PMLR, 2022.

  18. [18]

    Grounded language-image pre-training

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, et al. Grounded language-image pre-training. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10965–10975, 2022.

  19. [19]

    Microsoft COCO: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: common objects in context. In European Conference on Computer Vision, pages 740–755. Springer, 2014.

  20. [20]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  21. [21]

    MMBench: Is your multi-modal model an all-around player?

    Yuan Liu, Haodong Duan, Yuanhan Zhang, Bo Li, Songyang Zhang, Wangbo Zhao, Yike Yuan, Jiaqi Wang, Conghui He, Ziwei Liu, et al. MMBench: is your multi-modal model an all-around player? In European Conference on Computer Vision, pages 216–233. Springer, 2024.

  22. [22]

    Know "No" Better: A data-driven approach for enhancing negation awareness in CLIP

    Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know "no" better: a data-driven approach for enhancing negation awareness in CLIP. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2825–2835, 2025.

  23. [23]

    TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives

    Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, and Yezhou Yang. TripletCLIP: improving compositional reasoning of CLIP via synthetic vision-language negatives. Advances in Neural Information Processing Systems, 37:32731–32760.

  24. [24]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  25. [25]

    Hierarchical Text-Conditional Image Generation with CLIP Latents

    Aditya Ramesh, Prafulla Dhariwal, Alex Nichol, Casey Chu, and Mark Chen. Hierarchical text-conditional image generation with CLIP latents. arXiv preprint arXiv:2204.06125, 1(2):3, 2022.

  26. [26]

    High-resolution image synthesis with latent diffusion models

    Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-resolution image synthesis with latent diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 10684–10695, 2022.

  27. [27]

    Make-A-Video: Text-to-Video Generation without Text-Video Data

    Uriel Singer, Adam Polyak, Thomas Hayes, Xi Yin, Jie An, Songyang Zhang, Qiyuan Hu, Harry Yang, Oron Ashual, Oran Gafni, et al. Make-A-Video: text-to-video generation without text-video data. arXiv preprint arXiv:2209.14792.

  28. [28]

    FLAVA: A foundational language and vision alignment model

    Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. FLAVA: a foundational language and vision alignment model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15638–15650, 2022.

  29. [29]

    Learn "No" to say "Yes" better: Improving vision-language models via negations

    Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, and Aparna Bharati. Learn "no" to say "yes" better: improving vision-language models via negations. arXiv preprint arXiv:2403.20312, 2024.

  30. [30]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. Advances in Neural Information Processing Systems, 30, 2017.

  31. [31]

    OmniVL: One foundation model for image-language and video-language tasks

    Junke Wang, Dongdong Chen, Zuxuan Wu, Chong Luo, Luowei Zhou, Yucheng Zhao, Yujia Xie, Ce Liu, Yu-Gang Jiang, and Lu Yuan. OmniVL: one foundation model for image-language and video-language tasks. Advances in Neural Information Processing Systems, 35:5696–5710, 2022.

  32. [32]

    Learn to understand negation in video retrieval

    Ziyue Wang, Aozhu Chen, Fan Hu, and Xirong Li. Learn to understand negation in video retrieval. In Proceedings of the 30th ACM International Conference on Multimedia, pages 434–443, 2022.

  33. [33]

    Florence: A new foundation model for computer vision

    Lu Yuan, Dongdong Chen, Yi-Ling Chen, Noel Codella, Xiyang Dai, Jianfeng Gao, Houdong Hu, Xuedong Huang, Boxin Li, Chunyuan Li, et al. Florence: a new foundation model for computer vision. arXiv preprint arXiv:2111.11432, 2021.

  34. [34]

    MMMU: A massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI

    Xiang Yue, Yuansheng Ni, Kai Zhang, Tianyu Zheng, Ruoqi Liu, Ge Zhang, Samuel Stevens, Dongfu Jiang, Weiming Ren, Yuxuan Sun, et al. MMMU: a massive multi-discipline multimodal understanding and reasoning benchmark for expert AGI. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9556–9567.

  35. [35]

    LiT: Zero-shot transfer with locked-image text tuning

    Xiaohua Zhai, Xiao Wang, Basil Mustafa, Andreas Steiner, Daniel Keysers, Alexander Kolesnikov, and Lucas Beyer. LiT: zero-shot transfer with locked-image text tuning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 18123–18133, 2022.

  36. [36]

    A statistical perspective for efficient image-text matching

    Fan Zhang, Xian-Sheng Hua, Chong Chen, and Xiao Luo. A statistical perspective for efficient image-text matching. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 355–369, 2024.