
arxiv: 2603.25722 · v2 · pith:2O3HS3KSnew · submitted 2026-03-26 · 💻 cs.CV · cs.LG

No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Pith reviewed 2026-05-21 10:20 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords contrastive learning · vision-language models · compositionality · concept-centric learning · attention pooling · zero-shot learning · caption splitting

The pith

Splitting captions into short concept-centric parts and adding attention pooling lets contrastive vision-language models learn compositionality while keeping zero-shot capabilities intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two main reasons why contrastive vision-language models fail at compositionality: long training captions do not demand learning how concepts bind together, and the global pooling at the end of encoders discards the necessary binding information. To fix this, the authors split captions into shorter segments focused on individual concepts using standard natural language processing tools and align these with the corresponding image regions. They also replace global pooling with a parameter-free cross-modal attention mechanism that produces concept-specific visual embeddings. With these changes and some auxiliary contrastive losses, the models reach state-of-the-art results on compositionality benchmarks and do not lose performance on zero-shot tasks or retrieval, all without raising inference costs. A sympathetic reader would care because it offers a simple, practical way to improve a key weakness in popular models without the usual trade-offs or need for specialized hard negative examples.

Core claim

The paper shows that concept centric learning, obtained by breaking long captions into short concept-focused parts and using cross-modal attention pooling to create matching visual embeddings, enables contrastive vision-language models to develop compositional representations. This method avoids the need for hard negative samples and delivers superior results on compositionality tasks while preserving or enhancing zero-shot classification and retrieval performance without any added inference overhead.
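
To make the caption-splitting step concrete, here is a minimal sketch. The paper reports using standard NLP software for this step; the toy splitter below is only a stand-in that cuts a caption at a small hand-picked list of connective words. The function name `split_caption`, the `CONNECTIVES` set, and the example caption are all assumptions made for illustration, not the authors' pipeline.

```python
import re
from typing import List

# Hypothetical connective list: a crude stand-in for a real syntactic parser.
CONNECTIVES = {"and", "with", "on", "in", "near", "under", "behind", "beside"}

def split_caption(caption: str) -> List[str]:
    """Split a caption into short concept-centric parts (toy heuristic)."""
    words = re.findall(r"[a-z0-9'-]+", caption.lower())
    parts, current = [], []
    for word in words:
        if word in CONNECTIVES:
            # A connective closes the current part; the connective itself is dropped.
            if current:
                parts.append(" ".join(current))
            current = []
        else:
            current.append(word)
    if current:
        parts.append(" ".join(current))
    return parts

parts = split_caption("A black sweater on a wooden chair and a red mug")
print(parts)  # ['a black sweater', 'a wooden chair', 'a red mug']
```

Each resulting part keeps an attribute next to the object it modifies, which is the property a short concept-centric caption needs. A real parser would handle cases this heuristic gets wrong, such as connectives inside a noun phrase.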

What carries the argument

Concept-centric caption splitting combined with parameter-free cross-modal attention-pooling to generate concept-specific embeddings from the image encoder.
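
One way to read "parameter-free cross-modal attention pooling" is sketched below. The embedding of a concept-centric caption part acts as the query, the visual tokens act as both keys and values, and no learned projection is involved. The function name `attention_pool`, the `temperature` argument, and the plain dot-product softmax form are assumptions; this summary does not give the paper's exact formulation.

```python
import math
from typing import List, Sequence

def attention_pool(query: Sequence[float],
                   tokens: Sequence[Sequence[float]],
                   temperature: float = 1.0) -> List[float]:
    """Pool visual tokens into one embedding, weighted by similarity to a text query."""
    # Dot-product similarity between the text query and every visual token.
    scores = [sum(q * t for q, t in zip(query, tok)) / temperature for tok in tokens]
    # Numerically stable softmax over the tokens.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]
    # Weighted sum of the tokens; no learnable parameters appear anywhere.
    dim = len(tokens[0])
    return [sum(w * tok[d] for w, tok in zip(weights, tokens)) for d in range(dim)]

# Three toy visual tokens; the query is most similar to the first one.
visual_tokens = [[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
concept_query = [4.0, 0.0]
pooled = attention_pool(concept_query, visual_tokens)
print(pooled)
```

In this toy example the pooled embedding is dominated by the first token, the one that matches the query. A zero query would instead give the plain mean of the tokens, which corresponds to global average pooling.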

If this is right

  • Reaches state-of-the-art performance on standard compositionality benchmarks
  • Maintains or improves zero-shot and retrieval capabilities
  • Requires no increase in inference cost
  • Does not rely on hard negative samples that often harm basic model abilities

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This method provides an alternative to hard-negative based approaches for improving compositionality.
  • The parameter-free nature of the attention pooling suggests it can be added to existing models with minimal changes.
  • If these root causes prove primary, the same splitting and pooling steps could address binding failures in other multimodal contrastive setups.

Load-bearing premise

That the two identified root causes—long captions not requiring compositional representations and global pooling losing binding information—are the dominant factors limiting compositionality in these models.

What would settle it

Training models with and without the caption splitting and attention pooling steps, then checking whether compositionality benchmark scores rise substantially only when both changes are present while zero-shot and retrieval metrics stay stable or improve.
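
The test described above amounts to a 2×2 ablation. A minimal sketch of enumerating the four runs and checking the outcome follows. The config keys, run names, and the thresholds `min_gain` and `tolerance` are hypothetical and do not come from the paper or its released code; the score dictionaries are meant to be filled in from real training runs.

```python
from itertools import product
from typing import Dict, List

def ablation_grid() -> List[dict]:
    """Enumerate the four training runs of the 2x2 ablation."""
    runs = []
    for split, attn in product([False, True], repeat=2):
        runs.append({
            "name": f"split={int(split)}_attnpool={int(attn)}",
            "caption_splitting": split,
            "attention_pooling": attn,
        })
    return runs

def settles_claim(comp: Dict[str, float], zero_shot: Dict[str, float],
                  min_gain: float, tolerance: float) -> bool:
    """comp and zero_shot map run name -> benchmark score from real runs."""
    full, base = "split=1_attnpool=1", "split=0_attnpool=0"
    others = [name for name in comp if name != full]
    # Compositionality must rise clearly only when both changes are present.
    gains_only_with_both = all(comp[full] - comp[name] >= min_gain for name in others)
    # Zero-shot performance must not fall below the baseline beyond the tolerance.
    zero_shot_stable = zero_shot[full] >= zero_shot[base] - tolerance
    return gains_only_with_both and zero_shot_stable

for run in ablation_grid():
    print(run["name"])
```

If either single-change run already matches the full run on compositionality, `settles_claim` returns `False`, which would suggest the other change is not load-bearing.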

Figures

Figures reproduced from arXiv: 2603.25722 by Brais Martinez, David T. Hoffmann, Hai X. Pham, Ricardo Guerrero.

Figure 1: Method overview. (a) SigLIP uses a learnable query token in combination with an attention layer to pool the visual tokens into a single token. Aligning only global representations hampers the learning of a compositional representation. (b) Similar to SigLIP, our method aligns the global representations v and t. To simplify learning of a compositional representation, our method extends SigLIP by first, pool…
Figure 2: Change in attention when using C2LIP compared to SigLIP. We visualize the difference in attention between C2LIP and SigLIP to the visual tokens given a caption and an image. Higher attention for C2LIP is shown as green. Lower attention for C2LIP is indicated with violet. White means no change. (a) Black regions that are not a sweater get reduced attention, the sweater gets more or unchanged attention. (b) T…
Abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/saic-fi/concept_centric_clip.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript diagnoses limited compositionality in contrastive vision-language models as arising from two root causes: long training captions that do not necessitate compositional representations, and global pooling in text and image encoders that eliminates binding information. It proposes remedying these via (1) splitting captions into short concept-centric parts using off-the-shelf NLP tools and aligning them to images, and (2) a parameter-free cross-modal attention pooling to produce concept-centric visual embeddings. Combined with auxiliary contrastive losses, these changes are claimed to deliver SOTA results on standard compositionality benchmarks while preserving or improving zero-shot classification and retrieval performance, all without raising inference cost. Code is released.

Significance. If the empirical claims are substantiated, the work would be significant because it offers a lightweight, generalizable alternative to hard-negative mining that avoids benchmark-specificity and capability degradation. The parameter-free pooling and lack of inference overhead make the method practical for existing CLIP-style pipelines, and the public code release aids reproducibility and follow-up work.

major comments (2)
  1. Abstract: The central claim that caption splitting and attention pooling directly remedy the identified root causes and yield SOTA compositionality without zero-shot degradation rests on the assumption that these are the primary failure modes. No diagnostic evidence (e.g., probing for attribute-object binding before/after global pooling or controlled tests isolating parser errors) is described to rule out alternative drivers such as loss formulation or data statistics; if auxiliary losses alone suffice, the proposed remedies are not load-bearing for the result.
  2. Abstract and §3 (Method): The assertion that long captions 'do not require a compositional representation' and that global pooling causes 'complete loss of the necessary information' is presented without quantitative support or ablation showing that splitting and attention pooling are necessary rather than incidental to the reported gains. This leaves open the possibility that gains fail to generalize beyond the chosen benchmarks if other untested factors dominate.
minor comments (1)
  1. The description of the cross-modal attention pooling would benefit from an explicit equation or pseudocode to clarify how concept-centric visual embeddings are formed from the image encoder outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications based on the empirical results and committing to revisions that add supporting analysis where the current version is limited.

  1. Referee: Abstract: The central claim that caption splitting and attention pooling directly remedy the identified root causes and yield SOTA compositionality without zero-shot degradation rests on the assumption that these are the primary failure modes. No diagnostic evidence (e.g., probing for attribute-object binding before/after global pooling or controlled tests isolating parser errors) is described to rule out alternative drivers such as loss formulation or data statistics; if auxiliary losses alone suffice, the proposed remedies are not load-bearing for the result.

    Authors: We acknowledge that the manuscript does not present explicit diagnostic probes, such as attribute-object binding analysis before versus after global pooling or controlled tests isolating parser errors from other factors. Our current evidence consists of end-to-end performance gains on compositionality benchmarks together with ablations showing that the full combination outperforms baselines. To directly address the concern that auxiliary losses alone might suffice, we will add a dedicated ablation study in the revised version that trains models using only the auxiliary contrastive losses without caption splitting or cross-modal attention pooling. We will also include a brief discussion of alternative drivers such as loss formulation and data statistics, referencing related work on these topics. revision: yes

  2. Referee: Abstract and §3 (Method): The assertion that long captions 'do not require a compositional representation' and that global pooling causes 'complete loss of the necessary information' is presented without quantitative support or ablation showing that splitting and attention pooling are necessary rather than incidental to the reported gains. This leaves open the possibility that gains fail to generalize beyond the chosen benchmarks if other untested factors dominate.

    Authors: We agree that the manuscript would benefit from more direct quantitative support for the stated root causes. While Section 4 reports ablations that isolate the contribution of splitting and attention pooling to the final performance numbers, we will expand §3 and the experiments section with additional controlled comparisons: (1) training with long captions versus split concept-centric captions while keeping all other components fixed, and (2) replacing global pooling with the proposed cross-modal attention pooling in isolation. These results will be presented to demonstrate necessity rather than incidental benefit. On generalization, we already evaluate on multiple standard compositionality benchmarks; we will add a short limitations paragraph noting that further tests on additional datasets would be valuable for future work. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical modifications rest on standard tools and benchmarks

The paper presents an empirical approach that splits captions via off-the-shelf NLP parsers and adds a parameter-free cross-modal attention pool plus auxiliary contrastive losses. These changes are motivated by analysis of training captions and global pooling but do not reduce any claimed result to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop. Performance is reported via direct comparison on compositionality and zero-shot benchmarks rather than by construction from the inputs. No equations or uniqueness theorems are invoked that collapse back to the authors' prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard contrastive learning assumptions and the reliability of off-the-shelf NLP for concept extraction; no new free parameters, fitted constants, or postulated entities are introduced. The attention pooling is explicitly parameter-free.

axioms (2)
  • domain assumption: Standard contrastive learning framework for vision-language models remains valid when captions are split into short concept-centric parts.
    The auxiliary losses and alignment strategy presuppose the base contrastive setup continues to function under the proposed modifications.
  • domain assumption: Off-the-shelf NLP software reliably extracts short concept-centric caption segments that correspond to visual elements.
    The first proposed solution depends on this extraction step producing useful training signals.

pith-pipeline@v0.9.0 · 5812 in / 1401 out tokens · 72235 ms · 2026-05-21T10:20:27.829171+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
