No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models

Brais Martinez; David T. Hoffmann; Hai X. Pham; Ricardo Guerrero

arxiv: 2603.25722 · v2 · pith:2O3HS3KSnew · submitted 2026-03-26 · 💻 cs.CV · cs.LG

Pith reviewed 2026-05-21 10:20 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords: contrastive learning, vision-language models, compositionality, concept-centric learning, attention pooling, zero-shot learning, caption splitting

The pith

Splitting captions into short concept-centric parts and adding attention pooling lets contrastive vision-language models learn compositionality while keeping zero-shot capabilities intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies two main reasons why contrastive vision-language models fail at compositionality: long training captions do not demand learning how concepts bind together, and the global pooling at the end of encoders discards the necessary binding information. To fix this, the authors split captions into shorter segments focused on individual concepts using standard natural language processing tools and align these with the corresponding image regions. They also replace global pooling with a parameter-free cross-modal attention mechanism that produces concept-specific visual embeddings. With these changes and some auxiliary contrastive losses, the models reach state-of-the-art results on compositionality benchmarks and do not lose performance on zero-shot tasks or retrieval, all without raising inference costs. A sympathetic reader would care because it offers a simple, practical way to improve a key weakness in popular models without the usual trade-offs or need for specialized hard negative examples.
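The caption-splitting step can be illustrated with a toy sketch. The paper reportedly relies on standard NLP software; the regex heuristic below is only a stand-in for a real parser, and the `CONNECTORS` pattern and the `split_caption` name are assumptions made here for illustration, not the authors' implementation.

```python
import re

# Hypothetical connector pattern: a crude stand-in for a parser's phrase boundaries.
CONNECTORS = r"\b(?:and|with|on|in|next to|near|under|behind|beside)\b|,"

def split_caption(caption: str) -> list[str]:
    """Split a caption into short concept-centric parts (toy heuristic)."""
    parts = re.split(CONNECTORS, caption.lower())
    cleaned = []
    for part in parts:
        # Strip surrounding spaces and periods, then drop a leading article.
        text = part.strip(" .")
        text = re.sub(r"^(?:a|an|the)\s+", "", text)
        if text:
            cleaned.append(text)
    return cleaned

print(split_caption("A black sweater and a red scarf on a wooden chair."))
# ['black sweater', 'red scarf', 'wooden chair']
```

Each resulting part names a single attribute-object pair, which is the kind of short text that would then be aligned with the image. A dependency parser would handle nesting and relative clauses that this heuristic cannot.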

Core claim

The paper shows that concept centric learning, obtained by breaking long captions into short concept-focused parts and using cross-modal attention pooling to create matching visual embeddings, enables contrastive vision-language models to develop compositional representations. This method avoids the need for hard negative samples and delivers superior results on compositionality tasks while preserving or enhancing zero-shot classification and retrieval performance without any added inference overhead.

What carries the argument

Concept-centric caption splitting combined with parameter-free cross-modal attention-pooling to generate concept-specific embeddings from the image encoder.
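One plausible reading of "parameter-free cross-modal attention-pooling" is sketched below: each concept's text embedding acts as a query over the visual tokens, with no learned projection matrices. The scaled dot-product form, the `1/sqrt(d)` scaling, and the function name are assumptions; this page does not give the paper's exact formulation.

```python
import numpy as np

def cross_modal_attention_pool(visual_tokens, concept_embeddings):
    """Pool visual tokens into one embedding per concept, with no learned weights.

    visual_tokens:      (N, d) patch embeddings from the image encoder.
    concept_embeddings: (K, d) text embeddings, one per caption part.
    Returns:            (K, d) concept-specific visual embeddings.
    """
    d = visual_tokens.shape[-1]
    # Text embeddings act as queries; visual tokens act as keys and values.
    scores = concept_embeddings @ visual_tokens.T / np.sqrt(d)  # (K, N)
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax over visual tokens
    return weights @ visual_tokens

rng = np.random.default_rng(0)
tokens = rng.normal(size=(196, 64))   # e.g. a 14x14 grid of patch tokens
concepts = rng.normal(size=(3, 64))   # three caption parts
print(cross_modal_attention_pool(tokens, concepts).shape)  # (3, 64)
```

Because the pooling uses only the encoders' own outputs, it adds no parameters. If it is applied only during training, that would also be consistent with the claim of unchanged inference cost.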

If this is right

Reaches state-of-the-art performance on standard compositionality benchmarks
Maintains or improves zero-shot and retrieval capabilities
Requires no increase in inference cost
Does not rely on hard negative samples that often harm basic model abilities

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This method provides an alternative to hard-negative based approaches for improving compositionality.
The parameter-free nature of the attention pooling suggests it can be added to existing models with minimal changes.
If these root causes prove primary, the same splitting and pooling steps could address binding failures in other multimodal contrastive setups.

Load-bearing premise

That the two identified root causes—long captions not requiring compositional representations and global pooling losing binding information—are the dominant factors limiting compositionality in these models.

What would settle it

Training models with and without the caption splitting and attention pooling steps, then checking whether compositionality benchmark scores rise substantially only when both changes are present while zero-shot and retrieval metrics stay stable or improve.
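The test described above amounts to a 2x2 ablation. A minimal sketch of how the four runs and the "only when both are present" check might be organised follows; the configuration keys, the `margin` threshold, and both function names are hypothetical, and the check covers only the compositionality score, not the zero-shot and retrieval stability condition.

```python
from itertools import product

def ablation_grid():
    """Enumerate the four training configurations of the 2x2 ablation."""
    return [
        {"split_captions": s, "attention_pooling": p}
        for s, p in product([False, True], repeat=2)
    ]

def both_required(scores, margin):
    """Return True if only the full configuration beats the baseline by `margin`.

    scores maps (split_captions, attention_pooling) -> compositionality score.
    """
    base = scores[(False, False)]
    full_gain = scores[(True, True)] - base
    partial_gains = [scores[(True, False)] - base, scores[(False, True)] - base]
    return full_gain >= margin and all(g < margin for g in partial_gains)

for config in ablation_grid():
    print(config)
```

If either single-change run already clears the margin, the result would instead suggest that one of the two remedies carries most of the gain on its own.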

Figures

Figures reproduced from arXiv: 2603.25722 by Brais Martinez, David T. Hoffmann, Hai X. Pham, Ricardo Guerrero.

**Figure 1.** Method overview. (a) SigLIP uses a learnable query token in combination with an attention layer to pool the visual tokens into a single token. Aligning only global representations hampers the learning of a compositional representation. (b) Similar to SigLIP, our method aligns the global representations v and t. To simplify learning of a compositional representation, our method extends SigLIP by first, pool…

**Figure 2.** Change in attention when using C2LIP compared to SigLIP. We visualize the difference in attention between C2LIP and SigLIP to the visual tokens given a caption and an image. Higher attention for C2LIP is shown in green, lower attention for C2LIP in violet. White means no change. (a) Black regions that are not a sweater get reduced attention, the sweater gets more or unchanged attention. (b) T…

Original abstract

Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders leads to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/saic-fi/concept_centric_clip.
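The abstract does not spell out the "simple auxiliary contrastive losses". Since the method is described as extending SigLIP, one plausible form is the pairwise sigmoid loss applied to (concept-specific visual embedding, caption part) pairs. The sketch below assumes that form; the function name and the default `temperature` and `bias` values are illustrative assumptions rather than values taken from the paper.

```python
import numpy as np

def sigmoid_contrastive_loss(image_emb, text_emb, temperature=10.0, bias=-10.0):
    """Pairwise sigmoid loss over a batch of matched (image, text) embeddings.

    image_emb, text_emb: (B, d) arrays where row i of each forms a positive pair.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=-1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=-1, keepdims=True)
    logits = temperature * img @ txt.T + bias      # (B, B) pairwise similarities
    labels = 2.0 * np.eye(len(img)) - 1.0          # +1 on the diagonal, -1 elsewhere
    # log(sigmoid(x)) computed stably as -logaddexp(0, -x)
    log_sigmoid = -np.logaddexp(0.0, -labels * logits)
    return -log_sigmoid.sum() / len(img)

emb = np.eye(4)
aligned = sigmoid_contrastive_loss(emb, emb)
shuffled = sigmoid_contrastive_loss(emb, np.roll(emb, 1, axis=0))
print(aligned < shuffled)  # True
```

Every pair is scored independently, so the same function could be reused unchanged for both the global embeddings and the concept-level ones.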

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Splitting captions into short concepts and adding parameter-free cross-modal attention pooling looks like a workable fix for compositionality in contrastive V&L models without the usual zero-shot hit.

The core idea here is straightforward: long training captions let models get by without learning proper bindings, and global pooling wipes out the spatial or relational details needed for that. By breaking captions into short concept-centric pieces with standard NLP tools and replacing global pooling with a simple cross-modal attention step, the authors claim they get SOTA on compositionality benchmarks while holding or improving zero-shot and retrieval numbers, all at no extra inference cost. They also add some auxiliary contrastive losses and release the code, which is useful for replication checks.

Referee Report

2 major / 1 minor

Summary. The manuscript diagnoses limited compositionality in contrastive vision-language models as arising from two root causes: long training captions that do not necessitate compositional representations, and global pooling in text and image encoders that eliminates binding information. It proposes remedying these via (1) splitting captions into short concept-centric parts using off-the-shelf NLP tools and aligning them to images, and (2) a parameter-free cross-modal attention pooling to produce concept-centric visual embeddings. Combined with auxiliary contrastive losses, these changes are claimed to deliver SOTA results on standard compositionality benchmarks while preserving or improving zero-shot classification and retrieval performance, all without raising inference cost. Code is released.

Significance. If the empirical claims are substantiated, the work would be significant because it offers a lightweight, generalizable alternative to hard-negative mining that avoids benchmark-specificity and capability degradation. The parameter-free pooling and lack of inference overhead make the method practical for existing CLIP-style pipelines, and the public code release aids reproducibility and follow-up work.

major comments (2)

Abstract: The central claim that caption splitting and attention pooling directly remedy the identified root causes and yield SOTA compositionality without zero-shot degradation rests on the assumption that these are the primary failure modes. No diagnostic evidence (e.g., probing for attribute-object binding before/after global pooling or controlled tests isolating parser errors) is described to rule out alternative drivers such as loss formulation or data statistics; if auxiliary losses alone suffice, the proposed remedies are not load-bearing for the result.
Abstract and §3 (Method): The assertion that long captions 'do not require a compositional representation' and that global pooling causes 'complete loss of the necessary information' is presented without quantitative support or ablation showing that splitting and attention pooling are necessary rather than incidental to the reported gains. This leaves open the possibility that gains fail to generalize beyond the chosen benchmarks if other untested factors dominate.

minor comments (1)

The description of the cross-modal attention pooling would benefit from an explicit equation or pseudocode to clarify how concept-centric visual embeddings are formed from the image encoder outputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications based on the empirical results and committing to revisions that add supporting analysis where the current version is limited.

Referee: Abstract: The central claim that caption splitting and attention pooling directly remedy the identified root causes and yield SOTA compositionality without zero-shot degradation rests on the assumption that these are the primary failure modes. No diagnostic evidence (e.g., probing for attribute-object binding before/after global pooling or controlled tests isolating parser errors) is described to rule out alternative drivers such as loss formulation or data statistics; if auxiliary losses alone suffice, the proposed remedies are not load-bearing for the result.

Authors: We acknowledge that the manuscript does not present explicit diagnostic probes, such as attribute-object binding analysis before versus after global pooling or controlled tests isolating parser errors from other factors. Our current evidence consists of end-to-end performance gains on compositionality benchmarks together with ablations showing that the full combination outperforms baselines. To directly address the concern that auxiliary losses alone might suffice, we will add a dedicated ablation study in the revised version that trains models using only the auxiliary contrastive losses without caption splitting or cross-modal attention pooling. We will also include a brief discussion of alternative drivers such as loss formulation and data statistics, referencing related work on these topics.
revision: yes

Referee: Abstract and §3 (Method): The assertion that long captions 'do not require a compositional representation' and that global pooling causes 'complete loss of the necessary information' is presented without quantitative support or ablation showing that splitting and attention pooling are necessary rather than incidental to the reported gains. This leaves open the possibility that gains fail to generalize beyond the chosen benchmarks if other untested factors dominate.

Authors: We agree that the manuscript would benefit from more direct quantitative support for the stated root causes. While Section 4 reports ablations that isolate the contribution of splitting and attention pooling to the final performance numbers, we will expand §3 and the experiments section with additional controlled comparisons: (1) training with long captions versus split concept-centric captions while keeping all other components fixed, and (2) replacing global pooling with the proposed cross-modal attention pooling in isolation. These results will be presented to demonstrate necessity rather than incidental benefit. On generalization, we already evaluate on multiple standard compositionality benchmarks; we will add a short limitations paragraph noting that further tests on additional datasets would be valuable for future work.
revision: yes

Circularity Check

0 steps flagged

No circularity: empirical modifications rest on standard tools and benchmarks

The paper presents an empirical approach that splits captions via off-the-shelf NLP parsers and adds a parameter-free cross-modal attention pool plus auxiliary contrastive losses. These changes are motivated by analysis of training captions and global pooling but do not reduce any claimed result to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop. Performance is reported via direct comparison on compositionality and zero-shot benchmarks rather than by construction from the inputs. No equations or uniqueness theorems are invoked that collapse back to the authors' prior outputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The method rests on standard contrastive learning assumptions and the reliability of off-the-shelf NLP for concept extraction; no new free parameters, fitted constants, or postulated entities are introduced. The attention pooling is explicitly parameter-free.

axioms (2)

domain assumption: Standard contrastive learning framework for vision-language models remains valid when captions are split into short concept-centric parts.
The auxiliary losses and alignment strategy presuppose the base contrastive setup continues to function under the proposed modifications.

domain assumption: Off-the-shelf NLP software reliably extracts short concept-centric caption segments that correspond to visual elements.
The first proposed solution depends on this extraction step producing useful training signals.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder."

IndisputableMonolith/Foundation/AlexanderDuality.lean · alexander_duality_circle_linking · tag: unclear
Paper passage: "The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place."

Each tag describes the relation between the paper passage and the cited Recognition theorem.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


[1] [1]

Object-centric binding in contrastive language-image pretraining.arXiv preprint arXiv:2502.14113, 2025

Rim Assouel, Pietro Astolfi, Florian Bordes, Michal Drozdzal, and Adriana Romero-Soriano. Object-centric binding in contrastive language-image pretraining.arXiv preprint arXiv:2502.14113, 2025. 1, 3

work page arXiv 2025

[2] [2]

Revis- iting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens

Yuxiao Chen, Jianbo Yuan, Yu Tian, Shijie Geng, Xinyu Li, Ding Zhou, Dimitris N Metaxas, and Hongxia Yang. Revis- iting multimodal representation in contrastive learning: from patch and token embeddings to finite discrete tokens. In IEEE Conference on Computer Vision and Pattern Recog- nition, 2023. 5, 12

work page 2023

[3] [3]

CLIP benchmark,

Mehdi Cherti and Romain Beaumont. CLIP benchmark,

work page

[4] [4]

Teaching structured Vision&Language concepts to vi- sion&language models

Sivan Doveh, Assaf Arbelle, Sivan Harary, Rameswar Panda, Roei Herzig, Eli Schwartz, Donghyun Kim, Raja Giryes, Rogerio Feris, Shimon Ullman, and Leonid Karlin- sky. Teaching structured Vision&Language concepts to vi- sion&language models. InIEEE Conference on Computer Vision and Pattern Recognition, 2022. 1, 3, 5, 12

work page 2022

[5] [5]

Dense and aligned captions (DAC) promote compositional reasoning in VL models

Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, Shimon Ull- man, and Leonid Karlinsky. Dense and aligned captions (DAC) promote compositional reasoning in VL models. In Neural Information Processing Systems, 2023. 1, 3, 5, 12

work page 2023

[6] [6]

SUG- ARCREPE++ dataset: vision-language model sensitivity to semantic and lexical alterations

Sri Harsha Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. SUG- ARCREPE++ dataset: vision-language model sensitivity to semantic and lexical alterations. InNeural Information Pro- cessing Systems - Datasets and Benchmarks Track, 2024. 1, 2, 12

work page 2024

[7] [7]

Im- ageinwords: Unlocking hyper-detailed image descriptions,

Roopal Garg, Andrea Burns, Burcu Karagol Ayan, Yonatan Bitton, Ceslee Montgomery, Yasumasa Onoe, Andrew Bun- ner, Ranjay Krishna, Jason Baldridge, and Radu Soricut. Im- ageinwords: Unlocking hyper-detailed image descriptions,

work page

[8] B Gurung, David T Hoffmann, and Thomas Brox. Common data properties limit object-attribute binding in clip. In German Conference on Pattern Recognition (GCPR), 2025. 1, 3

[9] Matthew Honnibal and Ines Montani. spaCy 2: Natural language understanding with Bloom embeddings, convolutional neural networks and incremental parsing. To appear, 2017. 3, 4

[10] Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. SugarCrepe: Fixing hackable benchmarks for vision-language compositionality. In Neural Information Processing Systems - Datasets and Benchmarks Track, 2023. 1, 2, 3

[11] Gabriel Ilharco, Mitchell Wortsman, Ross Wightman, Cade Gordon, Nicholas Carlini, Rohan Taori, Achal Dave, Vaishaal Shankar, Hongseok Namkoong, John Miller, Hannaneh Hajishirzi, Ali Farhadi, and Ludwig Schmidt. Openclip, 2021. If you use this software, please cite it as below. 5

[12] Dong Jing, Xiaolong He, Yutian Luo, Nanyi Fei, Guoxing Yang, Wei Wei, Huiwen Zhao, and Zhiwu Lu. FineCLIP: Self-distilled region-based CLIP for better fine-grained understanding. In Neural Information Processing Systems,

[13] Alexander Kolesnikov, Alexey Dosovitskiy, Dirk Weissenborn, Georg Heigold, Jakob Uszkoreit, Lucas Beyer, Matthias Minderer, Mostafa Dehghani, Neil Houlsby, Sylvain Gelly, Thomas Unterthiner, and Xiaohua Zhai. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR, 2021. 3

[14] Samuel Lavoie, Polina Kirichenko, Mark Ibrahim, Mido Assran, Andrew Gordon Wilson, Aaron Courville, and Nicolas Ballas. Modeling caption diversity in contrastive vision-language pretraining. In International Conference on Machine Learning, 2024. 5, 12

[15] Martha Lewis, Nihal V Nayak, Peilin Yu, Qinan Yu, Jack Merullo, Stephen H Bach, and Ellie Pavlick. Does clip bind concepts? probing compositionality in large image models. arXiv preprint arXiv:2212.10537, 2022. 1, 3

[16] Haoxin Li and Boyang Li. Enhancing vision-language compositional understanding with multimodal synthetic data. In IEEE Conference on Computer Vision and Pattern Recognition, 2025. 3

[17] Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International Conference on Machine Learning, 2022. 5, 7

[18] Xianhang Li, Zeyu Wang, and Cihang Xie. An inverse scaling law for CLIP training. In Neural Information Processing Systems, 2023. 5

[19] Zhiqiu Lin, Deepak Pathak, Baiqi Li, Jiayao Li, Xide Xia, Graham Neubig, Pengchuan Zhang, and Deva Ramanan. Evaluating text-to-visual generation with image-to-text generation. In European Conference on Computer Vision, 2024. 8

[20] Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In IEEE Conference on Computer Vision and Pattern Recognition,

[21] Yanqing Liu, Xianhang Li, Zeyu Wang, Bingchen Zhao, and Cihang Xie. CLIPS: An enhanced CLIP framework for learning with synthetic captions. arXiv [cs.CV], 2024. 3, 5

[22] Yasumasa Onoe, Sunayana Rane, Zachary Berger, Yonatan Bitton, Jaemin Cho, Roopal Garg, Alexander Ku, Zarana Parekh, Jordi Pont-Tuset, Garrett Tanzer, Su Wang, and Jason Baldridge. DOCCI: Descriptions of Connected and Contrasting Images. In European Conference on Computer Vision, 2024. 5

[23] Maitreya Patel, Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, and Yezhou Yang. TripletCLIP: Improving compositional reasoning of CLIP via synthetic vision-language negatives. In Neural Information Processing Systems, 2024. 1, 3, 5, 12

[24] Amit Peleg, Naman Deep Singh, and Matthias Hein. Advancing compositional awareness in CLIP with efficient fine-tuning. In Neural Information Processing Systems, 2025. 3, 5, 12

[25] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, 2021. 1, 3

[26] Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565, 2018. 4

[27] Amanpreet Singh, Ronghang Hu, Vedanuj Goswami, Guillaume Couairon, Wojciech Galuba, Marcus Rohrbach, and Douwe Kiela. FLAVA: A foundational language and vision alignment model. In IEEE Conference on Computer Vision and Pattern Recognition, 2022. 5, 7

[28] Jaisidh Singh, Ishaan Shrivastava, Mayank Vatsa, Richa Singh, and Aparna Bharati. Learn “no” to say “yes” better: Improving vision-language models via negations. In Winter Conference on Applications of Computer Vision, 2025. 1, 3, 5, 12

[29] Yingtian Tang, Yutaro Yamada, Yoyo Zhang, and Ilker Yildirim. When are lemons purple? the concept association bias of vision-language models. In Empirical Methods in Natural Language Processing, 2023. 1, 3

[30] Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, Olivier Hénaff, Jeremiah Harmsen, Andreas Steiner, and Xiaohua Zhai. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense featu...

[31] Rui Xiao, Sanghwan Kim, Mariana-Iuliana Georgescu, Zeynep Akata, and Stephan Alaniz. FLAIR: Vlm with fine-grained language-informed image representations. In IEEE Conference on Computer Vision and Pattern Recognition,

[32] Chunyu Xie, Bin Wang, Fanjing Kong, Jincheng Li, Dawei Liang, Gengshen Zhang, Dawei Leng, and Yuhui Yin. FG-CLIP: Fine-grained visual and textual alignment. In International Conference on Machine Learning, 2025. 5, 12

[33] Mert Yüksekgönül, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it? In International Conference on Learning Representations, 2023. 1, 3, 5, 12

[34] Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. In IEEE International Conference on Computer Vision,

[35] Le Zhang, Rabiul Awal, and Aishwarya Agrawal. Contrasting intra-modal and ranking cross-modal hard negatives to enhance visio-linguistic compositional understanding. In IEEE Conference on Computer Vision and Pattern Recognition, 2024. 1, 3, 5, 12

[36] Chenhao Zheng, Jieyu Zhang, Aniruddha Kembhavi, and Ranjay Krishna. Iterated learning improves compositionality in large vision-language models. In IEEE Conference on Computer Vision and Pattern Recognition, 2024. 5, 12

[37] Kecheng Zheng, Yifei Zhang, Wei Wu, Fan Lu, Shuailei Ma, Xin Jin, Wei Chen, and Yujun Shen. Dreamlip: Language-image pre-training with long captions. In European Conference on Computer Vision, 2024. 3, 4, 5, 12

A. Additional experimental results

We provide additional results of our method, C²LIP, and the baseline contrastive models employing the same V...