No Hard Negatives Required: Concept Centric Learning Leads to Compositionality without Degrading Zero-shot Capabilities of Contrastive Models
Pith reviewed 2026-05-21 10:20 UTC · model grok-4.3
The pith
Splitting captions into short concept-centric parts and adding attention pooling lets contrastive vision-language models learn compositionality while keeping zero-shot capabilities intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper argues that concept-centric learning enables contrastive vision-language models to develop compositional representations. The approach has two parts: long captions are broken into short concept-focused parts, and cross-modal attention pooling creates a matching visual embedding for each part. The method needs no hard negative samples and reports state-of-the-art results on standard compositionality benchmarks while preserving or improving zero-shot classification and retrieval performance, with no added inference overhead.
What carries the argument
Concept-centric caption splitting combined with parameter-free cross-modal attention-pooling to generate concept-specific embeddings from the image encoder.
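The pooling step can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the text embedding of each caption part acts as the query, the image encoder's patch tokens act as both keys and values, and no learned projections are involved (which is one way to read "parameter-free"). The function name `concept_attention_pool` and the `temperature` value are hypothetical.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def l2_normalize(x, axis=-1, eps=1e-8):
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)


def concept_attention_pool(patch_tokens, concept_text_emb, temperature=0.07):
    """Pool patch tokens into one visual embedding per concept.

    patch_tokens:     (num_patches, dim) image encoder outputs
    concept_text_emb: (num_concepts, dim) text embeddings of caption parts
    Returns (pooled, weights) with shapes (num_concepts, dim) and
    (num_concepts, num_patches). No learned parameters are used (assumption).
    """
    patches = l2_normalize(patch_tokens)
    queries = l2_normalize(concept_text_emb)
    scores = queries @ patches.T / temperature  # cosine similarity per patch
    weights = softmax(scores, axis=-1)          # attention over patches
    pooled = weights @ patch_tokens             # weighted average of patches
    return l2_normalize(pooled), weights


# Toy usage with random features standing in for real encoder outputs.
rng = np.random.default_rng(0)
patches = rng.normal(size=(49, 16))
concepts = patches[[3, 20]]  # pretend two concepts align with patches 3 and 20
pooled, weights = concept_attention_pool(patches, concepts)
print(pooled.shape, weights.shape)
```

Because the weights depend only on similarities between existing embeddings, a step of this kind adds no parameters. It is also needed only where caption parts are available, which is consistent with the claim of unchanged inference cost.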
If this is right
- Reaches state-of-the-art performance on standard compositionality benchmarks
- Maintains or improves zero-shot and retrieval capabilities
- Requires no increase in inference cost
- Does not rely on hard negative samples that often harm basic model abilities
Where Pith is reading between the lines
- This method provides an alternative to hard-negative based approaches for improving compositionality.
- The parameter-free nature of the attention pooling suggests it can be added to existing models with minimal changes.
- If these root causes prove primary, the same splitting and pooling steps could address binding failures in other multimodal contrastive setups.
Load-bearing premise
The two identified root causes, long captions that do not require compositional representations and global pooling that loses binding information, are assumed to be the dominant factors limiting compositionality in these models.
What would settle it
Training models with and without the caption splitting and attention pooling steps, then checking whether compositionality benchmark scores rise substantially only when both changes are present while zero-shot and retrieval metrics stay stable or improve.
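The test described above amounts to a 2x2 ablation. The sketch below enumerates the four configurations and encodes the decision rule as a function. The names `ablation_grid` and `both_changes_needed` and the thresholds `min_gain` and `max_drop` are hypothetical; what counts as a "substantial" rise or a "stable" zero-shot score is a judgment the source does not quantify.

```python
from itertools import product


def ablation_grid():
    """Enumerate the four training configurations of the 2x2 ablation."""
    return [
        {"caption_splitting": split, "attention_pooling": pool}
        for split, pool in product([False, True], repeat=2)
    ]


def both_changes_needed(comp, zero_shot, min_gain, max_drop):
    """Check the decision rule on measured scores.

    comp and zero_shot map (caption_splitting, attention_pooling) tuples to a
    compositionality score and a zero-shot score respectively.
    Returns True when the full configuration beats every other configuration
    by at least min_gain on compositionality, and its zero-shot score is no
    more than max_drop below the baseline.
    """
    full, base = (True, True), (False, False)
    others = [key for key in comp if key != full]
    gain_only_when_both = all(comp[full] - comp[key] >= min_gain for key in others)
    zero_shot_stable = zero_shot[full] >= zero_shot[base] - max_drop
    return gain_only_when_both and zero_shot_stable


for config in ablation_grid():
    print(config)
```

If either single-change configuration already closes most of the gap, the rule returns False, which corresponds to the case where one of the two remedies is not load-bearing.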
Original abstract
Contrastive vision-language (V&L) models remain a popular choice for various applications. However, several limitations have emerged, most notably the limited ability of V&L models to learn compositional representations. Prior methods often addressed this limitation by generating custom training data to obtain hard negative samples. Hard negatives have been shown to improve performance on compositionality tasks, but are often specific to a single benchmark, do not generalize, and can cause substantial degradation of basic V&L capabilities such as zero-shot or retrieval performance, rendering them impractical. In this work we follow a different approach. We identify two root causes that limit compositionality performance of V&Ls: 1) Long training captions do not require a compositional representation; and 2) The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place. As a remedy, we propose two simple solutions: 1) We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder. With these two changes and simple auxiliary contrastive losses, we obtain SOTA performance on standard compositionality benchmarks, while maintaining or improving strong zero-shot and retrieval capabilities. This is achieved without increasing inference cost. We release the code for this work at https://github.com/saic-fi/concept_centric_clip.
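The abstract mentions "simple auxiliary contrastive losses" without giving their form. One plausible reading, shown here purely as an assumption, is a CLIP-style symmetric InfoNCE loss applied to pairs of concept-centric visual embeddings and caption-part text embeddings. The function name `symmetric_info_nce` and the temperature value are hypothetical, and the paper may use a different formulation (for example a sigmoid loss).

```python
import numpy as np


def log_softmax(x, axis):
    # Numerically stable log-softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    return x - np.log(np.exp(x).sum(axis=axis, keepdims=True))


def symmetric_info_nce(visual_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over N matched (visual, text) pairs.

    visual_emb, text_emb: arrays of shape (N, dim); row i of each is a pair.
    All other rows in the batch act as in-batch negatives, so no explicitly
    constructed hard negatives are involved.
    """
    v = visual_emb / np.linalg.norm(visual_emb, axis=1, keepdims=True)
    t = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = v @ t.T / temperature
    idx = np.arange(len(v))
    loss_v2t = -log_softmax(logits, axis=1)[idx, idx].mean()  # image to text
    loss_t2v = -log_softmax(logits, axis=0)[idx, idx].mean()  # text to image
    return 0.5 * (loss_v2t + loss_t2v)


# Toy usage: perfectly aligned pairs versus deliberately misaligned pairs.
rng = np.random.default_rng(0)
emb = rng.normal(size=(8, 32))
print(symmetric_info_nce(emb, emb))
print(symmetric_info_nce(emb, np.roll(emb, 1, axis=0)))
```

In this reading the auxiliary term would be added to the usual global image-caption loss during training only, leaving the inference path unchanged.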
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript diagnoses limited compositionality in contrastive vision-language models as arising from two root causes: long training captions that do not necessitate compositional representations, and global pooling in text and image encoders that eliminates binding information. It proposes remedying these via (1) splitting captions into short concept-centric parts using off-the-shelf NLP tools and aligning them to images, and (2) a parameter-free cross-modal attention pooling to produce concept-centric visual embeddings. Combined with auxiliary contrastive losses, these changes are claimed to deliver SOTA results on standard compositionality benchmarks while preserving or improving zero-shot classification and retrieval performance, all without raising inference cost. Code is released.
Significance. If the empirical claims are substantiated, the work would be significant because it offers a lightweight, generalizable alternative to hard-negative mining that avoids benchmark-specificity and capability degradation. The parameter-free pooling and lack of inference overhead make the method practical for existing CLIP-style pipelines, and the public code release aids reproducibility and follow-up work.
major comments (2)
- Abstract: The central claim that caption splitting and attention pooling directly remedy the identified root causes and yield SOTA compositionality without zero-shot degradation rests on the assumption that these are the primary failure modes. No diagnostic evidence (e.g., probing for attribute-object binding before/after global pooling or controlled tests isolating parser errors) is described to rule out alternative drivers such as loss formulation or data statistics; if auxiliary losses alone suffice, the proposed remedies are not load-bearing for the result.
- Abstract and §3 (Method): The assertion that long captions 'do not require a compositional representation' and that global pooling causes 'complete loss of the necessary information' is presented without quantitative support or ablation showing that splitting and attention pooling are necessary rather than incidental to the reported gains. This leaves open the possibility that gains fail to generalize beyond the chosen benchmarks if other untested factors dominate.
minor comments (1)
- The description of the cross-modal attention pooling would benefit from an explicit equation or pseudocode to clarify how concept-centric visual embeddings are formed from the image encoder outputs.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback on our manuscript. We address the major comments point by point below, providing clarifications based on the empirical results and committing to revisions that add supporting analysis where the current version is limited.
- Referee: Abstract: The central claim that caption splitting and attention pooling directly remedy the identified root causes and yield SOTA compositionality without zero-shot degradation rests on the assumption that these are the primary failure modes. No diagnostic evidence (e.g., probing for attribute-object binding before/after global pooling or controlled tests isolating parser errors) is described to rule out alternative drivers such as loss formulation or data statistics; if auxiliary losses alone suffice, the proposed remedies are not load-bearing for the result.
  Authors: We acknowledge that the manuscript does not present explicit diagnostic probes, such as attribute-object binding analysis before versus after global pooling or controlled tests isolating parser errors from other factors. Our current evidence consists of end-to-end performance gains on compositionality benchmarks together with ablations showing that the full combination outperforms baselines. To directly address the concern that auxiliary losses alone might suffice, we will add a dedicated ablation study in the revised version that trains models using only the auxiliary contrastive losses without caption splitting or cross-modal attention pooling. We will also include a brief discussion of alternative drivers such as loss formulation and data statistics, referencing related work on these topics.
  Revision: yes
- Referee: Abstract and §3 (Method): The assertion that long captions 'do not require a compositional representation' and that global pooling causes 'complete loss of the necessary information' is presented without quantitative support or ablation showing that splitting and attention pooling are necessary rather than incidental to the reported gains. This leaves open the possibility that gains fail to generalize beyond the chosen benchmarks if other untested factors dominate.
  Authors: We agree that the manuscript would benefit from more direct quantitative support for the stated root causes. While Section 4 reports ablations that isolate the contribution of splitting and attention pooling to the final performance numbers, we will expand §3 and the experiments section with additional controlled comparisons: (1) training with long captions versus split concept-centric captions while keeping all other components fixed, and (2) replacing global pooling with the proposed cross-modal attention pooling in isolation. These results will be presented to demonstrate necessity rather than incidental benefit. On generalization, we already evaluate on multiple standard compositionality benchmarks; we will add a short limitations paragraph noting that further tests on additional datasets would be valuable for future work.
  Revision: yes
Circularity Check
No circularity: empirical modifications rest on standard tools and benchmarks
The paper presents an empirical approach that splits captions via off-the-shelf NLP parsers and adds a parameter-free cross-modal attention pool plus auxiliary contrastive losses. These changes are motivated by analysis of training captions and global pooling but do not reduce any claimed result to a fitted parameter renamed as prediction, a self-citation chain, or a self-definitional loop. Performance is reported via direct comparison on compositionality and zero-shot benchmarks rather than by construction from the inputs. No equations or uniqueness theorems are invoked that collapse back to the authors' prior outputs.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: Standard contrastive learning framework for vision-language models remains valid when captions are split into short concept-centric parts.
- Domain assumption: Off-the-shelf NLP software reliably extracts short concept-centric caption segments that correspond to visual elements.
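The second assumption is easier to judge with a concrete splitter in view. The sketch below is a deliberately naive stand-in, not the paper's pipeline: it uses a regular expression instead of a real parser (the abstract says only "standard NLP software"), and the connector list and the name `split_caption` are hypothetical.

```python
import re

# Split on commas, semicolons, and a few connectors (illustrative list only).
_SPLIT_PATTERN = re.compile(
    r"\s*(?:,|;|\band\b|\bwith\b|\bnext to\b)\s*", flags=re.IGNORECASE
)


def split_caption(caption):
    """Break a caption into short parts at the connectors above."""
    parts = [part.strip(" .") for part in _SPLIT_PATTERN.split(caption)]
    return [part for part in parts if part]  # drop empty fragments


print(split_caption("A red ball and a blue cube, next to a wooden table."))
# → ['A red ball', 'a blue cube', 'a wooden table']

print(split_caption("a black and white cat"))
# → ['a black', 'white cat']
```

The second call shows the failure mode the assumption has to rule out: a splitter that breaks "black and white cat" separates an attribute from its object, so the resulting parts no longer correspond to visual elements. A dependency parser would be expected to handle this case better, but how reliably it does so on web captions is exactly what the ledger flags as assumed.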
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "We obtain short concept centric caption parts using standard NLP software and align those with the image; and 2) We introduce a parameter-free cross-modal attention-pooling to obtain concept centric visual embeddings from the image encoder."
- IndisputableMonolith/Foundation/AlexanderDuality.lean: alexander_duality_circle_linking
  Relation between the paper passage and the cited Recognition theorem: unclear
  Paper passage: "The final global pooling in the text and image encoders lead to a complete loss of the necessary information to learn binding in the first place."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.