pith. machine review for the scientific record. sign in

arxiv: 2604.13313 · v1 · submitted 2026-04-14 · 💻 cs.LG

Recognition: unknown

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3

classification 💻 cs.LG
keywords compositional understandingcontrastive pretrainingnegative mininglexical concretenessvision-language modelsInfoNCE losshard negativesgradient imbalance
0
0 comments X

The pith

Lexical concreteness guides the selection of negative samples that strengthen compositional learning in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models struggle with word order and attribute binding because standard contrastive pretraining supplies too few examples that force the model to notice small semantic shifts. The authors identify lexical concreteness as the factor that makes a negative sample effective: changing a concrete word such as a specific object produces larger, more reliable visual and structural differences than changing an abstract word such as a quality or relation. They therefore build ConcretePlant to locate and alter these perceptually grounded terms in captions, creating harder negatives automatically. A second problem is that easy pairs dominate the InfoNCE gradient; the Cement loss corrects this by tying the margin penalty to a psycholinguistic concreteness score so that harder pairs receive appropriate emphasis. The resulting Slipform system records new best scores on compositional benchmarks, cross-modal retrieval, and linear probing tasks.

Core claim

Lexical concreteness determines negative-sample efficacy because modifying highly concrete terms produces more pronounced structural and visual discrepancies than modifying abstract terms, thereby supplying a stronger learning signal during contrastive pretraining. ConcretePlant isolates and manipulates these perceptually grounded concepts, while the margin-based Cement loss dynamically calibrates penalization by correlating psycholinguistic scores with sample difficulty and thereby removes the gradient imbalance that otherwise lets easy pairs overwhelm InfoNCE optimization. The integrated Slipform framework reaches state-of-the-art accuracy on compositional evaluation benchmarks, general跨模态

What carries the argument

Lexical concreteness as the determinant of negative-sample efficacy, realized through ConcretePlant for systematic isolation and manipulation of perceptually grounded concepts together with the Cement margin loss that balances gradients in InfoNCE.

If this is right

  • Compositional reasoning improves without requiring new generative architectures for negative creation.
  • Performance rises on word-order and attribute-binding tests as well as on general cross-modal retrieval.
  • Single-label and multi-label linear probing accuracy increases when the same pretraining objective is used.
  • The gradient imbalance in InfoNCE is mitigated once penalization is tied to an independent measure of sample hardness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same concreteness signal could be used to curate training curricula or to weight positive pairs in other self-supervised regimes.
  • Extending the method to text-only or audio-visual models would test whether perceptual grounding remains the dominant hardness cue outside image-text pairs.
  • Combining concreteness scores with additional linguistic features such as imageability or specificity might yield still stronger negatives.
  • Applying Cement-style margin calibration at fine-tuning time rather than only at pretraining could preserve the gains on downstream tasks.

Load-bearing premise

Modifying highly concrete terms generates more pronounced structural and visual discrepancies than modifying abstract terms, thereby providing a substantially stronger learning signal during contrastive pretraining.

What would settle it

A controlled ablation in which negatives formed by modifying abstract terms produce equal or higher compositional benchmark scores than negatives formed by modifying concrete terms.

read the original abstract

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analyses of the InfoNCE further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately overwhelm the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated utilizing a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, single and multi label linear probing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The paper proposes Slipform, an integrated framework for vision-language pretraining that combines ConcretePlant—a concreteness-guided negative mining strategy that preferentially modifies highly concrete lexical items to create harder negatives—with the Cement loss, a margin-based contrastive objective that uses psycholinguistic concreteness scores to dynamically calibrate per-pair penalties. The central claim is that this approach resolves gradient imbalance in standard InfoNCE, supplies stronger learning signals for compositional distinctions, and yields state-of-the-art results on compositional benchmarks, cross-modal retrieval, and linear probing tasks.

Significance. If the empirical claims hold after proper controls, the work supplies a lightweight, psycholinguistically motivated alternative to generative hard-negative methods for addressing compositionality failures in VLMs. The explicit linkage between lexical concreteness, visual discrepancy magnitude, and loss weighting is a potentially reusable design principle that could improve sample efficiency without architectural changes.

major comments (4)
  1. [§3.2] §3.2 (ConcretePlant description): the core premise that concreteness scores reliably predict larger structural/visual discrepancies is not isolated from confounds such as term frequency, part-of-speech, or object salience; no ablation compares concreteness-guided selection against frequency-matched or random noun replacement baselines, leaving open the possibility that simpler heuristics reproduce the reported gains.
  2. [§4.1] §4.1 and Table 2 (InfoNCE gradient analysis): the claimed severe gradient imbalance is asserted but the manuscript provides neither quantitative plots of per-pair gradient magnitudes nor a derivation showing how the imbalance restricts bandwidth for nuanced pairs; without these, the motivation for Cement loss remains qualitative.
  3. [§5] §5 (Cement loss formulation): the margin in Cement is set using concreteness scores, yet the paper does not demonstrate that this choice is superior to standard hard-negative margins or temperature scaling; an ablation replacing the concreteness-derived margin with a fixed or frequency-derived margin is missing and is load-bearing for the claim that psycholinguistic grounding is essential.
  4. [§6.3] §6.3 (compositional benchmark results): SOTA claims are presented without error bars, statistical significance tests against the strongest recent hard-negative baselines, or component-wise ablations (ConcretePlant alone vs. Cement alone vs. both), making it impossible to attribute gains specifically to the proposed mechanisms.
minor comments (2)
  1. [Eq. 7] Notation for the Cement loss (Eq. 7) uses an undefined symbol for the concreteness-derived margin; a clear definition or reference to the preceding section is needed.
  2. [Figure 3] Figure 3 (gradient imbalance visualization) is referenced but the caption does not specify the exact dataset slice or number of pairs used, reducing reproducibility.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our contributions where possible and outlining planned revisions to address the concerns.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (ConcretePlant description): the core premise that concreteness scores reliably predict larger structural/visual discrepancies is not isolated from confounds such as term frequency, part-of-speech, or object salience; no ablation compares concreteness-guided selection against frequency-matched or random noun replacement baselines, leaving open the possibility that simpler heuristics reproduce the reported gains.

    Authors: We agree that explicitly isolating concreteness from confounds such as term frequency is important for the claim. Section 3.1 presents correlation analyses between concreteness and visual discrepancy, but these do not fully control for all listed factors via dedicated baselines. In the revised manuscript we will add an ablation comparing ConcretePlant to frequency-matched noun replacement and random baselines on the same data splits, allowing direct assessment of whether concreteness guidance yields gains beyond these simpler heuristics. revision: yes

  2. Referee: [§4.1] §4.1 and Table 2 (InfoNCE gradient analysis): the claimed severe gradient imbalance is asserted but the manuscript provides neither quantitative plots of per-pair gradient magnitudes nor a derivation showing how the imbalance restricts bandwidth for nuanced pairs; without these, the motivation for Cement loss remains qualitative.

    Authors: The manuscript contains a derivation of the per-pair gradient expression for InfoNCE in the supplementary material, and Section 4.1 illustrates the imbalance via loss-component breakdowns. However, we acknowledge the absence of explicit quantitative plots. We will add figures showing per-pair gradient magnitudes stratified by concreteness and difficulty in the revision, together with a concise explanation of how high-magnitude easy pairs reduce effective learning bandwidth for nuanced compositional distinctions. revision: yes

  3. Referee: [§5] §5 (Cement loss formulation): the margin in Cement is set using concreteness scores, yet the paper does not demonstrate that this choice is superior to standard hard-negative margins or temperature scaling; an ablation replacing the concreteness-derived margin with a fixed or frequency-derived margin is missing and is load-bearing for the claim that psycholinguistic grounding is essential.

    Authors: We will include the requested ablation in the revised version. Specifically, we will compare the concreteness-derived margin against (i) a fixed margin and (ii) a frequency-derived margin while keeping all other hyperparameters constant. The results will be reported alongside the main tables to quantify whether the psycholinguistic grounding contributes measurably beyond standard margin or temperature choices. revision: yes

  4. Referee: [§6.3] §6.3 (compositional benchmark results): SOTA claims are presented without error bars, statistical significance tests against the strongest recent hard-negative baselines, or component-wise ablations (ConcretePlant alone vs. Cement alone vs. both), making it impossible to attribute gains specifically to the proposed mechanisms.

    Authors: We agree that error bars, significance testing, and component ablations are required to support the attribution of gains. In the revision we will rerun the main experiments with multiple random seeds to report mean ± standard deviation, apply paired statistical tests against the strongest hard-negative baselines, and add component-wise ablations (ConcretePlant only, Cement only, and the full Slipform combination) on the compositional benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external psycholinguistic inputs and empirical validation rather than self-definition or fitted predictions

full rationale

The abstract presents ConcretePlant as a method that isolates perceptually grounded concepts based on the principle that concrete terms yield stronger discrepancies, and Cement loss as a margin-based objective that correlates external psycholinguistic scores with sample difficulty to calibrate penalties. No equations, derivations, or self-citations are shown that reduce any claimed result to its own fitted parameters or prior outputs by construction. The InfoNCE gradient analysis is diagnostic rather than definitional, and the SOTA claims are framed as outcomes of comprehensive evaluations. The derivation chain therefore remains self-contained against external benchmarks and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities can be extracted or audited. The approach implicitly relies on external psycholinguistic concreteness scores and the standard InfoNCE formulation.

pith-pipeline@v0.9.0 · 5515 in / 1224 out tokens · 41869 ms · 2026-05-10T15:21:22.970018+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

73 extracted references · 23 canonical work pages · 7 internal anchors

  1. [1]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

  2. [2]

    Sigmoid loss for language image pre-training

    Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023

  3. [3]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InAnnual Meeting of the Association for Computational Linguistics (ACL), pages 2556–2565, 2018

  4. [4]

    Wit: Wikipedia-basedimagetextdatasetformultimodalmultilingualmachinelearning

    Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-basedimagetextdatasetformultimodalmultilingualmachinelearning. InProceedings 10 Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding of the 44th international ACM SIGIR conference on research and development...

  5. [5]

    Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in Neural Information Processing Systems (NeurIPS), 35:25278–25294, 2022

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in Neural Information Processing Systems (NeurIPS), 35:25278–25294, 2022

  6. [6]

    Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems (NeurIPS), 36:27092–27112, 2023

    Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems (NeurIPS), 36:27092–27112, 2023

  7. [7]

    Meta clip 2: A worldwide scaling recipe.arXiv preprint arXiv:2507.22062,

    Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. Meta clip 2: A worldwide scaling recipe.arXiv preprint arXiv:2507.22062, 2025

  8. [8]

    Scaling pre-training to one hundred billion data for vision language models.arXiv preprint arXiv:2502.07617, 2025

    Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, and Xiaohua Zhai. Scaling pre-training to one hundred billion data for vision language models.arXiv preprint arXiv:2502.07617, 2025

  9. [9]

    Vision-language models do not understand negation

    Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip HS Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand negation. InProceedings of theIEEE/CVFConferenceonComputerVisionandPatternRecognition(CVPR),pages29612–29622, 2025

  10. [10]

    Valse: Atask-independentbenchmarkforvisionandlanguagemodelscenteredonlinguistic phenomena

    Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt. Valse: Atask-independentbenchmarkforvisionandlanguagemodelscenteredonlinguistic phenomena. InAnnual Meeting of the Association for Computational Linguistics (ACL), pages 8253–8280, 2022

  11. [11]

    Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations.arXiv preprint arXiv:2207.00221, 2022

    Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, and Jianwei Yin. Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations.arXiv preprint arXiv:2207.00221, 2022

  12. [12]

    Clip the bias: How useful is balancing data in multimodal learning?arXiv preprint arXiv:2403.04547, 2024

    Ibrahim Alabdulmohsin, Xiao Wang, Andreas Steiner, Priya Goyal, Alexander D’Amour, and Xiaohua Zhai. Clip the bias: How useful is balancing data in multimodal learning?arXiv preprint arXiv:2403.04547, 2024

  13. [13]

    Winoground: Probing vision and language models for visio-linguistic compositionality

    Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5238–5248, 2022

  14. [14]

    Clip behaves like a bag-of-words model cross-modally but not uni-modally.arXiv preprint arXiv:2502.03566, 2025

    Darina Koishigarina, Arnas Uselis, and Seong Joon Oh. Clip behaves like a bag-of-words model cross-modally but not uni-modally.arXiv preprint arXiv:2502.03566, 2025

  15. [15]

    Contrastive learning with hard negative samples,

    Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples.arXiv preprint arXiv:2010.04592, 2020

  16. [16]

    Contrasting intra-modal and ranking cross- modal hard negatives to enhance visio-linguistic compositional understanding

    Le Zhang, Rabiul Awal, and Aishwarya Agrawal. Contrasting intra-modal and ranking cross- modal hard negatives to enhance visio-linguistic compositional understanding. InProceedings of theIEEE/CVFConferenceonComputerVisionandPatternRecognition(CVPR),pages13774–13784, 2024. 11 Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compo...

  17. [17]

    spacy 2: Natural language understanding with bloom embeddings, convolu- tional neural networks and incremental parsing.(No Title), 2017

    Matthew Honnibal. spacy 2: Natural language understanding with bloom embeddings, convolu- tional neural networks and incremental parsing.(No Title), 2017

  18. [18]

    When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022

    Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022

  19. [19]

    Countercurate: Enhancing physical and semantic visio-linguistic compositional reasoning via counterfactual examples

    Jianrui Zhang, Mu Cai, Tengyang Xie, and Yong Jae Lee. Countercurate: Enhancing physical and semantic visio-linguistic compositional reasoning via counterfactual examples. InFindings of the Association for Computational Linguistics (ACL), pages 15481–15495, 2024

  20. [20]

    Improving clip training with language rewrites.Advances in Neural Information Processing Systems (NeurIPS), 36: 35544–35575, 2023

    Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites.Advances in Neural Information Processing Systems (NeurIPS), 36: 35544–35575, 2023

  21. [21]

    Tripletclip: Improving compositional reasoning of clip via syntheticvision-languagenegatives.AdvancesinNeuralInformationProcessingSystems(NeurIPS), 37:32731–32760, 2024

    Maitreya Patel, Naga Sai Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, and Yezhou Yang. Tripletclip: Improving compositional reasoning of clip via syntheticvision-languagenegatives.AdvancesinNeuralInformationProcessingSystems(NeurIPS), 37:32731–32760, 2024

  22. [22]

    Literal and metaphorical sense identification through concrete and abstract context

    Peter Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. Literal and metaphorical sense identification through concrete and abstract context. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 680–690, 2011

  23. [23]

    Psychology Press, 2013

    Allan Paivio.Imagery and verbal processes. Psychology Press, 2013

  24. [24]

    Quantifying the visual concreteness of words and topics in multimodal datasets

    Jack Hessel, David Mimno, and Lillian Lee. Quantifying the visual concreteness of words and topics in multimodal datasets. InConference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 2194–2205, 2018

  25. [25]

    Concreteness ratings for 40 thousand generally known english word lemmas.Behavior research methods, 46(3):904–911, 2014

    Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness ratings for 40 thousand generally known english word lemmas.Behavior research methods, 46(3):904–911, 2014

  26. [26]

    Dense and aligned captions (dac) promote compositional reasoning in vl models.Advances in Neural Information Processing Systems (NeurIPS), 36:76137–76150, 2023

    Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, et al. Dense and aligned captions (dac) promote compositional reasoning in vl models.Advances in Neural Information Processing Systems (NeurIPS), 36:76137–76150, 2023

  27. [27]

    Revisit large-scale image-caption data in pre-training multimodal foundation models.arXiv preprint arXiv:2410.02740, 2024

    Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, et al. Revisit large-scale image-caption data in pre-training multimodal foundation models.arXiv preprint arXiv:2410.02740, 2024

  28. [28]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, ChengenHuang, ChenxuLv, etal. Qwen3technicalreport.arXivpreprintarXiv:2505.09388, 2025

  29. [29]

    Adversarial diffusion distillation

    Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer, 2024

  30. [30]

    Harnessing meta-learning for controllable full-frame video stabilization.arXiv preprint arXiv:2508.18859, 2025

    Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, Tae Hyun Kim, Vivek Gupta, Haonan Luo, and Tianrui Li. Harnessing meta-learning for controllable full-frame video stabilization.arXiv preprint arXiv:2508.18859, 2025

  31. [31]

    Self-augmented visual contrastive decoding.arXiv preprint arXiv:2510.13315, 2025

    Eun Woo Im, Muhammad Kashif Ali, and Vivek Gupta. Self-augmented visual contrastive decoding.arXiv preprint arXiv:2510.13315, 2025. 12 Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

  32. [32]

    Representation Learning with Contrastive Predictive Coding

    Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

  33. [33]

    Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979

    William S Cleveland. Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979

  34. [34]

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere

    Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. InInternational Conference on Machine Learning (ICML), pages 9929–9939. PMLR, 2020

  35. [35]

    arXiv preprint arXiv:2010.06682 , year=

    Tiffany Tianhui Cai, Jonathan Frankle, David J Schwab, and Ari S Morcos. Are all negatives created equal in contrastive instance discrimination?arXiv preprint arXiv:2010.06682, 2020

  36. [36]

    Sulla quantizzazione del gas perfetto monoatomico.Rendiconti Lincei, 145, 1926

    Enrico Fermi. Sulla quantizzazione del gas perfetto monoatomico.Rendiconti Lincei, 145, 1926

  37. [37]

    On the theory of quantum mechanics.Proceedings of the Royal Society of London

    Paul Adrien Maurice Dirac. On the theory of quantum mechanics.Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 112 (762):661–677, 1926

  38. [38]

    The fermi–dirac distribution provides a calibrated probabilistic output for binary classifiers

    Sung-Cheol Kim, Adith S Arun, Mehmet Eren Ahsen, Robert Vogel, and Gustavo Stolovitzky. The fermi–dirac distribution provides a calibrated probabilistic output for binary classifiers. Proceedings of the National Academy of Sciences, 118(34):e2100761118, 2021

  39. [39]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InProceedings of the European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014

  40. [40]

    Deep visual-semantic alignments for generating image descrip- tions

    Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descrip- tions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137, 2015

  41. [41]

    Sugarcrepe: Fixinghackablebenchmarksforvision-languagecompositionality.AdvancesinNeuralInformation Processing Systems (NeurIPS), 36:31096–31116, 2023

    Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixinghackablebenchmarksforvision-languagecompositionality.AdvancesinNeuralInformation Processing Systems (NeurIPS), 36:31096–31116, 2023

  42. [42]

    Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations.Advances in Neural Information Processing Systems (NeurIPS), 37:17972–18018, 2024

    Sri H Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations.Advances in Neural Information Processing Systems (NeurIPS), 37:17972–18018, 2024

  43. [43]

    Imagenet large scale visual recognition challenge.International Journal of Computer Vision (IJCV), 115(3):211–252, 2015

    Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International Journal of Computer Vision (IJCV), 115(3):211–252, 2015

  44. [44]

    Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the association for computational linguistics, 2:67–78, 2014

  45. [45]

    A large-scale study of representation learning with the visual task adaptation benchmark

    Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark.arXiv preprint arXiv:1910.04867, 2019. 13 Concrete Jungle: Towards Concreteness Paved Contr...

  46. [46]

    Perception Encoder: The best visual embeddings are not at the output of the network

    Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network.arXiv preprint arXiv:2504.13181, 2025

  47. [47]

    Demystifying clip data.arXiv preprint arXiv:2309.16671, 2023

    Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data.arXiv preprint arXiv:2309.16671, 2023

  48. [48]

    DINOv2: Learning Robust Visual Features without Supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

  49. [49]

    BERTScore: Evaluating Text Generation with BERT

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019

  50. [50]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019

  51. [51]

    Deep variational bayesian modeling of haze degradation process

    Eun Woo Im, Junsung Shin, Sungyong Baik, and Tae Hyun Kim. Deep variational bayesian modeling of haze degradation process. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 895–904, 2023

  52. [52]

    Harnessing meta- learning for improving full-frame video stabilization

    Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, and Tae Hyun Kim. Harnessing meta- learning for improving full-frame video stabilization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12605–12614, 2024

  53. [53]

    Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding

    Chaoyu Li, Eun Woo Im, and Pooyan Fazli. Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13723–13733, 2025

  54. [54]

    SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

    Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

  55. [55]

    Decoupled contrastive learning

    Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun. Decoupled contrastive learning. InProceedings of the European Conference on Computer Vision (ECCV), pages 668–684. Springer, 2022

  56. [56]

    Clipscore: A reference-free evaluation metric for image captioning

    Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 7514–7528, 2021

  57. [57]

    Differential-informed sample selection accelerates multimodal contrastive learning

    Zihua Zhao, Feng Hong, Mengxi Chen, Pengyi Chen, Benyuan Liu, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Differential-informed sample selection accelerates multimodal contrastive learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2930–2940, 2025

  58. [58]

    Veclip: Improving clip training via visual-enriched captions

    Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, et al. Veclip: Improving clip training via visual-enriched captions. InProceedings of the European Conference on Computer Vision (ECCV), pages 111–127. Springer, 2024. 14 Concrete Jungle: Towards Concreteness Paved Contra...

  59. [59]

    Provablestochasticoptimizationforglobalcontrastivelearning: Smallbatchdoesnotharm performance

    Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provablestochasticoptimizationforglobalcontrastivelearning: Smallbatchdoesnotharm performance. InInternational Conference on Machine Learning (ICML), pages 25760–25782. PMLR, 2022

  60. [60]

    Disco-clip: A distributed contrastive loss for memory efficient clip training

    Yihao Chen, Xianbiao Qi, Jianan Wang, and Lei Zhang. Disco-clip: A distributed contrastive loss for memory efficient clip training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22648–22657, 2023

  61. [61]

    Know" no" better: A data-driven approach for enhancing negation awareness in clip

    Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know" no" better: A data-driven approach for enhancing negation awareness in clip. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2825–2835, 2025

  62. [62]

    Teaching clip to count to ten

    Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3170–3180, 2023

  63. [63]

    Nemo: Can multimodal llms identify attribute-modified objects?arXiv preprint arXiv:2411.17794, 2024

    Jiaxuan Li, Junwen Mo, MinhDuc Vo, Akihiro Sugimoto, and Hideki Nakayama. Nemo: Can multimodal llms identify attribute-modified objects?arXiv preprint arXiv:2411.17794, 2024

  64. [64]

    What’s “up” with vision-language models? investigating their struggle with spatial reasoning

    Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 9161–9175, 2023

  65. [65]

    Spatial457: A diagnostic benchmark for 6d spatial reasoning of large mutimodal models

    Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A diagnostic benchmark for 6d spatial reasoning of large mutimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24669–24679, 2025

  66. [66]

    Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, and Ranjay Krishna. Crepe: Can vision-language foundation models reason compositionally? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10910–10921, 2023

  67. [67]

    Vismin: Visual minimal-change understanding.Advances in Neural Information Processing Systems (NeurIPS), 37:107795– 107829, 2024

    Rabiul Awal, Saba Ahmadi, Le Zhang, and Aishwarya Agrawal. Vismin: Visual minimal-change understanding.Advances in Neural Information Processing Systems (NeurIPS), 37:107795– 107829, 2024

  68. [68]

    Explain before you answer: A survey on compositional visual reasoning.arXiv preprint arXiv:2508.17298, 2025

    Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, et al. Explain before you answer: A survey on compositional visual reasoning.arXiv preprint arXiv:2508.17298, 2025

  69. [69]

    Chimera: Compositional image generation using part-based concepting, 2025.https://arxiv.org/abs/2510.18083

    Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, and Chitta Baral. Chimera: Compositional image generation using part-based concepting.arXiv preprint arXiv:2510.18083, 2025

  70. [70]

    Mass: Overcoming language bias in image-text matching

    Jiwan Chung, Seungwon Lim, Sangkyu Lee, and Youngjae Yu. Mass: Overcoming language bias in image-text matching. InAssociation for the Advancement of Artificial Intelligence (AAAI), volume 39, pages 2591–2599, 2025

  71. [71]

    Coarse-to-fine contrastive learning in image-text-graph space for improved vision- language compositionality

    Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, and Yu Chen. Coarse-to-fine contrastive learning in image-text-graph space for improved vision- language compositionality. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 869–893, 2023. 15 Concrete Jungle: Towards Concreteness Paved Contrast...

  72. [72]

    Structure-clip: Towards scene graph knowledge to enhance multi-modal structured representations

    Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, et al. Structure-clip: Towards scene graph knowledge to enhance multi-modal structured representations. InAssociation for the Advancement of Artificial Intelligence (AAAI), volume 38, pages 2417–2425, 2024

  73. [73]

    A hat wearing a man

    Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, and Ranjay Krishna. The hard positive truth about vision-language compositionality. InProceedings of the European Conference on Computer Vision (ECCV), pages 37–54. Springer, 2024. 16 Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding Concrete Jungle: Towards...