arxiv: 2604.13313 · v1 · submitted 2026-04-14 · 💻 cs.LG

Recognition: unknown

Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

Eun Woo Im , Dhruv Madhwal , Vivek Gupta

Authors on Pith no claims yet

Pith reviewed 2026-05-10 15:21 UTC · model grok-4.3

classification 💻 cs.LG

keywords compositional understandingcontrastive pretrainingnegative mininglexical concretenessvision-language modelsInfoNCE losshard negativesgradient imbalance

0 comments

The pith

Lexical concreteness guides the selection of negative samples that strengthen compositional learning in vision-language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models struggle with word order and attribute binding because standard contrastive pretraining supplies too few examples that force the model to notice small semantic shifts. The authors identify lexical concreteness as the factor that makes a negative sample effective: changing a concrete word such as a specific object produces larger, more reliable visual and structural differences than changing an abstract word such as a quality or relation. They therefore build ConcretePlant to locate and alter these perceptually grounded terms in captions, creating harder negatives automatically. A second problem is that easy pairs dominate the InfoNCE gradient; the Cement loss corrects this by tying the margin penalty to a psycholinguistic concreteness score so that harder pairs receive appropriate emphasis. The resulting Slipform system records new best scores on compositional benchmarks, cross-modal retrieval, and linear probing tasks.

Core claim

Lexical concreteness determines negative-sample efficacy because modifying highly concrete terms produces more pronounced structural and visual discrepancies than modifying abstract terms, thereby supplying a stronger learning signal during contrastive pretraining. ConcretePlant isolates and manipulates these perceptually grounded concepts, while the margin-based Cement loss dynamically calibrates penalization by correlating psycholinguistic scores with sample difficulty and thereby removes the gradient imbalance that otherwise lets easy pairs overwhelm InfoNCE optimization. The integrated Slipform framework reaches state-of-the-art accuracy on compositional evaluation benchmarks, general跨模态

What carries the argument

Lexical concreteness as the determinant of negative-sample efficacy, realized through ConcretePlant for systematic isolation and manipulation of perceptually grounded concepts together with the Cement margin loss that balances gradients in InfoNCE.

If this is right

Compositional reasoning improves without requiring new generative architectures for negative creation.
Performance rises on word-order and attribute-binding tests as well as on general cross-modal retrieval.
Single-label and multi-label linear probing accuracy increases when the same pretraining objective is used.
The gradient imbalance in InfoNCE is mitigated once penalization is tied to an independent measure of sample hardness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same concreteness signal could be used to curate training curricula or to weight positive pairs in other self-supervised regimes.
Extending the method to text-only or audio-visual models would test whether perceptual grounding remains the dominant hardness cue outside image-text pairs.
Combining concreteness scores with additional linguistic features such as imageability or specificity might yield still stronger negatives.
Applying Cement-style margin calibration at fine-tuning time rather than only at pretraining could preserve the gains on downstream tasks.

Load-bearing premise

Modifying highly concrete terms generates more pronounced structural and visual discrepancies than modifying abstract terms, thereby providing a substantially stronger learning signal during contrastive pretraining.

What would settle it

A controlled ablation in which negatives formed by modifying abstract terms produce equal or higher compositional benchmark scores than negatives formed by modifying concrete terms.

read the original abstract

Vision-Language Models demonstrate remarkable capabilities but often struggle with compositional reasoning, exhibiting vulnerabilities regarding word order and attribute binding. This limitation arises from a scarcity of informative samples needed to differentiate subtle semantic variations during contrastive pretraining. Although hard negative mining offers a promising remedy, existing methods lack explicit mechanisms to dictate which linguistic elements undergo modification. Instead of engineering generative architectures, this study establishes lexical concreteness as a fundamental determinant of negative sample efficacy. Modifying highly concrete terms generates more pronounced structural and visual discrepancies, providing a substantially stronger learning signal. Leveraging this principle, ConcretePlant is proposed to systematically isolate and manipulate perceptually grounded concepts. Analyses of the InfoNCE further reveals a severe gradient imbalance, where easily distinguishable pairs disproportionately overwhelm the optimization process and restrict the bandwidth available for nuanced learning. To resolve this degradation, the Cement loss is formulated utilizing a margin-based approach. By correlating psycholinguistic scores with sample difficulty, this objective dynamically calibrates the penalization applied to individual training pairs. Comprehensive evaluations substantiate these theoretical claims. The integrated framework, designated as Slipform, achieves state-of-the-art accuracy across diverse compositional evaluation benchmarks, general cross-modal retrieval, single and multi label linear probing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper ties negative mining to lexical concreteness scores and adds a margin-based Cement loss to fix gradient imbalance in InfoNCE, but does not isolate concreteness from simpler cues like noun frequency or visual salience.

read the letter

The main takeaway is that they treat concreteness as a guide for which words to alter when creating hard negatives, then use those scores to set per-pair margins in a new loss. This is meant to give vision-language models a stronger signal on composition during contrastive training. They also flag that standard InfoNCE lets easy pairs swamp the gradients and propose Cement to rebalance by linking difficulty to psycholinguistic ratings. ConcretePlant is the component that picks and modifies the concrete terms in the text side of pairs. That framing is the clearest new angle here. It moves beyond generic hard-negative mining by grounding the choice in an external linguistic property rather than just embedding similarity. The gradient-imbalance observation is also straightforward and worth making explicit. The central assumption still needs more work. The claim that modifying concrete terms creates reliably larger visual and structural gaps than other choices is plausible on its face, but the paper does not appear to test it against obvious confounds such as term frequency, part-of-speech, or simple object-noun selection. If concreteness is mainly a proxy for picking visually distinct nouns, then the psycholinguistic justification is not carrying the load and simpler heuristics could match the gains. The abstract promises SOTA numbers on compositional benchmarks and retrieval tasks, yet the provided details give no quantitative results, ablation tables, or error bars, so it is hard to judge how large or robust the improvement actually is. This paper is for people already working on contrastive objectives and compositional failures in multimodal models. A reader who wants practical levers for negative sampling would get value from the idea even if the experiments need tightening. It deserves a serious referee because the hypothesis is specific, the target problem is real, and the claims are falsifiable. I would send it out with a request for controls that separate concreteness from frequency and salience, plus full experimental reporting.

Referee Report

4 major / 2 minor

Summary. The paper proposes Slipform, an integrated framework for vision-language pretraining that combines ConcretePlant—a concreteness-guided negative mining strategy that preferentially modifies highly concrete lexical items to create harder negatives—with the Cement loss, a margin-based contrastive objective that uses psycholinguistic concreteness scores to dynamically calibrate per-pair penalties. The central claim is that this approach resolves gradient imbalance in standard InfoNCE, supplies stronger learning signals for compositional distinctions, and yields state-of-the-art results on compositional benchmarks, cross-modal retrieval, and linear probing tasks.

Significance. If the empirical claims hold after proper controls, the work supplies a lightweight, psycholinguistically motivated alternative to generative hard-negative methods for addressing compositionality failures in VLMs. The explicit linkage between lexical concreteness, visual discrepancy magnitude, and loss weighting is a potentially reusable design principle that could improve sample efficiency without architectural changes.

major comments (4)

[§3.2] §3.2 (ConcretePlant description): the core premise that concreteness scores reliably predict larger structural/visual discrepancies is not isolated from confounds such as term frequency, part-of-speech, or object salience; no ablation compares concreteness-guided selection against frequency-matched or random noun replacement baselines, leaving open the possibility that simpler heuristics reproduce the reported gains.
[§4.1] §4.1 and Table 2 (InfoNCE gradient analysis): the claimed severe gradient imbalance is asserted but the manuscript provides neither quantitative plots of per-pair gradient magnitudes nor a derivation showing how the imbalance restricts bandwidth for nuanced pairs; without these, the motivation for Cement loss remains qualitative.
[§5] §5 (Cement loss formulation): the margin in Cement is set using concreteness scores, yet the paper does not demonstrate that this choice is superior to standard hard-negative margins or temperature scaling; an ablation replacing the concreteness-derived margin with a fixed or frequency-derived margin is missing and is load-bearing for the claim that psycholinguistic grounding is essential.
[§6.3] §6.3 (compositional benchmark results): SOTA claims are presented without error bars, statistical significance tests against the strongest recent hard-negative baselines, or component-wise ablations (ConcretePlant alone vs. Cement alone vs. both), making it impossible to attribute gains specifically to the proposed mechanisms.

minor comments (2)

[Eq. 7] Notation for the Cement loss (Eq. 7) uses an undefined symbol for the concreteness-derived margin; a clear definition or reference to the preceding section is needed.
[Figure 3] Figure 3 (gradient imbalance visualization) is referenced but the caption does not specify the exact dataset slice or number of pairs used, reducing reproducibility.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below, clarifying our contributions where possible and outlining planned revisions to address the concerns.

read point-by-point responses

Referee: [§3.2] §3.2 (ConcretePlant description): the core premise that concreteness scores reliably predict larger structural/visual discrepancies is not isolated from confounds such as term frequency, part-of-speech, or object salience; no ablation compares concreteness-guided selection against frequency-matched or random noun replacement baselines, leaving open the possibility that simpler heuristics reproduce the reported gains.

Authors: We agree that explicitly isolating concreteness from confounds such as term frequency is important for the claim. Section 3.1 presents correlation analyses between concreteness and visual discrepancy, but these do not fully control for all listed factors via dedicated baselines. In the revised manuscript we will add an ablation comparing ConcretePlant to frequency-matched noun replacement and random baselines on the same data splits, allowing direct assessment of whether concreteness guidance yields gains beyond these simpler heuristics. revision: yes
Referee: [§4.1] §4.1 and Table 2 (InfoNCE gradient analysis): the claimed severe gradient imbalance is asserted but the manuscript provides neither quantitative plots of per-pair gradient magnitudes nor a derivation showing how the imbalance restricts bandwidth for nuanced pairs; without these, the motivation for Cement loss remains qualitative.

Authors: The manuscript contains a derivation of the per-pair gradient expression for InfoNCE in the supplementary material, and Section 4.1 illustrates the imbalance via loss-component breakdowns. However, we acknowledge the absence of explicit quantitative plots. We will add figures showing per-pair gradient magnitudes stratified by concreteness and difficulty in the revision, together with a concise explanation of how high-magnitude easy pairs reduce effective learning bandwidth for nuanced compositional distinctions. revision: yes
Referee: [§5] §5 (Cement loss formulation): the margin in Cement is set using concreteness scores, yet the paper does not demonstrate that this choice is superior to standard hard-negative margins or temperature scaling; an ablation replacing the concreteness-derived margin with a fixed or frequency-derived margin is missing and is load-bearing for the claim that psycholinguistic grounding is essential.

Authors: We will include the requested ablation in the revised version. Specifically, we will compare the concreteness-derived margin against (i) a fixed margin and (ii) a frequency-derived margin while keeping all other hyperparameters constant. The results will be reported alongside the main tables to quantify whether the psycholinguistic grounding contributes measurably beyond standard margin or temperature choices. revision: yes
Referee: [§6.3] §6.3 (compositional benchmark results): SOTA claims are presented without error bars, statistical significance tests against the strongest recent hard-negative baselines, or component-wise ablations (ConcretePlant alone vs. Cement alone vs. both), making it impossible to attribute gains specifically to the proposed mechanisms.

Authors: We agree that error bars, significance testing, and component ablations are required to support the attribution of gains. In the revision we will rerun the main experiments with multiple random seeds to report mean ± standard deviation, apply paired statistical tests against the strongest hard-negative baselines, and add component-wise ablations (ConcretePlant only, Cement only, and the full Slipform combination) on the compositional benchmarks. revision: yes

Circularity Check

0 steps flagged

No circularity: claims rest on external psycholinguistic inputs and empirical validation rather than self-definition or fitted predictions

full rationale

The abstract presents ConcretePlant as a method that isolates perceptually grounded concepts based on the principle that concrete terms yield stronger discrepancies, and Cement loss as a margin-based objective that correlates external psycholinguistic scores with sample difficulty to calibrate penalties. No equations, derivations, or self-citations are shown that reduce any claimed result to its own fitted parameters or prior outputs by construction. The InfoNCE gradient analysis is diagnostic rather than definitional, and the SOTA claims are framed as outcomes of comprehensive evaluations. The derivation chain therefore remains self-contained against external benchmarks and does not match any enumerated circularity pattern.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities can be extracted or audited. The approach implicitly relies on external psycholinguistic concreteness scores and the standard InfoNCE formulation.

pith-pipeline@v0.9.0 · 5515 in / 1224 out tokens · 41869 ms · 2026-05-10T15:21:22.970018+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

73 extracted references · 23 canonical work pages · 7 internal anchors

[1]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational Conference on Machine Learning (ICML), pages 8748–8763. PMLR, 2021

2021
[2]

Sigmoid loss for language image pre-training

Xiaohua Zhai, Basil Mustafa, Alexander Kolesnikov, and Lucas Beyer. Sigmoid loss for language image pre-training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 11975–11986, 2023

2023
[3]

Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. InAnnual Meeting of the Association for Computational Linguistics (ACL), pages 2556–2565, 2018

2018
[4]

Wit: Wikipedia-basedimagetextdatasetformultimodalmultilingualmachinelearning

Krishna Srinivasan, Karthik Raman, Jiecao Chen, Michael Bendersky, and Marc Najork. Wit: Wikipedia-basedimagetextdatasetformultimodalmultilingualmachinelearning. InProceedings 10 Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding of the 44th international ACM SIGIR conference on research and development...

2021
[5]

Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in Neural Information Processing Systems (NeurIPS), 35:25278–25294, 2022

Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models.Advances in Neural Information Processing Systems (NeurIPS), 35:25278–25294, 2022

2022
[6]

Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems (NeurIPS), 36:27092–27112, 2023

Samir Yitzhak Gadre, Gabriel Ilharco, Alex Fang, Jonathan Hayase, Georgios Smyrnis, Thao Nguyen, Ryan Marten, Mitchell Wortsman, Dhruba Ghosh, Jieyu Zhang, et al. Datacomp: In search of the next generation of multimodal datasets.Advances in Neural Information Processing Systems (NeurIPS), 36:27092–27112, 2023

2023
[7]

Meta clip 2: A worldwide scaling recipe.arXiv preprint arXiv:2507.22062,

Yung-Sung Chuang, Yang Li, Dong Wang, Ching-Feng Yeh, Kehan Lyu, Ramya Raghavendra, James Glass, Lifei Huang, Jason Weston, Luke Zettlemoyer, et al. Meta clip 2: A worldwide scaling recipe.arXiv preprint arXiv:2507.22062, 2025

work page arXiv 2025
[8]

Scaling pre-training to one hundred billion data for vision language models.arXiv preprint arXiv:2502.07617, 2025

Xiao Wang, Ibrahim Alabdulmohsin, Daniel Salz, Zhe Li, Keran Rong, and Xiaohua Zhai. Scaling pre-training to one hundred billion data for vision language models.arXiv preprint arXiv:2502.07617, 2025

work page arXiv 2025
[9]

Vision-language models do not understand negation

Kumail Alhamoud, Shaden Alshammari, Yonglong Tian, Guohao Li, Philip HS Torr, Yoon Kim, and Marzyeh Ghassemi. Vision-language models do not understand negation. InProceedings of theIEEE/CVFConferenceonComputerVisionandPatternRecognition(CVPR),pages29612–29622, 2025

2025
[10]

Valse: Atask-independentbenchmarkforvisionandlanguagemodelscenteredonlinguistic phenomena

Letitia Parcalabescu, Michele Cafagna, Lilitta Muradjan, Anette Frank, Iacer Calixto, and Albert Gatt. Valse: Atask-independentbenchmarkforvisionandlanguagemodelscenteredonlinguistic phenomena. InAnnual Meeting of the Association for Computational Linguistics (ACL), pages 8253–8280, 2022

2022
[11]

Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations.arXiv preprint arXiv:2207.00221, 2022

Tiancheng Zhao, Tianqi Zhang, Mingwei Zhu, Haozhan Shen, Kyusong Lee, Xiaopeng Lu, and Jianwei Yin. Vl-checklist: Evaluating pre-trained vision-language models with objects, attributes and relations.arXiv preprint arXiv:2207.00221, 2022

work page arXiv 2022
[12]

Clip the bias: How useful is balancing data in multimodal learning?arXiv preprint arXiv:2403.04547, 2024

Ibrahim Alabdulmohsin, Xiao Wang, Andreas Steiner, Priya Goyal, Alexander D’Amour, and Xiaohua Zhai. Clip the bias: How useful is balancing data in multimodal learning?arXiv preprint arXiv:2403.04547, 2024

work page arXiv 2024
[13]

Winoground: Probing vision and language models for visio-linguistic compositionality

Tristan Thrush, Ryan Jiang, Max Bartolo, Amanpreet Singh, Adina Williams, Douwe Kiela, and Candace Ross. Winoground: Probing vision and language models for visio-linguistic compositionality. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 5238–5248, 2022

2022
[14]

Clip behaves like a bag-of-words model cross-modally but not uni-modally.arXiv preprint arXiv:2502.03566, 2025

Darina Koishigarina, Arnas Uselis, and Seong Joon Oh. Clip behaves like a bag-of-words model cross-modally but not uni-modally.arXiv preprint arXiv:2502.03566, 2025

work page arXiv 2025
[15]

Contrastive learning with hard negative samples,

Joshua Robinson, Ching-Yao Chuang, Suvrit Sra, and Stefanie Jegelka. Contrastive learning with hard negative samples.arXiv preprint arXiv:2010.04592, 2020

work page arXiv 2010
[16]

Contrasting intra-modal and ranking cross- modal hard negatives to enhance visio-linguistic compositional understanding

Le Zhang, Rabiul Awal, and Aishwarya Agrawal. Contrasting intra-modal and ranking cross- modal hard negatives to enhance visio-linguistic compositional understanding. InProceedings of theIEEE/CVFConferenceonComputerVisionandPatternRecognition(CVPR),pages13774–13784, 2024. 11 Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compo...

2024
[17]

spacy 2: Natural language understanding with bloom embeddings, convolu- tional neural networks and incremental parsing.(No Title), 2017

Matthew Honnibal. spacy 2: Natural language understanding with bloom embeddings, convolu- tional neural networks and incremental parsing.(No Title), 2017

2017
[18]

When and why vision-language models behave like bags-of-words, and what to do about it? arXiv preprint arXiv:2210.01936, 2022

Mert Yuksekgonul, Federico Bianchi, Pratyusha Kalluri, Dan Jurafsky, and James Zou. When and why vision-language models behave like bags-of-words, and what to do about it?arXiv preprint arXiv:2210.01936, 2022

work page arXiv 2022
[19]

Countercurate: Enhancing physical and semantic visio-linguistic compositional reasoning via counterfactual examples

Jianrui Zhang, Mu Cai, Tengyang Xie, and Yong Jae Lee. Countercurate: Enhancing physical and semantic visio-linguistic compositional reasoning via counterfactual examples. InFindings of the Association for Computational Linguistics (ACL), pages 15481–15495, 2024

2024
[20]

Improving clip training with language rewrites.Advances in Neural Information Processing Systems (NeurIPS), 36: 35544–35575, 2023

Lijie Fan, Dilip Krishnan, Phillip Isola, Dina Katabi, and Yonglong Tian. Improving clip training with language rewrites.Advances in Neural Information Processing Systems (NeurIPS), 36: 35544–35575, 2023

2023
[21]

Tripletclip: Improving compositional reasoning of clip via syntheticvision-languagenegatives.AdvancesinNeuralInformationProcessingSystems(NeurIPS), 37:32731–32760, 2024

Maitreya Patel, Naga Sai Abhiram Kusumba, Sheng Cheng, Changhoon Kim, Tejas Gokhale, Chitta Baral, and Yezhou Yang. Tripletclip: Improving compositional reasoning of clip via syntheticvision-languagenegatives.AdvancesinNeuralInformationProcessingSystems(NeurIPS), 37:32731–32760, 2024

2024
[22]

Literal and metaphorical sense identification through concrete and abstract context

Peter Turney, Yair Neuman, Dan Assaf, and Yohai Cohen. Literal and metaphorical sense identification through concrete and abstract context. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 680–690, 2011

2011
[23]

Psychology Press, 2013

Allan Paivio.Imagery and verbal processes. Psychology Press, 2013

2013
[24]

Quantifying the visual concreteness of words and topics in multimodal datasets

Jack Hessel, David Mimno, and Lillian Lee. Quantifying the visual concreteness of words and topics in multimodal datasets. InConference of the North American Chapter of the Association for Computational Linguistics (NAACL), pages 2194–2205, 2018

2018
[25]

Concreteness ratings for 40 thousand generally known english word lemmas.Behavior research methods, 46(3):904–911, 2014

Marc Brysbaert, Amy Beth Warriner, and Victor Kuperman. Concreteness ratings for 40 thousand generally known english word lemmas.Behavior research methods, 46(3):904–911, 2014

2014
[26]

Dense and aligned captions (dac) promote compositional reasoning in vl models.Advances in Neural Information Processing Systems (NeurIPS), 36:76137–76150, 2023

Sivan Doveh, Assaf Arbelle, Sivan Harary, Roei Herzig, Donghyun Kim, Paola Cascante-Bonilla, Amit Alfassy, Rameswar Panda, Raja Giryes, Rogerio Feris, et al. Dense and aligned captions (dac) promote compositional reasoning in vl models.Advances in Neural Information Processing Systems (NeurIPS), 36:76137–76150, 2023

2023
[27]

Revisit large-scale image-caption data in pre-training multimodal foundation models.arXiv preprint arXiv:2410.02740, 2024

Zhengfeng Lai, Vasileios Saveris, Chen Chen, Hong-You Chen, Haotian Zhang, Bowen Zhang, Juan Lao Tebar, Wenze Hu, Zhe Gan, Peter Grasch, et al. Revisit large-scale image-caption data in pre-training multimodal foundation models.arXiv preprint arXiv:2410.02740, 2024

work page arXiv 2024
[28]

Qwen3 Technical Report

An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, ChengenHuang, ChenxuLv, etal. Qwen3technicalreport.arXivpreprintarXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[29]

Adversarial diffusion distillation

Axel Sauer, Dominik Lorenz, Andreas Blattmann, and Robin Rombach. Adversarial diffusion distillation. InEuropean Conference on Computer Vision, pages 87–103. Springer, 2024

2024
[30]

Harnessing meta-learning for controllable full-frame video stabilization.arXiv preprint arXiv:2508.18859, 2025

Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, Tae Hyun Kim, Vivek Gupta, Haonan Luo, and Tianrui Li. Harnessing meta-learning for controllable full-frame video stabilization.arXiv preprint arXiv:2508.18859, 2025

work page arXiv 2025
[31]

Self-augmented visual contrastive decoding.arXiv preprint arXiv:2510.13315, 2025

Eun Woo Im, Muhammad Kashif Ali, and Vivek Gupta. Self-augmented visual contrastive decoding.arXiv preprint arXiv:2510.13315, 2025. 12 Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding

work page arXiv 2025
[32]

Representation Learning with Contrastive Predictive Coding

Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding.arXiv preprint arXiv:1807.03748, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[33]

Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979

William S Cleveland. Robust locally weighted regression and smoothing scatterplots.Journal of the American statistical association, 74(368):829–836, 1979

1979
[34]

Understanding contrastive representation learning through alignment and uniformity on the hypersphere

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. InInternational Conference on Machine Learning (ICML), pages 9929–9939. PMLR, 2020

2020
[35]

arXiv preprint arXiv:2010.06682 , year=

Tiffany Tianhui Cai, Jonathan Frankle, David J Schwab, and Ari S Morcos. Are all negatives created equal in contrastive instance discrimination?arXiv preprint arXiv:2010.06682, 2020

work page arXiv 2010
[36]

Sulla quantizzazione del gas perfetto monoatomico.Rendiconti Lincei, 145, 1926

Enrico Fermi. Sulla quantizzazione del gas perfetto monoatomico.Rendiconti Lincei, 145, 1926

1926
[37]

On the theory of quantum mechanics.Proceedings of the Royal Society of London

Paul Adrien Maurice Dirac. On the theory of quantum mechanics.Proceedings of the Royal Society of London. Series A, Containing Papers of a Mathematical and Physical Character, 112 (762):661–677, 1926

1926
[38]

The fermi–dirac distribution provides a calibrated probabilistic output for binary classifiers

Sung-Cheol Kim, Adith S Arun, Mehmet Eren Ahsen, Robert Vogel, and Gustavo Stolovitzky. The fermi–dirac distribution provides a calibrated probabilistic output for binary classifiers. Proceedings of the National Academy of Sciences, 118(34):e2100761118, 2021

2021
[39]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InProceedings of the European Conference on Computer Vision (ECCV), pages 740–755. Springer, 2014

2014
[40]

Deep visual-semantic alignments for generating image descrip- tions

Andrej Karpathy and Li Fei-Fei. Deep visual-semantic alignments for generating image descrip- tions. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3128–3137, 2015

2015
[41]

Sugarcrepe: Fixinghackablebenchmarksforvision-languagecompositionality.AdvancesinNeuralInformation Processing Systems (NeurIPS), 36:31096–31116, 2023

Cheng-Yu Hsieh, Jieyu Zhang, Zixian Ma, Aniruddha Kembhavi, and Ranjay Krishna. Sugarcrepe: Fixinghackablebenchmarksforvision-languagecompositionality.AdvancesinNeuralInformation Processing Systems (NeurIPS), 36:31096–31116, 2023

2023
[42]

Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations.Advances in Neural Information Processing Systems (NeurIPS), 37:17972–18018, 2024

Sri H Dumpala, Aman Jaiswal, Chandramouli Sastry, Evangelos Milios, Sageev Oore, and Hassan Sajjad. Sugarcrepe++ dataset: Vision-language model sensitivity to semantic and lexical alterations.Advances in Neural Information Processing Systems (NeurIPS), 37:17972–18018, 2024

2024
[43]

Imagenet large scale visual recognition challenge.International Journal of Computer Vision (IJCV), 115(3):211–252, 2015

Olga Russakovsky, Jia Deng, Hao Su, Jonathan Krause, Sanjeev Satheesh, Sean Ma, Zhiheng Huang, Andrej Karpathy, Aditya Khosla, Michael Bernstein, et al. Imagenet large scale visual recognition challenge.International Journal of Computer Vision (IJCV), 115(3):211–252, 2015

2015
[44]

Peter Young, Alice Lai, Micah Hodosh, and Julia Hockenmaier. From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions.Transactions of the association for computational linguistics, 2:67–78, 2014

2014
[45]

A large-scale study of representation learning with the visual task adaptation benchmark

Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark.arXiv preprint arXiv:1910.04867, 2019. 13 Concrete Jungle: Towards Concreteness Paved Contr...

work page arXiv 1910
[46]

Perception Encoder: The best visual embeddings are not at the output of the network

Daniel Bolya, Po-Yao Huang, Peize Sun, Jang Hyun Cho, Andrea Madotto, Chen Wei, Tengyu Ma, Jiale Zhi, Jathushan Rajasegaran, Hanoona Rasheed, et al. Perception encoder: The best visual embeddings are not at the output of the network.arXiv preprint arXiv:2504.13181, 2025

work page internal anchor Pith review arXiv 2025
[47]

Demystifying clip data.arXiv preprint arXiv:2309.16671, 2023

Hu Xu, Saining Xie, Xiaoqing Ellen Tan, Po-Yao Huang, Russell Howes, Vasu Sharma, Shang-Wen Li, Gargi Ghosh, Luke Zettlemoyer, and Christoph Feichtenhofer. Demystifying clip data.arXiv preprint arXiv:2309.16671, 2023

work page arXiv 2023
[48]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[49]

BERTScore: Evaluating Text Generation with BERT

Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q Weinberger, and Yoav Artzi. Bertscore: Evaluating text generation with bert.arXiv preprint arXiv:1904.09675, 2019

work page internal anchor Pith review arXiv 1904
[50]

RoBERTa: A Robustly Optimized BERT Pretraining Approach

Yinhan Liu, Myle Ott, Naman Goyal, Jingfei Du, Mandar Joshi, Danqi Chen, Omer Levy, Mike Lewis, Luke Zettlemoyer, and Veselin Stoyanov. Roberta: A robustly optimized bert pretraining approach.arXiv preprint arXiv:1907.11692, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907
[51]

Deep variational bayesian modeling of haze degradation process

Eun Woo Im, Junsung Shin, Sungyong Baik, and Tae Hyun Kim. Deep variational bayesian modeling of haze degradation process. InProceedings of the 32nd ACM International Conference on Information and Knowledge Management, pages 895–904, 2023

2023
[52]

Harnessing meta- learning for improving full-frame video stabilization

Muhammad Kashif Ali, Eun Woo Im, Dongjin Kim, and Tae Hyun Kim. Harnessing meta- learning for improving full-frame video stabilization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12605–12614, 2024

2024
[53]

Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding

Chaoyu Li, Eun Woo Im, and Pooyan Fazli. Vidhalluc: Evaluating temporal hallucinations in multimodal large language models for video understanding. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 13723–13733, 2025

2025
[54]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alab- dulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features.arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review arXiv 2025
[55]

Decoupled contrastive learning

Chun-Hsiao Yeh, Cheng-Yao Hong, Yen-Chi Hsu, Tyng-Luh Liu, Yubei Chen, and Yann LeCun. Decoupled contrastive learning. InProceedings of the European Conference on Computer Vision (ECCV), pages 668–684. Springer, 2022

2022
[56]

Clipscore: A reference-free evaluation metric for image captioning

Jack Hessel, Ari Holtzman, Maxwell Forbes, Ronan Le Bras, and Yejin Choi. Clipscore: A reference-free evaluation metric for image captioning. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 7514–7528, 2021

2021
[57]

Differential-informed sample selection accelerates multimodal contrastive learning

Zihua Zhao, Feng Hong, Mengxi Chen, Pengyi Chen, Benyuan Liu, Jiangchao Yao, Ya Zhang, and Yanfeng Wang. Differential-informed sample selection accelerates multimodal contrastive learning. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2930–2940, 2025

2025
[58]

Veclip: Improving clip training via visual-enriched captions

Zhengfeng Lai, Haotian Zhang, Bowen Zhang, Wentao Wu, Haoping Bai, Aleksei Timofeev, Xianzhi Du, Zhe Gan, Jiulong Shan, Chen-Nee Chuah, et al. Veclip: Improving clip training via visual-enriched captions. InProceedings of the European Conference on Computer Vision (ECCV), pages 111–127. Springer, 2024. 14 Concrete Jungle: Towards Concreteness Paved Contra...

2024
[59]

Provablestochasticoptimizationforglobalcontrastivelearning: Smallbatchdoesnotharm performance

Zhuoning Yuan, Yuexin Wu, Zi-Hao Qiu, Xianzhi Du, Lijun Zhang, Denny Zhou, and Tianbao Yang. Provablestochasticoptimizationforglobalcontrastivelearning: Smallbatchdoesnotharm performance. InInternational Conference on Machine Learning (ICML), pages 25760–25782. PMLR, 2022

2022
[60]

Disco-clip: A distributed contrastive loss for memory efficient clip training

Yihao Chen, Xianbiao Qi, Jianan Wang, and Lei Zhang. Disco-clip: A distributed contrastive loss for memory efficient clip training. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 22648–22657, 2023

2023
[61]

Know" no" better: A data-driven approach for enhancing negation awareness in clip

Junsung Park, Jungbeom Lee, Jongyoon Song, Sangwon Yu, Dahuin Jung, and Sungroh Yoon. Know" no" better: A data-driven approach for enhancing negation awareness in clip. InPro- ceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 2825–2835, 2025

2025
[62]

Teaching clip to count to ten

Roni Paiss, Ariel Ephrat, Omer Tov, Shiran Zada, Inbar Mosseri, Michal Irani, and Tali Dekel. Teaching clip to count to ten. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 3170–3180, 2023

2023
[63]

Nemo: Can multimodal llms identify attribute-modified objects?arXiv preprint arXiv:2411.17794, 2024

Jiaxuan Li, Junwen Mo, MinhDuc Vo, Akihiro Sugimoto, and Hideki Nakayama. Nemo: Can multimodal llms identify attribute-modified objects?arXiv preprint arXiv:2411.17794, 2024

work page arXiv 2024
[64]

What’s “up” with vision-language models? investigating their struggle with spatial reasoning

Amita Kamath, Jack Hessel, and Kai-Wei Chang. What’s “up” with vision-language models? investigating their struggle with spatial reasoning. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 9161–9175, 2023

2023
[65]

Spatial457: A diagnostic benchmark for 6d spatial reasoning of large mutimodal models

Xingrui Wang, Wufei Ma, Tiezheng Zhang, Celso M de Melo, Jieneng Chen, and Alan Yuille. Spatial457: A diagnostic benchmark for 6d spatial reasoning of large mutimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24669–24679, 2025

2025
[66]

Zixian Ma, Jerry Hong, Mustafa Omer Gul, Mona Gandhi, Irena Gao, and Ranjay Krishna. Crepe: Can vision-language foundation models reason compositionally? InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 10910–10921, 2023

2023
[67]

Vismin: Visual minimal-change understanding.Advances in Neural Information Processing Systems (NeurIPS), 37:107795– 107829, 2024

Rabiul Awal, Saba Ahmadi, Le Zhang, and Aishwarya Agrawal. Vismin: Visual minimal-change understanding.Advances in Neural Information Processing Systems (NeurIPS), 37:107795– 107829, 2024

2024
[68]

Explain before you answer: A survey on compositional visual reasoning.arXiv preprint arXiv:2508.17298, 2025

Fucai Ke, Joy Hsu, Zhixi Cai, Zixian Ma, Xin Zheng, Xindi Wu, Sukai Huang, Weiqing Wang, Pari Delir Haghighi, Gholamreza Haffari, et al. Explain before you answer: A survey on compositional visual reasoning.arXiv preprint arXiv:2508.17298, 2025

work page arXiv 2025
[69]

Chimera: Compositional image generation using part-based concepting, 2025.https://arxiv.org/abs/2510.18083

Shivam Singh, Yiming Chen, Agneet Chatterjee, Amit Raj, James Hays, Yezhou Yang, and Chitta Baral. Chimera: Compositional image generation using part-based concepting.arXiv preprint arXiv:2510.18083, 2025

work page arXiv 2025
[70]

Mass: Overcoming language bias in image-text matching

Jiwan Chung, Seungwon Lim, Sangkyu Lee, and Youngjae Yu. Mass: Overcoming language bias in image-text matching. InAssociation for the Advancement of Artificial Intelligence (AAAI), volume 39, pages 2591–2599, 2025

2025
[71]

Coarse-to-fine contrastive learning in image-text-graph space for improved vision- language compositionality

Harman Singh, Pengchuan Zhang, Qifan Wang, Mengjiao Wang, Wenhan Xiong, Jingfei Du, and Yu Chen. Coarse-to-fine contrastive learning in image-text-graph space for improved vision- language compositionality. InConference on Empirical Methods in Natural Language Processing (EMNLP), pages 869–893, 2023. 15 Concrete Jungle: Towards Concreteness Paved Contrast...

2023
[72]

Structure-clip: Towards scene graph knowledge to enhance multi-modal structured representations

Yufeng Huang, Jiji Tang, Zhuo Chen, Rongsheng Zhang, Xinfeng Zhang, Weijie Chen, Zeng Zhao, Zhou Zhao, Tangjie Lv, Zhipeng Hu, et al. Structure-clip: Towards scene graph knowledge to enhance multi-modal structured representations. InAssociation for the Advancement of Artificial Intelligence (AAAI), volume 38, pages 2417–2425, 2024

2024
[73]

A hat wearing a man

Amita Kamath, Cheng-Yu Hsieh, Kai-Wei Chang, and Ranjay Krishna. The hard positive truth about vision-language compositionality. InProceedings of the European Conference on Computer Vision (ECCV), pages 37–54. Springer, 2024. 16 Concrete Jungle: Towards Concreteness Paved Contrastive Negative Mining for Compositional Understanding Concrete Jungle: Towards...

2024