pith. sign in

arxiv: 2606.01710 · v1 · pith:MCLKZ7GCnew · submitted 2026-06-01 · 💻 cs.CV · cs.LG

Density-Aware Translation of Spurious Correlations in Zero-Shot VLMs

Pith reviewed 2026-06-28 14:58 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords zero-shot VLMsspurious correlationsdensity-aware translationCLIPmodality gapanisotropic embeddingscalibrationzero-shot classification
0
0 comments X

The pith

Density-aware translation mitigates spurious correlations in zero-shot VLMs by rescaling similarities using embedding density.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Density-Aware Translation to address how zero-shot VLMs like CLIP remain sensitive to spurious correlations from contextual cues. The method refines image-text similarity scores with a local geometric density term from group reference sets. It is based on the observation that CLIP embeddings show a modality gap and anisotropic shell, where common patterns cluster near the mean and rare ones are pushed out, unevenly amplifying spurious signals. DAT rescales similarities to suppress overconfident scores in diffuse areas while keeping dense semantic matches. Experiments show gains in worst-group and average accuracy, providing a calibration approach that avoids fine-tuning.

Core claim

Density-Aware Translation refines image-text similarity scores using a local geometric density term derived from group reference sets. This corrects the uneven alignment in CLIP embeddings, where the modality gap and anisotropic shell amplify spurious correlations while marginalizing rare but meaningful cues. The approach suppresses overconfident scores in low-density regions, leading to more reliable zero-shot classification.

What carries the argument

Density-Aware Translation (DAT), which applies a relative density-based rescaling to image-text similarities to counteract the effects of anisotropic embedding distribution.

If this is right

  • Improves worst-group accuracy on benchmark datasets
  • Also improves average accuracy
  • Serves as a calibration mechanism without requiring fine-tuning or prompt engineering
  • Preserves advantages of pre-trained multimodal models

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method could apply to other embedding-based models exhibiting similar geometric biases
  • Density measures might help identify which patterns are over-represented in training data
  • Combining DAT with existing debiasing techniques could yield additive benefits

Load-bearing premise

CLIP embeddings exhibit a modality gap and lie on an anisotropic shell, creating uneven alignment that amplifies spurious correlations.

What would settle it

Measuring the distribution of CLIP embeddings on a new dataset and finding it does not form an anisotropic shell with clustered common patterns, or observing that DAT does not improve accuracy on such data.

Figures

Figures reproduced from arXiv: 2606.01710 by Afsaneh Hasanebrahimi, Christopher Leckie, Hanxun Huang, Sarah Erfani.

Figure 1
Figure 1. Figure 1: Motivation for DAT. (a) CLIP embeddings exhibit anisotropic ellipsoidal structure, where frequent, spuriously correlated samples cluster near the mean, while rare but semantically meaningful samples lie in sparser regions. (b) Tangent-space Mahalanobis Distance for group embeddings shows that spuriously aligned LW has a lower distance, on average, to the waterbird prompt than the true WL group. Abbreviatio… view at source ↗
Figure 2
Figure 2. Figure 2: Pipeline of DAT/DAT*: DAT constructs group reference sets, encodes test images and prompts, estimates local density, and rescales similarities. DAT* follows the same pipeline but infers group attributes automatically, without explicit annotations. sphere to the tangent space at µˆy,a. Let z (h) y,a ∈ S d−1 be the h-th image embedding in group (y, a), and wy,a ∈ S d−1 be the normalised text embedding. Then … view at source ↗
Figure 3
Figure 3. Figure 3: Comparison of WG accuracy under varying (a) scaling factor, (b) number of samples per group, and (c) number of neighbours using Waterbirds and CLIP(ViT-L/14) [PITH_FULL_IMAGE:figures/full_fig_p008_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Aggregation ablation on Waterbirds. The full DAT aggregation outperforms using either corrected score alone. Waterbirds, as generated in Liang et al. (2022)) and averag￾ing their embeddings. Spurious attribute variants and corre￾sponding group prompts are provided in Tables 11 and 12 in Appendix D.1. This robustified representation is intended to better capture attribute variability and has been shown to i… view at source ↗
Figure 5
Figure 5. Figure 5: Density comparisons on Waterbirds with CLIP ViT-L/14. (a) For WL images (waterbird on land), raw cosine similarity to the WL vs LW prompts shows heavy overlap, and cosine alone cannot reliably separate the true group from its spuriously aligned counterpart. (b) SLOF values for the same WL images, measured against WL vs LW references; WL assigns a higher SLOF (sparser) than LW, revealing density asymmetry. … view at source ↗
Figure 6
Figure 6. Figure 6: Illustration of score distributions for the worst group in Waterbirds (Landbird in Water, LW) and its spuriously aligned counterpart (Waterbird in Water, WW). (a) Zero-shot cosine scores for both groups largely overlap, making discrimination difficult. (b) SLOF values show clearer separation, with LW having lower SLOF under its own group references. (c) DAT scores incorporate this density difference and sh… view at source ↗
Figure 7
Figure 7. Figure 7: Illustration of score distributions for Dark-Hair Male (DH-M) in CelebA, and its spuriously aligned counterpart (Blond-Hair Male, BH-M). (a) Zero-shot cosine scores for both groups largely overlap, making discrimination difficult. (b) SLOF values show clearer separation, with DH-M having lower SLOF under its own group references. (c) DAT scores incorporate this density difference and sharpen the separation… view at source ↗
read the original abstract

Vision-Language models (VLMs), such as CLIP, achieve powerful zero-shot classification. However, their predictions remain sensitive to spurious correlations, where contextual cues dominate over semantic content. Earlier solutions typically rely on fine-tuning or prompt engineering, which either undermine the advantages of pre-trained models or are prone to hallucination. In this work, we propose Density-Aware Translation (DAT) that refines image-text similarity scores using a local geometric density term derived from group reference sets. Our approach is motivated by the phenomenon that CLIP embeddings exhibit a modality gap and lie on an anisotropic shell in the feature space: common patterns cluster near the mean, while rare patterns are pushed outward. This geometry creates uneven alignment, where spurious correlations are amplified while semantically meaningful but rare cues are marginalised. To address this, we employ a relative measure to rescale similarities based on embedding density, suppressing overconfident scores in diffuse regions while preserving dense, semantically consistent matches. Experimental results on benchmark datasets demonstrate consistent improvements in worst-group and average accuracy, highlighting density-aware translation as a simple and effective calibration mechanism for reliable zero-shot classification using multimodal models.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes Density-Aware Translation (DAT), a post-hoc calibration for zero-shot VLMs such as CLIP. DAT rescales image-text similarity scores via a local geometric density term computed from group reference sets, motivated by the claim that CLIP embeddings occupy an anisotropic shell with a modality gap that amplifies spurious correlations. The central claim is that this yields consistent gains in both worst-group and average accuracy on benchmark datasets while remaining a simple, training-free procedure.

Significance. If DAT can be shown to operate without group-labeled data, the result would supply a lightweight, geometry-driven calibration that improves robustness of pre-trained multimodal models without fine-tuning or prompt engineering.

major comments (1)
  1. [Abstract] Abstract: The density term is stated to be 'derived from group reference sets,' yet the paper presents DAT as a zero-shot method. In the spurious-correlation setting, group reference sets are standardly formed from joint (class, spurious-attribute) labels. The manuscript provides no account of how these sets can be assembled from purely unlabeled or class-only data while still estimating the required local densities; this data requirement directly undercuts the headline zero-shot claim and is therefore load-bearing.
minor comments (1)
  1. [Abstract] Abstract: The claim of 'consistent improvements' is made without any numerical values, dataset names, or baseline comparisons; the full manuscript must supply these quantitative results for the experimental claim to be evaluable.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We appreciate the referee's careful reading and the opportunity to clarify the data requirements of our proposed method. We respond to the major comment below.

read point-by-point responses
  1. Referee: [Abstract] Abstract: The density term is stated to be 'derived from group reference sets,' yet the paper presents DAT as a zero-shot method. In the spurious-correlation setting, group reference sets are standardly formed from joint (class, spurious-attribute) labels. The manuscript provides no account of how these sets can be assembled from purely unlabeled or class-only data while still estimating the required local densities; this data requirement directly undercuts the headline zero-shot claim and is therefore load-bearing.

    Authors: We agree that the abstract's phrasing could be misinterpreted as implying a fully unsupervised or class-label-only setting. To address this, we will revise the abstract to explicitly state that DAT is a training-free post-hoc calibration technique that utilizes group reference sets constructed from joint (class, spurious attribute) labels. This is consistent with standard evaluation protocols in the spurious correlation literature, where such labels are available for defining groups but no model training occurs. The 'zero-shot' designation in the title and abstract emphasizes the lack of fine-tuning on the target task, distinguishing our approach from methods that require gradient updates or prompt optimization. We will also expand the method section to include a discussion on the minimal data needed for reference sets and explore whether class-only labels suffice for density estimation in some cases, though empirical results suggest group labels improve performance. This clarification will strengthen rather than undercut the contribution by making the assumptions transparent. revision: yes

Circularity Check

0 steps flagged

No circularity; derivation introduces independent density rescaling

full rationale

The abstract describes DAT as refining similarity scores via a local geometric density term computed from group reference sets, motivated by the observed anisotropic shell geometry of CLIP embeddings. No equations, self-citations, or steps are exhibited that define the output similarity or accuracy gain as equivalent to the input density by construction, nor is any fitted parameter relabeled as a prediction. The method adds an explicit rescaling step whose grounding is the external geometric observation rather than tautological re-use of the target result. Concerns about group labels versus zero-shot status are assumption or data-requirement issues, not reductions of the claimed derivation to its own inputs.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

Abstract-only review; full paper may reveal more parameters or assumptions in the density calculation.

free parameters (1)
  • density scaling factor
    The local geometric density term likely involves at least one tunable parameter for rescaling, though not detailed in abstract.
axioms (1)
  • domain assumption CLIP embeddings lie on an anisotropic shell with common patterns clustering near the mean
    Invoked as the geometric motivation for the approach in the abstract.

pith-pipeline@v0.9.1-grok · 5743 in / 1017 out tokens · 28041 ms · 2026-06-28T14:58:51.157157+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

63 extracted references · 4 canonical work pages · 2 internal anchors

  1. [1]

    Proceedings of the International Conference on Machine Learning (ICML) , year=

    Koh, Pang Wei and Sagawa, Shiori and Marklund, Henrik and Xie, Sang Michael and Zhang, Marvin and Balsubramani, Akshay and Hu, Weihua and Yasunaga, Michihiro and Phillips, Richard and Gao, Irena and others , pages=. Proceedings of the International Conference on Machine Learning (ICML) , year=

  2. [2]

    Proceedings of the International Conference on Learning Representations (ICLR) , year=

    Distributionally robust neural networks for group shifts: On the importance of regularization for worst-case generalization , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

  3. [3]

    Proceedings of the IEEE International Conference on Computer Vision (ICCV) , pages=

    Deep learning face attributes in the wild , author=. Proceedings of the IEEE International Conference on Computer Vision (ICCV) , pages=

  4. [4]

    Skin lesion analysis toward Melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (

    Codella, Noel C F and Rotemberg, Veronica and Tschandl, Philipp and Celebi, M Emre and Dusza, Stephen and Gutman, David and Helba, Brian and Kalloo, Aadi and Liopyris, Konstantinos and Marchetti, Michael and others , journal=. Skin lesion analysis toward Melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (

  5. [5]

    Cohen, Joseph Paul and Morrison, Paul and Dao, Lan and Roth, Karsten and Duong, Tim and Ghassemi, Marzyeh and others , journal=

  6. [6]

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

    Functional map of the world , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

  7. [7]

    Proceedings of the International Conference on Machine Learning (ICML) , pages=

    What are Bayesian neural network posteriors really like? , author=. Proceedings of the International Conference on Machine Learning (ICML) , pages=

  8. [8]

    Proceedings of the International Conference on Machine Learning (ICML) , pages =

    Environment inference for invariant learning , author=. Proceedings of the International Conference on Machine Learning (ICML) , pages =

  9. [9]

    Proceedings of the IEEE International Conference on Computer Vision (ICCV) , pages=

    Linear spaces of meanings: compositional structures in vision-language models , author=. Proceedings of the IEEE International Conference on Computer Vision (ICCV) , pages=

  10. [10]

    Perception

    Bang An and Sicheng Zhu and Michael-Andrei Panaitescu-Liess and Chaithanya Kumar Mummadi and Furong Huang , booktitle=. Perception

  11. [11]

    Debiasing vision- language models via biased prompts.arXiv preprint arXiv:2302.00070, 2023

    Debiasing vision-language models via biased prompts , author=. arXiv preprint arXiv:2302.00070 , year=

  12. [12]

    Proceedings of the International Conference on Learning Representations (ICLR) , year=

    Zero-shot robustification of zero-shot models , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

  13. [13]

    Proceedings of the International Conference on Machine Learning (ICML) , pages=

    Learning transferable visual models from natural language supervision , author=. Proceedings of the International Conference on Machine Learning (ICML) , pages=

  14. [14]

    Understanding transferable representation learning and zero-shot transfer in

    Chen, Zixiang and Deng, Yihe and Li, Yuanzhi and Gu, Quanquan , booktitle=. Understanding transferable representation learning and zero-shot transfer in

  15. [15]

    Subramanian, Sanjay and Merrill, William and Darrell, Trevor and Gardner, Matt and Singh, Sameer and Rohrbach, Anna , booktitle=

  16. [16]

    Sain, Aneeshan and Bhunia, Ayan Kumar and Chowdhury, Pinaki Nath and Koley, Subhadeep and Xiang, Tao and Song, Yi-Zhe , booktitle=

  17. [17]

    Online zero-shot classification with

    Qian, Qi and Hu, Juhua , booktitle=. Online zero-shot classification with

  18. [18]

    PLoS Medicine , volume=

    Variable generalization performance of a deep learning model to detect pneumonia in chest radiographs: A cross-sectional study , author=. PLoS Medicine , volume=. 2018 , publisher=

  19. [19]

    and Janizek, Joseph D

    DeGrave, Alex J. and Janizek, Joseph D. and Lee, Su-In , journal=. 2021 , publisher=

  20. [20]

    On the Opportunities and Risks of Foundation Models

    On the opportunities and risks of foundation models , author=. arXiv preprint arXiv:2108.07258 , year=

  21. [21]

    The double-ellipsoid geometry of

    Levi, Meir Yossef and Gilboa, Guy , booktitle=. The double-ellipsoid geometry of

  22. [22]

    Findings of the Association for Computational Linguistics: EMNLP 2025

    Evaluating fairness in large vision-language models across diverse demographic attributes and prompts , author=. Findings of the Association for Computational Linguistics: EMNLP 2025

  23. [23]

    Abbasi, Reza and Nazari, Ali and Sefid, Aminreza and Banayeeanzade, Mohammadali and Rohban, Mohammad Hossein and Baghshah, Mahdieh Soleymani , booktitle=

  24. [24]

    A survey on hallucination in Large Language Models: Principles, taxonomy, challenges, and open questions , volume=

    Huang, Lei and Yu, Weijiang and Ma, Weitao and Zhong, Weihong and Feng, Zhangyin and Wang, Haotian and Chen, Qianglong and Peng, Weihua and Feng, Xiaocheng and Qin, Bing and Liu, Ting , year=. A survey on hallucination in Large Language Models: Principles, taxonomy, challenges, and open questions , volume=. ACM Transactions on Information Systems , publisher=

  25. [25]

    Molahasani, Mahdiyar and Motamedi, Azadeh and Greenspan, Michael and Kim, Il-Min and Etemad, Ali , booktitle=

  26. [26]

    Proceedings of the International Conference on Machine Learning (ICML) , pages=

    Mitigating spurious correlations in multi-modal models during fine-tuning , author=. Proceedings of the International Conference on Machine Learning (ICML) , pages=

  27. [27]

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

    Finetune like you pretrain: Improved finetuning of zero-shot vision models , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

  28. [28]

    Varma, Maya and Delbrouck, Jean-Benoit and Chen, Zhihong and Chaudhari, Akshay and Langlotz, Curtis , booktitle=

  29. [29]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Contrastive adapters for foundation model group robustness , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  30. [30]

    Proceedings of the International Conference on Learning Representations (ICLR) , year=

    Mitigating spurious correlations in zero-shot multimodal models , author=. Proceedings of the International Conference on Learning Representations (ICLR) , year=

  31. [31]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  32. [32]

    Clustering on the unit hypersphere using von

    Banerjee, Arindam and Dhillon, Inderjit S and Ghosh, Joydeep and Sra, Suvrit and Ridgeway, Greg , journal=. Clustering on the unit hypersphere using von

  33. [33]

    Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

    Improving zero-shot generalization and robustness of multi-modal models , author=. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) , pages=

  34. [34]

    Invariant Risk Minimization

    Invariant risk minimization , author=. arXiv preprint arXiv:1907.02893 , year=

  35. [35]

    Proceedings of the International Conference on Machine Learning (ICML) , pages=

    Amend to alignment: Decoupled prompt tuning for mitigating spurious correlation in vision-language models , author=. Proceedings of the International Conference on Machine Learning (ICML) , pages=

  36. [36]

    Proceedings of the AAAI Conference on Artificial Intelligence , pages=

    Spurious feature eraser: Stabilizing test-time adaptation for vision-language foundation model , author=. Proceedings of the AAAI Conference on Artificial Intelligence , pages=

  37. [37]

    Sepehr Dehdashtian and Lan Wang and Vishnu Boddeti , booktitle=. Fairer

  38. [38]

    arXiv preprint arXiv:2403.07888 , year=

    Cross-modality debiasing: using language to mitigate sub-population shifts in imaging , author=. arXiv preprint arXiv:2403.07888 , year=

  39. [39]

    Nashed and Sabyasachi Sahoo and Liam Paull , year=

    Mahtab Sandhu and Yann Batiste Pequignot and Samer B. Nashed and Sabyasachi Sahoo and Liam Paull , year=

  40. [40]

    Proceedings of the ACM SIGMOD International Conference on Management of Data , pages=

    Breunig, Markus M and Kriegel, Hans-Peter and Ng, Raymond T and Sander, J. Proceedings of the ACM SIGMOD International Conference on Management of Data , pages=

  41. [41]

    Pacific-Asia Conference on Knowledge Discovery and Data Mining , pages=

    Ranking outliers using symmetric neighborhood relationship , author=. Pacific-Asia Conference on Knowledge Discovery and Data Mining , pages=. 2006 , organization=

  42. [42]

    Data Mining and Knowledge Discovery , volume=

    On the evaluation of unsupervised outlier detection: measures, datasets, and an empirical study , author=. Data Mining and Knowledge Discovery , volume=. 2016 , publisher=

  43. [43]

    Statistical Analysis and Data Mining: The ASA Data Science Journal , volume=

    A survey on unsupervised outlier detection in high-dimensional numerical data , author=. Statistical Analysis and Data Mining: The ASA Data Science Journal , volume=. 2012 , publisher=

  44. [44]

    The Annals of Mathematical Statistics , volume=

    A nonparametric estimate of a multivariate density function , author=. The Annals of Mathematical Statistics , volume=. 1965 , publisher=

  45. [45]

    2015 , publisher=

    Lectures on the nearest neighbor method , author=. 2015 , publisher=

  46. [46]

    Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) , pages =

    Super-samples from kernel herding , author=. Proceedings of the Conference on Uncertainty in Artificial Intelligence (UAI) , pages =

  47. [47]

    Data Mining and Knowledge Discovery , volume=

    Local outlier detection reconsidered: a generalized view on locality with applications to spatial, video, and network outlier detection , author=. Data Mining and Knowledge Discovery , volume=. 2014 , publisher=

  48. [48]

    Kent, John T , journal=. The. 1982 , publisher=

  49. [49]

    2009 , publisher=

    Directional statistics , author=. 2009 , publisher=

  50. [50]

    A short note on parameter approximation for von

    Sra, Suvrit , journal=. A short note on parameter approximation for von. 2012 , publisher=

  51. [51]

    Saddlepoint approximations for the Bingham and

    Kume, Alfred and Wood, Andrew TA , journal=. Saddlepoint approximations for the Bingham and. 2005 , publisher=

  52. [52]

    Zhang, Sheng and Xu, Yanbo and Usuyama, Naoto and Xu, Hanwen and Bagga, Jaspreet and Tinn, Robert and Preston, Sam and Rao, Rajesh and Wei, Mu and Valluri, Naveen and others , journal=

  53. [53]

    2023 , howpublished =

    OpenAI , title =. 2023 , howpublished =

  54. [54]

    Rebuffi, Sylvestre-Alvise and Kolesnikov, Alexander and Sperl, Georg and Lampert, Christoph H , booktitle=

  55. [55]

    Intrinsic statistics on

    Pennec, Xavier , journal=. Intrinsic statistics on. 2006 , publisher=

  56. [56]

    Proceedings of the International Conference on Machine Learning (ICML) , pages =

    Discover and cure: Concept-aware mitigation of spurious correlation , author=. Proceedings of the International Conference on Machine Learning (ICML) , pages =

  57. [57]

    Fellbaum, Christiane , year=

  58. [58]

    Advances in Neural Information Processing Systems (NeurIPS) , volume=

    Maximum likelihood estimation of intrinsic dimension , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

  59. [59]

    Data Mining and Knowledge Discovery , volume=

    Extreme-value-theoretic estimation of local intrinsic dimensionality , author=. Data Mining and Knowledge Discovery , volume=. 2018 , publisher=

  60. [60]

    Proceedings of the SIAM International Conference on Data Mining (SDM) , pages=

    Dimensionality-aware outlier detection , author=. Proceedings of the SIAM International Conference on Data Mining (SDM) , pages=. 2024 , organization=

  61. [61]

    ACM Transactions on Intelligent Systems and Technology (TIST) , volume=

    A survey of zero-shot learning: Settings, methods, and applications , author=. ACM Transactions on Intelligent Systems and Technology (TIST) , volume=. 2019 , publisher=

  62. [62]

    Advances in Neural Information Processing Systems (NeurIPS) , volume=

    Zero-shot learning with semantic output codes , author=. Advances in Neural Information Processing Systems (NeurIPS) , volume=

  63. [63]

    IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=

    Attribute-based classification for zero-shot visual object categorization , author=. IEEE Transactions on Pattern Analysis and Machine Intelligence , volume=. 2013 , publisher=