pith. machine review for the scientific record.

arxiv: 2604.17126 · v1 · submitted 2026-04-18 · 💻 cs.CV


Prompt Sensitivity in Vision-Language Grounding: How Small Changes in Wording Affect Object Detection

Amit Sethi, Dawar Jyoti Deka, Syed Mohammad Ali


Pith reviewed 2026-05-10 06:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords prompt sensitivity · vision-language grounding · object detection · CLIP · DETR · instability · open-vocabulary · semantic equivalence

The pith

Vision-language grounding selects different objects for semantically similar prompts because the argmax step overrides text embedding proximity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the assumption that natural language queries describing the same object will produce consistent detections in vision-language models. Experiments combine DETR object proposals with CLIP scoring on 263 COCO images and show that prompts such as 'a person,' 'a human,' and 'a pedestrian' frequently pick different instances, averaging 2.11 distinct selections. Principal component analysis indicates the shifts follow structured, directional patterns rather than random noise. Text embedding distance correlates only modestly with these disagreements (r = -0.58), accounting for 34 percent of the variance and pointing to the final selection rule as the dominant source. This finding matters because open-vocabulary detection systems are expected to behave reliably under minor rephrasings of user queries.

Core claim

In a controlled DETR-plus-CLIP pipeline on COCO val2017 images, overlapping prompts select different object instances with a mean instability of 2.11 distinct selections across six variants. PCA reveals that the variability is structured and directional. Prompt ensembling does not reduce inconsistency and often shifts selections toward generic regions. Text embedding proximity explains only 34 percent of grounding disagreement (r = -0.58), confirming that instability originates primarily from the argmax selection mechanism rather than from differences at the text embedding level.

What carries the argument

The argmax operation that picks the DETR proposal with the highest CLIP similarity score for a given text embedding.
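This selection step is simple enough to sketch. The following is a minimal Python rendering, not the authors' code; the proposal embeddings and text vectors are hypothetical, chosen to show how a hard argmax over CLIP similarities can flip between proposals under a small change in the text embedding.

```python
import numpy as np

def select_instance(proposal_embeds: np.ndarray, text_embed: np.ndarray) -> int:
    """Pick the proposal whose image embedding is most similar to the
    text embedding -- the argmax step the paper analyzes.

    proposal_embeds: (N, D) embeddings of DETR box crops, one per proposal.
    text_embed: (D,) embedding of the prompt.
    """
    # Cosine similarity: normalize both sides, then take dot products.
    p = proposal_embeds / np.linalg.norm(proposal_embeds, axis=1, keepdims=True)
    t = text_embed / np.linalg.norm(text_embed)
    scores = p @ t                   # (N,) similarity per proposal
    return int(np.argmax(scores))    # hard selection: one winner, no margin

# Two nearby text embeddings (standing in for two synonymous prompts)
# flip the winner when the top two proposals are close in score.
props = np.array([[1.0, 0.1], [0.9, 0.5]])
print(select_instance(props, np.array([1.0, 0.0])))  # → 0
print(select_instance(props, np.array([0.8, 0.3])))  # → 1
```

The discontinuity is the point: the output is an index, so an arbitrarily small score perturbation can move the selection to a different object.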

If this is right

  • Semantically overlapping prompts produce inconsistent instance selections in open-vocabulary detection.
  • Averaging or ensembling multiple prompts tends to increase rather than decrease selection instability.
  • Most of the observed disagreement cannot be reduced by making text embeddings more similar.
  • The pattern of changes across prompts is systematic rather than random noise.
  • Current evaluation protocols that ignore prompt variation may overestimate model reliability.
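The instability number reported above is a count of distinct winners per image. A sketch, with hypothetical per-image selections standing in for the paper's data:

```python
import numpy as np

def instability(selections: list[int]) -> int:
    """Number of distinct instances chosen across prompt variants.

    A value of 1 means all prompts agree; the paper reports a mean of
    2.11 distinct selections over six variants per image.
    """
    return len(set(selections))

# Hypothetical selections for six prompt variants
# ('a person', 'a human', 'a pedestrian', ...): instance indices per image.
per_image = [
    [3, 3, 7, 3, 7, 3],   # two distinct instances chosen
    [1, 1, 1, 1, 1, 1],   # fully consistent
    [0, 4, 4, 2, 0, 4],   # three distinct instances
]
scores = [instability(s) for s in per_image]
print(scores, np.mean(scores))  # → [2, 1, 3] 2.0
```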

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection sensitivity may appear in other two-stage grounding architectures that separate proposal generation from language scoring.
  • Robustness to prompt variation could become a standard evaluation axis alongside accuracy for vision-language systems.
  • Alternative selection strategies that incorporate uncertainty or ensemble at the proposal level might mitigate the effect.
  • Developers may need prompt normalization layers or post-selection consistency checks when deploying these models.

Load-bearing premise

The tested prompts count as semantically equivalent and the DETR-CLIP pipeline on COCO images captures general behavior of vision-language grounding.

What would settle it

Replacing the argmax selection with a softer rule such as softmax-weighted averaging over the same proposals and observing that prompt-induced disagreement drops sharply while text embeddings remain unchanged would falsify the claim that the selection mechanism is the main cause.
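A sketch of that softer rule (assumed details: a temperature parameter `tau` and box-coordinate averaging; the paper does not specify an implementation):

```python
import numpy as np

def soft_select(boxes: np.ndarray, scores: np.ndarray, tau: float = 0.05) -> np.ndarray:
    """Softmax-weighted box average in place of a hard argmax.

    boxes: (N, 4) proposal boxes; scores: (N,) similarity scores.
    Small tau approaches argmax; larger tau blends nearby candidates, so a
    small score perturbation moves the output continuously instead of
    flipping it to a different instance.
    """
    w = np.exp((scores - scores.max()) / tau)   # numerically stable softmax
    w /= w.sum()
    return w @ boxes                            # (4,) weighted box

boxes = np.array([[10., 10., 50., 50.],
                  [12., 11., 52., 49.]])
# Two prompt variants give slightly different scores for the same proposals;
# the hard argmax would flip, the soft output shifts only slightly.
b1 = soft_select(boxes, np.array([0.31, 0.30]))
b2 = soft_select(boxes, np.array([0.30, 0.31]))
print(np.abs(b1 - b2).max())  # a small continuous shift, not a discrete jump
```

If prompt-induced disagreement collapses under such a rule while the text embeddings are untouched, the argmax attribution survives the test.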

Figures

Figures reproduced from arXiv: 2604.17126 by Amit Sethi, Dawar Jyoti Deka, Syed Mohammad Ali.

Figure 1. Prompt-dependent grounding (example 1).
Figure 2. Prompt-dependent grounding (example 2).
Figure 3. Instability distribution over 263 images.
Figure 5. PCA projection of CLIP similarity score vectors.
Figure 6. Mean vs. variance of CLIP similarity scores.
Figure 7. Ensembling failure (example 1).
Figure 8. Ensembling failure (example 2).
Figure 9. Text embedding cosine similarity vs. grounding disagreement.
Original abstract

Vision-language models enable open-vocabulary object grounding through natural language queries, under the implicit assumption that semantically equivalent descriptions yield consistent outputs. We examine this assumption using a controlled pipeline combining DETR for object proposals with CLIP for language-conditioned selection on 263 COCO val2017 images. We find that overlapping prompts such as "a person," "a human," and "a pedestrian" frequently select different instances, with mean instability of 2.11 distinct selections across six prompts. PCA analysis shows this variability is structured and directional, not random. Prompt ensembling does not improve quality and often shifts selections toward generic regions. We further show that text embedding proximity explains only 34% of grounding disagreement (r = -0.58), confirming that instability arises from the argmax selection mechanism rather than text-level distances alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper examines prompt sensitivity in vision-language grounding by combining DETR object proposals with CLIP-based selection on 263 COCO val2017 images. It reports that semantically overlapping prompts (e.g., 'a person', 'a human', 'a pedestrian') frequently yield different instance selections, with mean instability of 2.11 distinct outputs across six prompts. PCA reveals structured rather than random variability; prompt ensembling does not improve grounding quality; and text embedding proximity explains only 34% of disagreement (r = -0.58), leading to the conclusion that instability originates in the argmax selection mechanism rather than text-level distances.

Significance. If the central empirical findings hold after addressing the equivalence assumption, the work is significant for highlighting a concrete fragility in open-vocabulary detection pipelines. The controlled setup with explicit metrics (instability count, correlation, PCA structure) provides a reproducible starting point for studying selection robustness, which has direct implications for reliability in applications such as robotics and image retrieval. The partial decoupling of text distance from output variance is a useful observation that could motivate new selection or ensembling strategies.

major comments (2)
  1. [Abstract] Abstract: The attribution of residual disagreement (66%) to argmax fragility requires the six prompts to be treated as semantically equivalent, yet no justification or ablation is provided. 'A pedestrian' introduces locomotion and path constraints absent from 'a person' or 'a human'; if CLIP embeddings encode these distinctions, the observed selection shifts and PCA directions can be explained by legitimate semantic differences without invoking mechanism instability. An ablation separating equivalent versus non-equivalent prompt sets is needed to support the causal claim.
  2. [Abstract] Abstract and results: The reported correlation r = -0.58 and 34% explained variance lack error bars, confidence intervals, or p-values, and the PCA structure is presented without details on component count, total variance captured, or robustness checks (e.g., permutation tests). These omissions weaken the quantitative support for the claim that text proximity is insufficient to explain the instability.
minor comments (2)
  1. [Abstract] The abstract states 'overlapping prompts such as...' but does not enumerate the full set of six prompts or their exact wording, making it difficult to assess semantic overlap independently.
  2. [Methods] The dataset construction (263 images from COCO val2017) omits explicit exclusion criteria or image selection protocol, which should be stated for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which have helped us identify areas for improvement in our manuscript. We address each major comment below and outline the revisions we plan to make.

Point-by-point responses
  1. Referee: [Abstract] Abstract: The attribution of residual disagreement (66%) to argmax fragility requires the six prompts to be treated as semantically equivalent, yet no justification or ablation is provided. 'A pedestrian' introduces locomotion and path constraints absent from 'a person' or 'a human'; if CLIP embeddings encode these distinctions, the observed selection shifts and PCA directions can be explained by legitimate semantic differences without invoking mechanism instability. An ablation separating equivalent versus non-equivalent prompt sets is needed to support the causal claim.

    Authors: We recognize the validity of this concern. While our prompts were selected as common synonyms for the same object category in detection tasks, subtle semantic nuances may exist. To strengthen our claim, we will add an ablation in the revised manuscript that groups prompts into highly equivalent sets (e.g., 'person' and 'human') and those with potential distinctions (including 'pedestrian'), reporting instability and PCA results separately for each group. This will help isolate whether the argmax mechanism contributes to instability even for near-equivalent prompts. We will also clarify in the text that our conclusion is based on the observed low correlation with embedding distances, suggesting additional factors in the selection process. revision: yes

  2. Referee: [Abstract] Abstract and results: The reported correlation r = -0.58 and 34% explained variance lack error bars, confidence intervals, or p-values, and the PCA structure is presented without details on component count, total variance captured, or robustness checks (e.g., permutation tests). These omissions weaken the quantitative support for the claim that text proximity is insufficient to explain the instability.

    Authors: We agree that additional statistical details are necessary to support the quantitative claims. In the revised version, we will augment the results section with error bars and confidence intervals for the correlation value, along with the corresponding p-value. For the PCA analysis, we will report the number of principal components considered, the cumulative variance explained, and include a permutation test or similar robustness check to demonstrate that the observed structure is statistically significant and not attributable to random variation. revision: yes
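One way such a permutation test could look (a sketch under assumed details, namely column-wise shuffling and a PC1 variance-ratio statistic; the authors do not specify their procedure):

```python
import numpy as np

def pc1_variance_ratio(X: np.ndarray) -> float:
    """Fraction of total variance captured by the first principal component."""
    Xc = X - X.mean(axis=0)
    # Singular values of the centered matrix give the component variances.
    s = np.linalg.svd(Xc, compute_uv=False)
    return float(s[0]**2 / (s**2).sum())

def permutation_pvalue(X: np.ndarray, n_perm: int = 500, seed: int = 0) -> float:
    """P-value for 'PC1 captures more variance than chance': shuffle each
    column independently to destroy cross-prompt structure, then compare."""
    rng = np.random.default_rng(seed)
    observed = pc1_variance_ratio(X)
    null = [pc1_variance_ratio(
                np.column_stack([rng.permutation(X[:, j]) for j in range(X.shape[1])]))
            for _ in range(n_perm)]
    return float((np.sum(np.array(null) >= observed) + 1) / (n_perm + 1))

# Synthetic score vectors (images x prompts) with a shared direction,
# standing in for the paper's CLIP similarity vectors.
rng = np.random.default_rng(1)
shared = rng.normal(size=(263, 1))
X = shared @ rng.normal(size=(1, 6)) + 0.3 * rng.normal(size=(263, 6))
print(permutation_pvalue(X))  # small p-value: the structure is not random
```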

Circularity Check

0 steps flagged

No circularity: purely empirical measurements with no derivations or self-referential reductions

Full rationale

The paper conducts a controlled empirical study on 263 COCO images using a fixed DETR+CLIP pipeline. It directly measures selection instability across six prompts (mean 2.11 distinct selections), applies PCA to the observed variability, and computes a Pearson correlation (r = -0.58) between text embedding distances and grounding disagreement, reporting that this explains 34% of variance. None of these steps involve equations, parameter fitting, predictions derived from fitted inputs, self-citations of uniqueness theorems, or ansatzes smuggled via prior work. All reported quantities are direct experimental outputs independent of the target claims, so the derivation chain contains no reductions by construction.
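The 34 percent figure is simply the squared Pearson correlation. A sketch with synthetic stand-in data (the variables below are hypothetical, not the paper's measurements):

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical stand-ins: cosine similarity between prompt embeddings,
# and per-pair grounding disagreement that partly tracks it plus noise.
text_sim = rng.uniform(0.7, 1.0, size=200)
disagreement = 1.2 - text_sim + rng.normal(0, 0.08, size=200)

r = np.corrcoef(text_sim, disagreement)[0, 1]   # Pearson correlation
print(round(r, 2), round(r**2, 2))              # r and variance explained

# The paper's numbers relate the same way: r = -0.58 implies
# r**2 ≈ 0.34, i.e. text proximity explains 34% of the variance.
print(round((-0.58)**2, 2))  # → 0.34
```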

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical study using standard models and dataset with no fitted parameters or new entities; rests on the domain assumption of semantic equivalence.

axioms (1)
  • domain assumption Semantically equivalent natural language descriptions should yield consistent object grounding outputs
    This is the core implicit assumption the paper tests and finds violated.

pith-pipeline@v0.9.0 · 5444 in / 1125 out tokens · 40614 ms · 2026-05-10T06:29:01.435454+00:00 · methodology


Reference graph

Works this paper leans on

25 extracted references · 3 canonical work pages · 1 internal anchor

  1. [1]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. 2021. Learning transferable visual models from natural language supervision. arXiv:2103.00020 (2021)

  2. [2]

    Nicolas Carion, Francisco Massa, Gabriel Synnaeve, Nicolas Usunier, Alexander Kirillov, and Sergey Zagoruyko. 2020. End-to-end object detection with transformers. In ECCV

  3. [3]

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. 2017. Attention is all you need. In NeurIPS

  4. [4]

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C. Lawrence Zitnick. 2014. Microsoft COCO: Common objects in context. In ECCV

  5. [5]

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. 2021. An image is worth 16x16 words: Transformers for image recognition at scale. In ICLR

  6. [6]

    Kaiming He, Georgia Gkioxari, Piotr Dollár, and Ross Girshick. 2017. Mask R-CNN. In ICCV

  7. [7]

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. 2015. Faster R-CNN: Towards real-time object detection with region proposal networks. In NeurIPS

  8. [8]

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. 2009. ImageNet: A large-scale hierarchical image database. In CVPR

  9. [9]

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. 2020. A simple framework for contrastive learning of visual representations. In ICML

  10. [10]

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. 2020. Momentum contrast for unsupervised visual representation learning. In CVPR

  11. [11]

    Aishwarya Kamath, Mannat Singh, Yann LeCun, Gabriel Synnaeve, Ishan Misra, and Nicolas Carion. 2021. MDETR: Modulated detection for end-to-end multimodal understanding. In ICCV

  12. [12]

    Liunian Harold Li, Pengchuan Zhang, Haotian Zhang, Jianwei Yang, Chunyuan Li, Yiwu Zhong, Lijuan Wang, Lu Yuan, Lei Zhang, Jenq-Neng Hwang, Kai-Wei Chang, and Jianfeng Gao. 2022. Grounded language-image pre-training. In CVPR

  13. [13]

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Chunyuan Li, Jianwei Yang, Hang Su, Jun Zhu, and Lei Zhang. 2023. Grounding DINO: Marrying DINO with grounded pre-training for open-set object detection. arXiv:2303.05499 (2023)

  14. [14]

    Matthias Minderer, Alexey Gritsenko, Austin Stone, Maxim Neumann, Dirk Weissenborn, Alexey Dosovitskiy, Aravindh Mahendran, Anurag Arnab, Mostafa Dehghani, Zhuoran Shen, Xiao Wang, Xiaohua Zhai, Thomas Kipf, and Neil Houlsby. 2022. Simple open-vocabulary object detection with vision transformers. In ECCV

  15. [15]

    Alina Kuznetsova, Hassan Rom, Neil Alldrin, Jasper Uijlings, Ivan Krasin, Jordi Pont-Tuset, Shahab Kamali, Stefan Popov, Matteo Malloci, Alexander Kolesnikov, Tom Duerig, and Vittorio Ferrari. 2020. The Open Images dataset V4. IJCV 128 (2020), 1956–1981

  16. [16]

    Jiasen Lu, Dhruv Batra, Devi Parikh, and Stefan Lee. 2019. ViLBERT: Pretraining task-agnostic visiolinguistic representations for vision-and-language tasks. In NeurIPS

  17. [17]

    Hao Tan and Mohit Bansal. 2019. LXMERT: Learning cross-modality encoder representations from transformers. In EMNLP

  18. [18]

    Liunian Harold Li, Mark Yatskar, Da Yin, Cho-Jui Hsieh, and Kai-Wei Chang. 2019. VisualBERT: A simple and performant baseline for vision and language. arXiv:1908.03557 (2019)

  20. [20]

    Dan Hendrycks and Thomas Dietterich. 2019. Benchmarking neural network robustness to common corruptions and perturbations. In ICLR

  21. [21]

    Karen Simonyan and Andrew Zisserman. 2015. Very deep convolutional networks for large-scale image recognition. In ICLR

  22. [22]

    Alireza Zareian, Kevin Dela Rosa, Derek Hao Hu, and Shih-Fu Chang. 2021. Open-vocabulary object detection using captions. In CVPR

  23. [23]

    Timo Schick and Hinrich Schütze. 2021. Exploiting cloze-questions for few-shot text classification and natural language inference. In EACL

  24. [24]

    Marco Tulio Ribeiro, Sameer Singh, and Carlos Guestrin. 2016. "Why should I trust you?": Explaining the predictions of any classifier. In KDD

  25. [25]

    Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. 2018. Sanity checks for saliency maps. In NeurIPS