Semantic Robustness Certification for Vision-Language Models

Amardeep Kaur; Andrew C. Cullen; Christopher Leckie; Feng Liu; Paul Montague; Peiyu Yang; Sarah M. Erfani

arxiv: 2606.18839 · v1 · pith:JL3BO5LLnew · submitted 2026-06-17 · 💻 cs.LG · cs.CV

Peiyu Yang, Paul Montague, Feng Liu, Andrew C. Cullen, Amardeep Kaur, Christopher Leckie, Sarah M. Erfani

Pith reviewed 2026-06-26 21:30 UTC · model grok-4.3

keywords vision-language models · robustness certification · semantic transformations · decision boundary · text prompts · open-vocabulary models · extent parameterization

The pith

A framework certifies vision-language model robustness under semantic transformations by using text prompts as proxies and deriving closed-form decision boundaries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a method to certify that a vision-language model's predicted class stays the same when its input undergoes semantic changes such as altered shape, size, or style. Text prompts serve as proxies to represent these changes, with a single extent parameter controlling how far the change goes. The method finds an exact mathematical expression for the model's decision boundary and uses it to compute the full range of extent values that leave the prediction unchanged. This avoids the need to gather new training examples for every possible variation. A reader would care because most existing robustness checks address only pixel or geometric shifts, while real applications encounter exactly these semantic drifts.

Core claim

Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation.

What carries the argument

Closed-form characterization of the VLM decision boundary under extent-parameterized semantic transformations proxied by text prompts.

If this is right

The framework certifies robustness to semantic variations without collecting additional data for each variation type.
Quantitative extent intervals are produced that guarantee the prediction remains stable.
The approach works across both synthetic and real-world datasets for diverse semantic changes.
Certification becomes practical for downstream tasks that encounter natural distribution shifts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The certified intervals could be used to rank different prompts or models by how wide a semantic range they tolerate.
Similar closed-form boundary techniques might apply to other multimodal models if their outputs can be expressed in comparable algebraic form.
Deployed systems could track the extent parameter of incoming inputs and flag cases that fall outside certified intervals.
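
A minimal sketch of how the ranking and monitoring ideas above might look in practice. The function names, the example intervals, and the choice of total interval length as the ranking score are illustrative assumptions, not something the paper specifies; the intervals for each name are assumed to be disjoint.

```python
def is_certified(phi, intervals):
    """Return True if the extent phi lies inside any certified (lo, hi) interval."""
    return any(lo <= phi <= hi for lo, hi in intervals)


def rank_by_certified_width(certificates):
    """Order names (prompts or models) by total certified interval length, widest first.

    certificates maps a name to a list of disjoint (lo, hi) intervals.
    """
    widths = {
        name: sum(hi - lo for lo, hi in intervals)
        for name, intervals in certificates.items()
    }
    return sorted(widths, key=widths.get, reverse=True)


# Hypothetical certificates over a normalized extent in [0, 1].
certificates = {
    "prompt_a": [(0.0, 0.4)],
    "prompt_b": [(0.0, 0.4), (0.7, 1.0)],
}
print(rank_by_certified_width(certificates))
print(is_certified(0.5, certificates["prompt_b"]))
```

A deployed monitor would still need some way to estimate the extent of an incoming input, which this sketch leaves open.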

Load-bearing premise

Text prompts serve as faithful semantic proxies for visual transformations and the VLM decision boundary admits a closed-form characterization allowing direct interval computation.

What would settle it

An experiment in which the model's actual class prediction changes inside a certified extent interval, or in which the computed closed-form boundary disagrees with the model's observed outputs on transformed inputs.
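
The falsification test described above can be sketched as a sampling check. Here `predict` is a hypothetical callable that returns the model's class for the input transformed to extent `phi`; the grid size and the choice of the interval's lower end as the reference prediction are assumptions of this sketch.

```python
def find_violations(predict, interval, num_samples=101):
    """Sample extents in a certified interval and collect those where the class changes.

    predict: callable mapping an extent phi to a predicted class label.
    interval: (lo, hi) pair claimed to be prediction-invariant.
    """
    lo, hi = interval
    reference = predict(lo)  # class the certificate claims is preserved
    violations = []
    for i in range(num_samples):
        phi = lo + (hi - lo) * i / (num_samples - 1)
        if predict(phi) != reference:
            violations.append(phi)
    return violations


# Toy stand-in for a model whose prediction flips at extent 0.6.
def toy_predict(phi):
    return "cat" if phi < 0.6 else "dog"


print(find_violations(toy_predict, (0.0, 0.5)))        # empty: nothing contradicts the interval
print(len(find_violations(toy_predict, (0.0, 0.8))))   # non-zero: the interval is refuted
```

An empty result does not prove the certificate, since sampling can miss a narrow flip; a single violation is enough to refute it.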

Figures

Figures reproduced from arXiv: 2606.18839 by Amardeep Kaur, Andrew C. Cullen, Christopher Leckie, Feng Liu, Paul Montague, Peiyu Yang, Sarah M. Erfani.

**Figure 1.** Illustration of our semantic robustness certificates for VLMs. Each column specifies a target semantic with a text proxy. The certified prediction-invariant intervals are visualized over a normalized semantic extent φ ∈ [0, 1]. Nearest images are retrieved from the dataset via similarity to the transformed embedding (φ = 1) as visual references, with labels and similarities shown.

**Figure 2.** Illustration of the semantic transformation in a three-dimensional visualization of the VLM embedding space. With the basis (e_1, e_2), embeddings in P_{a,a′} can be parameterized by an extent φ that controls the relative strengths with respect to (u_a, u_a′). The source semantic extent φ_a ∈ (−π, π] in P_{a,a′} is defined from the orthogonal projection z_∥ as φ_a := atan2(⟨z_∥, e_2⟩, ⟨z_∥, e_1⟩) (Eq. 8). We assume a target …
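
The source-extent definition in the Figure 2 caption can be sketched in a few lines. The caption does not say how the basis (e_1, e_2) is built, so the Gram-Schmidt construction from two direction vectors below is an assumption, as are the function names and the toy vectors.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def normalize(u):
    norm = math.sqrt(dot(u, u))
    return [a / norm for a in u]


def plane_basis(u_a, u_a_prime):
    """Orthonormal basis (e1, e2) of span{u_a, u_a_prime} via Gram-Schmidt (assumed)."""
    e1 = normalize(u_a)
    proj = dot(u_a_prime, e1)
    e2 = normalize([b - proj * a for a, b in zip(e1, u_a_prime)])
    return e1, e2


def source_extent(z, e1, e2):
    """phi_a = atan2(<z_par, e2>, <z_par, e1>).

    Because e1 and e2 lie in the plane, <z_par, e_i> equals <z, e_i>,
    so the projection z_par never has to be formed explicitly.
    """
    return math.atan2(dot(z, e2), dot(z, e1))


e1, e2 = plane_basis([1.0, 0.0, 0.0], [1.0, 1.0, 0.0])
phi_a = source_extent([1.0, 1.0, 5.0], e1, e2)
print(phi_a)  # the out-of-plane component (5.0) does not affect the angle
```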

**Figure 3.** Illustration of our VLM robustness certificates. Prediction-invariant intervals are completely certified over a normalized semantic extent φ ∈ [0, 1] for diverse semantic variations across domains. Text prompts serve as proxies for specifying different target semantics.

**Figure 4.** Illustration of descriptors grouped by attribute type. Using a fixed prompt template (e.g., “a photo of a {attribute} class”), we vary only the descriptor to model semantic variations. […] provides complete certification without additional inputs or supervision (Mirman et al., 2021; Yuan et al., 2023), and therefore serves as our primary baseline. Visual Reference Transformation and Metric. Ground-truth semant…

**Figure 6.** Images of synthetic and real-world semantic variations.

**Figure 7.** Cross-modal semantic consistency on an ImageNet subset. For each image, we form a prompt family by fixing a template and varying only the attribute word, e.g., “a photo of a [attribute] [class]”. (a) We select the most similar prompt to the image as the anchor t*, rank the remaining prompts by their cosine similarity to t* in the prompt embedding space, and plot image-to-prompt cosine similarity over th…
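
The anchor selection and ranking steps in the Figure 7 caption could be sketched as follows. The embeddings are toy two-dimensional vectors standing in for real image and prompt embeddings, the function names are hypothetical, and the descending sort order is an assumption since the caption does not state a direction.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def anchor_and_ranking(image_emb, prompt_embs):
    """Pick the prompt closest to the image, then rank the rest by similarity to it.

    prompt_embs maps an attribute word to its prompt embedding.
    """
    anchor = max(prompt_embs, key=lambda name: cosine(image_emb, prompt_embs[name]))
    others = [name for name in prompt_embs if name != anchor]
    ranked = sorted(
        others,
        key=lambda name: cosine(prompt_embs[anchor], prompt_embs[name]),
        reverse=True,  # assumed: most similar to the anchor first
    )
    return anchor, ranked


prompts = {"red": [1.0, 0.1], "blue": [0.0, 1.0], "crimson": [1.0, 0.3]}
anchor, ranked = anchor_and_ranking([1.0, 0.0], prompts)
print(anchor, ranked)
```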

**Figure 8.** Alignment to semantic variation in the input space. For each dataset, we compare our constructed semantic transformation with a visual reference transformation constructed from the annotated image sequence with gradual semantic variations. For each semantic instance, we uniformly sample extents φ ∈ [0, 1], compute cosine similarity between the transformed embedding and the visual reference embedding at eac…

**Figure 9.** Examples of prompt variants on ImageNet. We consider category name variants from ImageNet annotations, template variants for VLM prompting, and attribute synonym sets for semantics such as color, background, size, and texture. Within each type, we form a prompt family by substituting only the corresponding word or phrase while keeping the remaining prompt structure fixed, e.g., “a photo of a Tench” → “a ph…

**Figure 10.** Prompt cosine similarity under prompt variations on an ImageNet subset. For each variation type, we compute cosine similarity between a reference prompt (shown in bold) and its variants within the same prompt family, and report the distribution across prompt families. […] uncommon aliases. For example, a class may be annotated with both a common name and a scientific name (e.g., goldfish vs. Carassius cuvieri…

**Figure 11.** Semantic strength via similarity. For each dataset, we sample 20 semantic pairs (a, a′) and randomly split images from the two classes into two disjoint subsets. We compute class mean visual embeddings z̄_a, z̄_a′ on images from one subset and form the semantic direction by v_{a,a′} = z̄_a′ − z̄_a. We then score images in the other subset by t(x_i) = ⟨z_i, v_{a,a′}⟩ with z_i = f_img(x_i), sort by t(x_i), and partition …
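
The scoring procedure in the Figure 11 caption, up to the sorting step, can be sketched as below. The caption is truncated before the partitioning rule, so that step is left out; the function names and the toy embeddings (standing in for f_img outputs) are assumptions.

```python
def mean_vec(vectors):
    """Component-wise mean of a list of equal-length vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def semantic_direction(emb_a, emb_a_prime):
    """v = mean(z | a') - mean(z | a), computed on the first subset of images."""
    mean_a, mean_a_prime = mean_vec(emb_a), mean_vec(emb_a_prime)
    return [q - p for p, q in zip(mean_a, mean_a_prime)]


def sort_by_strength(embeddings, v):
    """Indices of held-out embeddings sorted by t(x_i) = <z_i, v>, ascending."""
    scores = [sum(z_k * v_k for z_k, v_k in zip(z, v)) for z in embeddings]
    return sorted(range(len(embeddings)), key=lambda i: scores[i])


v = semantic_direction([[0.0, 0.0], [0.0, 2.0]], [[2.0, 0.0], [2.0, 2.0]])
order = sort_by_strength([[1.0, 0.0], [-1.0, 5.0], [3.0, 1.0]], v)
print(v, order)
```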

Original abstract

Vision-language models (VLMs) are now widely used in downstream tasks. However, real-world applications often expose VLMs to distribution shifts induced by semantic variation (e.g., shape, size, and style). Robustness certification determines if a model's prediction changes when transformations are applied to its input. While most certification frameworks study geometric or pixel-level transformations over inputs, this work proposes a novel framework that enables certifying VLM robustness under semantic-level transformations. Leveraging the open-vocabulary capability of VLMs, we use text prompts as semantic proxies to construct transformations parameterized by an extent that controls the degree of semantic variation. By characterizing the VLM decision boundary in closed form, our framework quantitatively certifies extent intervals for which the predicted class remains unchanged under the semantic transformation. Our framework is the first to certify VLM robustness under semantic-level variations without requiring additional data for each variation, making it practical to apply. Experiments on both synthetic and real-world data show that our framework enables certifying robustness under diverse semantic variations across scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper offers a practical angle on semantic robustness certification for VLMs by using text prompts as proxies, but the closed-form boundary claim is the part that needs the most scrutiny.

The core idea is to certify how far you can push semantic changes (shape, size, style) on a VLM input before the predicted class flips, by treating text prompts as stand-ins for those changes and parameterizing them with an extent value. They claim this lets them compute exact intervals where the prediction stays stable by solving the decision boundary analytically.

What is new is the focus on semantic-level shifts for VLMs without collecting fresh data for every variation. Prior work has mostly handled geometric or pixel perturbations, so routing through the text side to proxy visual semantics is a reasonable move that fits how these models actually work.

The practical framing is also useful: it aims at real deployment scenarios where semantic drift is common. If the method delivers on the intervals, it could be a step toward more usable guarantees.

The weakest link is the closed-form characterization itself. The abstract states they derive exact extent intervals from the boundary, yet it does not show the functional mapping from extent to the image or text embeddings, or to the cosine similarities that drive the argmax. Without an explicit form (linear, parametric family, or otherwise), it is not obvious that the crossing point can be solved analytically rather than numerically for a general VLM. That step is load-bearing; if it does not hold, the quantitative certification reduces to something less precise.

Experiments are mentioned on synthetic and real data, but the abstract gives no numbers, baselines, or tightness results, so it is hard to judge whether the certificates are meaningful or loose.

This is for people working on robustness for multimodal models who want ideas beyond pixel-level methods. A reader already familiar with certification frameworks could extract the proxy idea and see whether it extends to their setting.

I would send it to review so the authors can supply the missing derivation and show the experimental support. The direction is worth checking even if the current write-up leaves the central math underspecified.

Referee Report

1 major / 1 minor

Summary. The paper proposes a framework for certifying robustness of vision-language models under semantic transformations (shape, size, style) by treating text prompts as parameterized semantic proxies controlled by an 'extent' variable. It claims to derive a closed-form characterization of the VLM decision boundary, enabling quantitative certification of intervals of the extent parameter over which the predicted class is invariant, without needing extra data per variation. Experiments on synthetic and real data are reported to support applicability across scenarios.

Significance. If the closed-form boundary characterization is valid, the work would be a notable advance: the first certification method targeting semantic-level shifts in VLMs that remains practical (no per-variation retraining or data). It directly addresses a gap left by geometric/pixel-level certification frameworks and could improve reliability of open-vocabulary VLMs in deployment.

major comments (1)

[Abstract] Abstract (central claim): the assertion that the VLM decision boundary 'admits a closed-form characterization' permitting direct interval computation is load-bearing for the entire quantitative certification result. The abstract provides no functional form relating the extent parameter to image embeddings f(I(extent)) or effective text embeddings; without an explicit analytic mapping (e.g., linear interpolation or known parametric family) whose cosine-similarity roots can be solved in closed form, the argmax decision boundary cannot be inverted analytically for general VLMs, rendering the certification claim unsupported.

minor comments (1)

The abstract would be clearer if it briefly indicated the assumed functional form of the extent-to-embedding map or gave the explicit boundary equation whose roots are claimed to be closed-form.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful review and for identifying the need to substantiate the closed-form characterization more explicitly. We address the concern point by point below.

Referee: [Abstract] Abstract (central claim): the assertion that the VLM decision boundary 'admits a closed-form characterization' permitting direct interval computation is load-bearing for the entire quantitative certification result. The abstract provides no functional form relating the extent parameter to image embeddings f(I(extent)) or effective text embeddings; without an explicit analytic mapping (e.g., linear interpolation or known parametric family) whose cosine-similarity roots can be solved in closed form, the argmax decision boundary cannot be inverted analytically for general VLMs, rendering the certification claim unsupported.

Authors: We agree the abstract is concise and omits the explicit mapping. Section 3 of the manuscript defines the extent-parameterized semantic proxy by linearly interpolating between text embeddings of base prompts that represent the semantic variation (e.g., low-to-high extent of shape or style). The resulting text embedding is therefore affine in the extent variable. The VLM decision is the argmax over cosine similarities between the (fixed) image embedding and these parameterized text embeddings. Substituting the affine form yields cosine-similarity scores that are quadratic in the extent; the decision-boundary crossings are therefore the real roots of quadratic equations, which are obtained in closed form via the quadratic formula. This supplies the quantitative interval certification without per-variation data or numerical search. We will revise the abstract to include a one-sentence statement of the affine embedding assumption and the resulting quadratic closed-form boundary.

revision: yes
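
One way to read the rebuttal's argument is sketched below, under assumptions that go beyond what the page states: a single prompt embedding is interpolated as t(φ) = (1 − φ)·t0 + φ·t1, and it is compared against a competing class whose cosine score `s` is held fixed. Squaring the boundary equation cos(z, t(φ)) = s then gives a quadratic in φ. The function name and toy vectors are hypothetical; if both scores varied with the extent, the squared equation might no longer be quadratic.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def boundary_extents(z, t0, t1, s, eps=1e-12):
    """Extents phi where cos(z, (1 - phi) * t0 + phi * t1) equals the fixed score s."""
    d = [b - a for a, b in zip(t0, t1)]
    a, b = dot(z, t0), dot(z, d)  # <z, t(phi)> = a + b * phi
    alpha, beta, gamma = dot(t0, t0), 2 * dot(t0, d), dot(d, d)  # ||t(phi)||^2 coefficients
    k = s * s * dot(z, z)
    # Squared boundary: (a + b*phi)^2 = k * (alpha + beta*phi + gamma*phi^2)
    A, B, C = b * b - k * gamma, 2 * a * b - k * beta, a * a - k * alpha
    if abs(A) < eps:  # quadratic term vanishes, equation is linear
        roots = [] if abs(B) < eps else [-C / B]
    else:
        disc = B * B - 4 * A * C
        if disc < 0:
            return []
        r = math.sqrt(disc)
        roots = [(-B - r) / (2 * A), (-B + r) / (2 * A)]
    # Squaring can introduce spurious roots; keep those whose numerator sign matches s.
    return sorted(p for p in roots if (a + b * p) * s >= 0)


# Toy case: t(phi) = (1 - phi, phi) and z = (1, 0); the score 0.6 is reached at phi = 4/7.
print(boundary_extents([1.0, 0.0], [1.0, 0.0], [0.0, 1.0], 0.6))
```

Under these assumptions, candidate interval endpoints would be the roots that fall inside the extent range, with the prediction checked at one point between consecutive roots.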

Circularity Check

0 steps flagged

No significant circularity; derivation relies on model similarity scores without self-referential reduction

The paper's central claim is a closed-form characterization of the VLM decision boundary under text-prompt-parameterized semantic transformations, enabling direct interval certification for unchanged predictions. No equations or steps in the abstract or description reduce the certification output to a fitted parameter or self-citation by construction. The framework treats the VLM's open-vocabulary similarity scores as given inputs and derives extent intervals from them analytically, without evidence of the result being equivalent to its inputs via self-definition, renaming, or load-bearing self-citation. This is a standard non-circular outcome for a certification method that assumes an analytic boundary form.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no concrete free parameters, axioms, or invented entities; insufficient information to populate the ledger.

Reference graph

Works this paper leans on

14 extracted references · 10 canonical work pages · 4 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

[2]

Food-101 – mining discriminative components with random forests

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In Computer Vision – ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pp. 446–461. Springer, 2014.

[3]

Interpreting CLIP: Insights on the robustness to ImageNet distribution shifts

Crabbé, J., Rodríguez, P., Shankar, V., Zappella, L., and Blaas, A. Interpreting CLIP: Insights on the robustness to ImageNet distribution shifts. arXiv preprint arXiv:2310.13040.

[4]

Complete verification via multi-neuron relaxation guided branch-and-bound

Ferrari, C., Muller, M. N., Jovanovic, N., and Vechev, M. Complete verification via multi-neuron relaxation guided branch-and-bound. arXiv preprint arXiv:2205.00263.

[5]

Guo, D., Wu, F., Zhu, F., Leng, F., Shi, G., Chen, H., Fan, H., Wang, J., Jiang, J., Wang, J., et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062.

[6]

MMT-ARD: Multimodal multi-teacher adversarial distillation for robust vision-language models

Li, Y., Dong, J., Yang, C., Wen, S., Koniusz, P., Huang, T., Tian, Y., and Ong, Y.-S. MMT-ARD: Multimodal multi-teacher adversarial distillation for robust vision-language models. arXiv preprint arXiv:2511.17448, 2025a.

Li, Y., Yang, C., Dong, J., Yao, Z., Xu, H., Dong, Z., Zeng, H., An, Z., and Tian, Y. AMMKD: Adaptive multimodal multi-teacher disti...

[7]

Fine-Grained Visual Classification of Aircraft

Maji, S., Rahtu, E., Kannala, J., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151.

[8]

Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models

Schlarmann, C., Singh, N. D., Croce, F., and Hein, M. Robust CLIP: Unsupervised adversarial fine-tuning of vision embeddings for robust large vision-language models. arXiv preprint arXiv:2402.12336.

[9]

UCF101: A Dataset of 101 Human Actions Classes From Videos in The Wild

Soomro, K., Zamir, A. R., and Shah, M. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402.

[10]

Local path integration for attribution

Yang, P., Akhtar, N., Wen, Z., and Mian, A. Local path integration for attribution. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 37, pp. 3173–3180, 2023a.

Yang, P., Akhtar, N., Wen, Z., Shah, M., and Mian, A. S. Recalibrating feature attributions for model interpretation. In International Conference on Learning Representati...

[11]

ANTS: Adaptive negative textual space shaping for OOD detection via test-time MLLM understanding and reasoning

Zhu, W., Zhang, Y., Jin, X., Zeng, W., and Zhang, L. ANTS: Adaptive negative textual space shaping for OOD detection via test-time MLLM understanding and reasoning. arXiv preprint arXiv:2509.03951.

[12]

Therefore, φ ∈ U_{c,c′}(δ). B. Experimental Setup. In this work, we use the publicly available pretrained CLIP ViT-B/16 model released by OpenAI (Radford et al., 2021). All experiments are conducted using an NVIDIA 3090Ti GPU (24GB), a 16-core 3.9GHz Intel Core i9-12900K CPU, and 128GB RAM. To evaluate semantic robustness under controllable semantic extents, w...

[13]

and CycleGAN (Zhu et al., 2017)) often produced unrealistic outputs when asked to enforce semantic shifts on out-of-domain objects, and diffusion-based generators (e.g., InstructPix2Pix (Brooks et al., 2023)) frequently introduced visible artifacts or drifted from the input identity, which injects unintended semantic factors. We therefore use multimodal l...

[14]

and Seedream (Guo et al., 2025)) to construct synthetic image sequences. Concretely, for each dataset we choose three representative classes and sample seed images per class. For each seed image, we generate at least one pair of ID and OOD semantic shifts for each semantic. Each shift is instantiated as an ordered image sequence with an explicit semantic ...
