pith. machine review for the scientific record.

arxiv: 2605.12876 · v1 · submitted 2026-05-13 · 💻 cs.LG

Recognition: 2 theorem links


Certified Robustness under Heterogeneous Perturbations via Hybrid Randomized Smoothing


Pith reviewed 2026-05-14 20:19 UTC · model grok-4.3

classification 💻 cs.LG
keywords: certified robustness · randomized smoothing · multimodal models · hybrid perturbations · Neyman-Pearson · discrete-continuous inputs · adversarial robustness

The pith

Hybrid randomized smoothing yields a closed-form one-dimensional certificate that generalizes both Gaussian and discrete smoothing for joint discrete-continuous inputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Randomized smoothing certifies robustness but has so far been limited to a single input type, such as images or text. This work unifies the continuous and discrete regimes by treating the adversary's joint perturbation of mixed discrete and continuous inputs as a single worst-case problem. The authors recast that problem as an analytically solvable Neyman-Pearson test that exploits the likelihood ordering created by independent discrete and continuous noise distributions. The resulting certificate is a simple one-dimensional function that recovers the classical Gaussian certificate when the discrete part vanishes and the classical discrete certificate when the continuous part vanishes. The framework is demonstrated on a multimodal safety-filtering task that requires simultaneous protection of text tokens and image pixels.

Core claim

By analyzing the joint likelihood ordering induced by factorized discrete and continuous noise, the approach yields a closed-form, one-dimensional certificate that strictly generalizes both Gaussian (image-only) and discrete (text-only) randomized smoothing and supplies the first model-agnostic Neyman-Pearson certificate for joint discrete-token and continuous-image perturbations.
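As a concrete anchor for the image-only special case the claim says it recovers, here is the standard Gaussian smoothing certificate of Cohen et al. (2019) in the binary setting (where the runner-up probability is 1 − p_A); the function name and interface are illustrative, not the paper's code.

```python
# Gaussian randomized-smoothing radius (Cohen et al., 2019), binary case:
# the image-only certificate the hybrid formula is claimed to recover.
from statistics import NormalDist

def gaussian_certified_radius(p_a: float, sigma: float) -> float:
    """Certified l2 radius when the smoothed classifier's top-class
    probability under N(0, sigma^2 I) noise is at least p_a."""
    if p_a <= 0.5:
        return 0.0  # no majority vote, no certificate
    return sigma * NormalDist().inv_cdf(p_a)
```

With sigma = 0.5 and p_a = 0.9 this gives roughly 0.64; the hybrid certificate should reduce to exactly this quantity when the discrete budget d is zero.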

What carries the argument

Analytically tractable Neyman-Pearson formulation of the joint worst-case problem under factorized discrete and continuous noise.

If this is right

  • Supplies the first model-agnostic certificate that simultaneously protects against discrete-token and continuous-image perturbations.
  • Recovers existing Gaussian and discrete smoothing bounds as special cases of a single formula.
  • Enables certified safety filtering for text-image models without requiring modality-specific retraining.
  • Reduces the certification computation to a one-dimensional search over the joint likelihood ratio.
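The univariate reduction in the last bullet can be made concrete on a toy factorized model: one binary token passed through a flip channel plus a Gaussian shift on the image channel. The flip channel and budgets here are illustrative stand-ins; only the structure — a single scalar threshold on the joint log-likelihood ratio — mirrors the paper's claim.

```python
# Toy one-dimensional Neyman-Pearson worst case under factorized noise:
# log L = log L_D + log L_C, so the worst-case set is a sub-level set of
# one scalar statistic, found by bisection on a single threshold t.
from math import log
from statistics import NormalDist

def hybrid_worst_case(p_a, q, delta, sigma):
    """Worst-case perturbed-input probability of the favored class, given a
    clean-noise lower bound p_a, one flipped token (noise flip prob q), and
    an l2 image shift delta under Gaussian noise of scale sigma."""
    N = NormalDist()
    # Discrete log-likelihood-ratio atoms: (value, clean prob, perturbed prob).
    atoms = [(log(q / (1 - q)), 1 - q, q),     # noised token looks clean
             (log((1 - q) / q), q, 1 - q)]    # noised token looks attacked
    mu, s = delta**2 / (2 * sigma**2), delta / sigma  # Gaussian log-ratio stats
    def mass(t, shift, idx):  # P(log L <= t) as a mixture over discrete atoms
        return sum(a[idx] * N.cdf((t - a[0] + shift) / s) for a in atoms)
    p_clean = lambda t: mass(t, +mu, 1)  # increasing in t
    p_pert = lambda t: mass(t, -mu, 2)
    lo, hi = -60.0, 60.0                 # 1-D bisection on the NP threshold t
    for _ in range(200):
        mid = (lo + hi) / 2
        if p_clean(mid) < p_a:
            lo = mid
        else:
            hi = mid
    return p_pert((lo + hi) / 2)
```

The prediction is certified whenever the returned worst case stays above 1/2. When q = 0.5 the token carries no information and the computation collapses to the pure Gaussian bound Φ(Φ⁻¹(p_a) − δ/σ), which is the kind of special-case recovery the paper claims.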

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same factorized-noise construction could be applied to other heterogeneous pairs such as audio waveforms with discrete metadata.
  • If the factorized assumption is relaxed, the certificate may become tighter but would likely lose its closed-form character.
  • The one-dimensional reduction suggests that similar likelihood-ordering arguments could simplify certification for additional multimodal architectures.

Load-bearing premise

The joint likelihood ordering induced by factorized discrete and continuous noise permits an analytically tractable Neyman-Pearson formulation of the worst-case problem.

What would settle it

An explicit joint perturbation attack on a multimodal model for which the smoothed classifier's accuracy falls below the radius predicted by the one-dimensional certificate.
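A minimal sketch of such a falsification attempt, following the block-coordinate attack the paper describes (ℓ2 PGD on the image, append-only tokens on the text), run here against a toy linear score; the model, token scores, and budgets are invented for illustration. A real test would aim this at the smoothed multimodal classifier inside its certified (ϵ, d) region.

```python
# Joint attack sketch: alternate an l2-projected gradient step on the image
# with a greedy append-only token choice, driving a toy "Unsafe" score
# negative (i.e. toward a "Safe" misclassification).
import numpy as np

rng = np.random.default_rng(0)
W_img = rng.normal(size=8)          # toy image weights (stand-in model)
W_tok = rng.normal(size=16)         # toy per-token contribution to the score

def score(img, suffix):             # > 0 means "Unsafe"
    return W_img @ img + sum(W_tok[t] for t in suffix)

def joint_attack(img, eps=0.5, d=1, steps=20, lr=0.1):
    """Block-coordinate ascent: image PGD step, then append-only token pick."""
    x0, x = img.copy(), img.copy()
    suffix = []
    for _ in range(steps):
        x = x - lr * W_img          # gradient step that lowers the score
        delta = x - x0              # project back onto the l2 ball of radius eps
        n = np.linalg.norm(delta)
        if n > eps:
            x = x0 + delta * (eps / n)
        if len(suffix) < d:         # greedily append the most score-lowering token
            suffix.append(int(np.argmin(W_tok)))
    return x, suffix

img = rng.normal(size=8)
adv_img, adv_suffix = joint_attack(img)
# The attack refutes the certificate only if it flips the *smoothed* decision
# at an (eps, d) point the one-dimensional certificate declared robust.
```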

Figures

Figures reproduced from arXiv: 2605.12876 by Blaise Delattre, Hengyu Wu, Paul Caillon, Wei Yang Bryan Lim, Yang Cao.

Figure 1
Figure 1: Conceptual interaction-only safety behavior. Each modality is classified as Safe in isolation, while the joint text–image input is classified as Unsafe. This illustrates why the certified object must be the joint multimodal decision rather than two unimodal decisions. Multimodal safety filtering in large foundation models provides a canonical and high-stakes instance of this problem. Large multimodal mo… view at source ↗
Figure 2
Figure 2 shows a monotone trade-off: certified accuracy degrades as d increases, with no certification beyond ϵ = 0.75 for d = 2, while for d = 0 certification holds up to ϵ = 1.9 at over 30% certified accuracy, confirming that the hybrid certificate captures a joint robustness trade-off. (Plot: certified accuracy (%) vs. ϵ, curves for d = 0, 1, 2.) view at source ↗
Figure 3
Figure 3: Multimodal threat model for safety filtering. The image alone and the text alone are classified as "Safe", while their combination is "Unsafe". An adversary applies an ℓ2-bounded perturbation to the image and an append-only suffix attack to the text, with the goal of inducing a "Safe" joint decision. Threat Model. We consider a joint multimodal adversary that perturbs both the text and image inputs. Text:… view at source ↗
Figure 4
Figure 4: Certified accuracy as a function of the image perturbation radius ϵ (ℓ2 threat on images), combined with adversarial text suffix attacks. Curves correspond to different discrete text budgets d. Gaussian image smoothing uses σ = 0.5 (left) and σ = 1.0 (right). Certified accuracy decreases monotonically as either the image radius ϵ or the text budget d increases, as predicted by the hybrid Neyman–Pearson analysis. Larger text b… view at source ↗
Figure 5
Figure 5 reports the certified accuracy of the hybrid randomized smoothing certificate as a function of the image perturbation radius ϵ, for a fixed discrete text budget d = 1. We evaluate this setting on a subset of 100 interaction-only examples drawn from the dataset constructed in Section 5.2. This subset consists of examples for which at least one certification method (text-only or hybrid) yields a non-zero cer… view at source ↗
Figure 6
Figure 6: "Look how many people love you." Interaction-only example from the Hateful Memes dataset (Kiela et al., 2020). The image and the text are each individually classified as safe, while their joint interpretation is classified as unsafe, illustrating that harmful content can arise purely from multimodal interaction. This behavior highlights the extreme sensitivity of certain interaction-only examples to minima… view at source ↗
read the original abstract

Randomized smoothing provides strong, model-agnostic robustness certificates, but existing guarantees are limited to single modalities, treating continuous and discrete inputs in isolation. This limitation becomes critical in multimodal models, where decisions depend on cross-modal semantics and adversaries can jointly perturb heterogeneous inputs, rendering unimodal certificates insufficient. We introduce a unified randomized smoothing framework for mixed discrete--continuous inputs based on an analytically tractable Neyman--Pearson formulation of the joint worst-case problem. By analyzing the joint likelihood ordering induced by factorized discrete and continuous noise, our approach yields a closed-form, one-dimensional certificate that strictly generalizes both Gaussian (image-only) and discrete (text-only) randomized smoothing. We validate the framework on multimodal safety filtering, providing, to our knowledge, the first model-agnostic Neyman--Pearson certificate for joint discrete-token and continuous-image perturbations in interaction-dependent text--image safety filtering.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces a unified randomized smoothing framework for mixed discrete-continuous inputs in multimodal models. It formulates the joint worst-case robustness problem via the Neyman-Pearson lemma applied to factorized discrete and continuous noise, deriving a closed-form one-dimensional certificate that generalizes both standard Gaussian smoothing for images and discrete multinomial bounds for text. The approach is validated on multimodal safety filtering tasks involving joint token and image perturbations.

Significance. If the central derivation holds, the result is significant because it supplies the first model-agnostic Neyman-Pearson certificate that simultaneously handles heterogeneous perturbations whose effects interact through cross-modal semantics, extending randomized smoothing beyond unimodal settings where separate certificates are provably insufficient.

major comments (1)
  1. [Main theorem / Neyman-Pearson derivation] The core claim that the joint likelihood ratio under independent discrete and continuous noise admits an analytically tractable one-dimensional worst-case formulation (abstract and the main theorem section) is load-bearing yet insufficiently constructed: the argument must explicitly show why the Neyman-Pearson threshold depends on a single scalar rather than separate parameters for the discrete support size and continuous density scale, otherwise the optimization remains multi-dimensional and the closed-form reduction does not follow.
minor comments (1)
  1. [Abstract] The abstract asserts generalization to both Gaussian and discrete cases but does not display the explicit functional form of the resulting certificate (e.g., the expression involving the inverse CDF or multinomial tail that reduces to each special case); including this in the main text would aid immediate verification.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment of the significance of our unified Neyman-Pearson framework for heterogeneous perturbations and for the constructive feedback on the main derivation. We address the single major comment below and will revise the manuscript accordingly to strengthen the exposition.

read point-by-point responses
  1. Referee: [Main theorem / Neyman-Pearson derivation] The core claim that the joint likelihood ratio under independent discrete and continuous noise admits an analytically tractable one-dimensional worst-case formulation (abstract and the main theorem section) is load-bearing yet insufficiently constructed: the argument must explicitly show why the Neyman-Pearson threshold depends on a single scalar rather than separate parameters for the discrete support size and continuous density scale, otherwise the optimization remains multi-dimensional and the closed-form reduction does not follow.

    Authors: We agree that the reduction to a one-dimensional certificate requires a clearer step-by-step argument. Because the joint noise distribution factorizes as p(x,y) = p_D(x) p_C(y), the joint likelihood ratio under any pair of inputs is exactly the product Λ(x,y) = Λ_D(x) · Λ_C(y). The Neyman-Pearson test for the worst-case adversary therefore seeks the infimum over all (x,y) of the measure of the set where Λ(x,y) exceeds a threshold t. Due to independence, the level sets of log Λ = log Λ_D + log Λ_C can be ordered monotonically by a single scalar t: for any fixed t the worst-case probability is obtained by independently choosing the discrete and continuous perturbations that maximize the tail probabilities subject to the same additive threshold split (t = t_D + t_C). This collapses the joint optimization to a univariate search over t (or equivalently over the combined p-value), exactly as in the pure-Gaussian and pure-discrete cases. The manuscript states this factorization and the resulting univariate form in the main theorem, but we acknowledge the exposition is terse. In the revision we will insert an explicit lemma showing the reduction from the two-dimensional (support-size, scale) parameter space to the single scalar t, together with the corresponding one-dimensional integral expression for the certificate. revision: yes
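The rebuttal's reduction can be written compactly. The notation below is ours (p_D, p_C are the clean smoothing densities, primes mark their perturbed counterparts); this is a sketch of the argument as stated in the rebuttal, not the paper's exact lemma.

```latex
% Factorized noise gives a product likelihood ratio, hence a scalar statistic:
\[
\Lambda(x,y) \;=\; \frac{p'_D(x)\,p'_C(y)}{p_D(x)\,p_C(y)}
             \;=\; \Lambda_D(x)\,\Lambda_C(y),
\qquad
\log\Lambda \;=\; \log\Lambda_D + \log\Lambda_C .
\]
% The Neyman--Pearson worst case over all decision sets with clean-noise
% probability at least p_A is attained on a sub-level set of log(Lambda),
% so certification is a one-dimensional root-find in the threshold t:
\[
h(p_A) \;=\; \inf_{A:\;\mathbb{P}_{\mathrm{clean}}(A)\ge p_A}
             \mathbb{P}_{\mathrm{pert}}(A)
       \;=\; \mathbb{P}_{\mathrm{pert}}\bigl(\log\Lambda \le t^{\star}\bigr),
\qquad
\mathbb{P}_{\mathrm{clean}}\bigl(\log\Lambda \le t^{\star}\bigr) = p_A .
\]
% The smoothed prediction is certified whenever h(p_A) > 1/2.
```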

Circularity Check

0 steps flagged

No significant circularity; derivation applies Neyman-Pearson to factorized joint likelihood

full rationale

The central result is obtained by analyzing the joint likelihood ordering under factorized discrete-continuous noise and applying the Neyman-Pearson lemma to obtain a one-dimensional certificate. This is a direct statistical derivation from the problem setup rather than a reduction to a fitted quantity, self-defined parameter, or self-citation chain. The generalization to Gaussian and discrete cases emerges from the factorization assumption without requiring the target certificate as an input. No load-bearing steps match the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the applicability of the Neyman-Pearson lemma to the joint worst-case problem under factorized noise; the only free parameters are the per-modality smoothing noise scales, and no invented entities appear in the abstract.

free parameters (1)
  • discrete and continuous noise scales
    Smoothing noise levels for each modality are chosen parameters that define the certificate radius.
axioms (1)
  • domain assumption Neyman-Pearson lemma applies directly to the joint likelihood ratio test for heterogeneous perturbations
    Invoked to obtain the closed-form one-dimensional certificate.

pith-pipeline@v0.9.0 · 5458 in / 1127 out tokens · 27865 ms · 2026-05-14T20:19:40.652530+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

22 extracted references · 14 canonical work pages · 4 internal anchors

  1. [1]

    Are Aligned Neural Networks Adversarially Aligned?

    Carlini, N., Nasr, M., Choquette-Choo, C. A., Jagielski, M., Gao, I., Awadalla, A., Koh, P. W., Ippolito, D., Lee, K., Tramèr, F., and Schmidt, L. Are aligned neural networks adversarially aligned? In Proceedings of the 37th Conference on Neural Information Processing Systems (NeurIPS 2023), 2023a. Carlini, N., Tramèr, F., Dvijotham, K. D., Rice, L., Su…

  2. [3]

    A Survey on Multimodal Large Language Models for Autonomous Driving

    Cui, C., Ma, Y., Cao, X., Ye, W., Zhou, Y., Liang, K., Chen, J., Lu, J., Yang, Z., Liao, K.-D., Gao, T., Li, E., Tang, K., Cao, Z., Zhou, T., Liu, A., Yan, X., Mei, S., Cao, J., Wang, Z., and Zheng, C. A Survey on Multimodal Large Language Models for Autonomous Driving. In 2024 IEEE/CVF Winter Conference on Applications of Computer Vision Workshops (WA…

  3. [4]

    Defeating Prompt Injections by Design

    Debenedetti, E., Shumailov, I., Fan, T., Hayes, J., Carlini, N., Fabian, D., Kern, C., Shi, C., Terzis, A., and Tramèr, F. Defeating prompt injections by design. arXiv preprint arXiv:2503.18813.

  4. [5]

    Adversarial Attacks to Multi-Modal Models

    Dou, Z., Hu, X., Yang, H., Liu, Z., and Fang, M. Adversarial attacks to multi-modal models. arXiv preprint arXiv:2409.06793.

  5. [6]

    Llavaguard: An Open VLM-Based Framework for Safeguarding Vision Datasets and Models

    Helff, L., Friedrich, F., Brack, M., Kersting, K., and Schramowski, P. Llavaguard: An open VLM-based framework for safeguarding vision datasets and models. arXiv preprint arXiv:2406.05113.

  6. [7]

    Commit: Certifying Robustness of Multi-Sensor Fusion Systems Against Semantic Attacks

    Huang, Z., Chu, W., Li, L., Xu, C., and Li, B. Commit: Certifying robustness of multi-sensor fusion systems against semantic attacks. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence (AAAI-25).

  7. [8]

    Prompt Injection Attacks in Defended Systems

    Khomsky, D., Maloyan, N., and Nutfullin, B. Prompt injection attacks in defended systems. arXiv preprint arXiv:2406.14048.

  8. [9]

    Certified robustness to adversarial examples with differential privacy

    Lecuyer, M., Atlidakis, V., Geambasu, R., Hsu, D., and Jana, S. Certified robustness to adversarial examples with differential privacy. In Proceedings of the 2019 IEEE Symposium on Security and Privacy, pp. 656–672. IEEE.

  9. [10]

    Second-Order Adversarial Attack and Certifiable Robustness

    Li, B., Chen, C., Wang, W., and Carin, L. Second-order adversarial attack and certifiable robustness. arXiv preprint arXiv:1809.03113.

  10. [11]

    MM-SafetyBench: A benchmark for safety evaluation of multimodal large language models

    Liu, X., Zhu, Y., Gu, J., Lan, Y., Yang, C., and Qiao, Y. MM-SafetyBench: A benchmark for safety evaluation of multimodal large language models. In Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., and Varol, G. (eds.), Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LV…

  11. [12]

    doi: 10.1007/978-3-031-72992-8

  12. [13]

    Discrete Diffusion Modeling by Estimating the Ratios of the Data Distribution

    Lou, A., Meng, C., and Ermon, S. Discrete diffusion language modeling by estimating the ratios of the data distribution. arXiv preprint arXiv:2310.16834.

  13. [14]

    Randomized smoothing meets vision-language models

    Seferis, E., Wu, C., Kollias, S., Bensalem, S., and Cheng, C.-H. Randomized smoothing meets vision-language models. In Christodoulopoulos, C., Chakraborty, T., Rose, C., and Peng, V. (eds.), Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing.

  14. [15]

    Membership Inference Attacks on Large-Scale Models: A Survey

    Wu, H. and Cao, Y. Membership inference attacks on large-scale models: A survey. arXiv preprint arXiv:2503.19338.

  15. [16]

    VLAttack: Multimodal Adversarial Attacks on Vision-Language Tasks via Pre-trained Models

    Yin, Z., Ye, M., Zhang, T., Du, T., Zhu, J., Liu, H., Chen, J., Wang, T., and Ma, F. VLAttack: Multimodal adversarial attacks on vision-language tasks via pre-trained models. In Advances in Neural Information Processing Systems.

  16. [17]

    Yong, Z.-X., Menghini, C., and Bach, S. H. Low-resource languages jailbreak GPT-4. arXiv preprint arXiv:2310.02446.

  17. [18]

    Adaptive Attacks Break Defenses Against Indirect Prompt Injection Attacks on LLM Agents

    Zhan, Q., Fang, R., Panchal, H. S., and Kang, D. Adaptive attacks break defenses against indirect prompt injection attacks on LLM agents. In Findings of the Association for Computational Linguistics: NAACL.

  18. [19]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Zou, A., Wang, Z., Carlini, N., Nasr, M., Kolter, J. Z., and Fredrikson, M. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043.

  19. [20]

    doi: 10.48550/arXiv.2307.15043. A. Appendix. A.1. Choice of Threshold τ. Randomized smoothing certifies robustness by bounding the smoothed classifier's output under worst-case perturbations. For a binary classifier f: X → {0,1} and its smoothed version g(x) = E_{z∼p(·|x)}[f(z)], the certified radius at…

  20. [21]

    Finally, since γ1(Z1) ≥ 0 is independent of Yr, the map y ↦ φ(ay) is convex for every fixed a ≥ 0

    Therefore, for all convex φ, E[φ(Yr1)] ≤ E[φ(Yr2)] and E[Yr1] = E[Yr2]. Finally, since γ1(Z1) ≥ 0 is independent of Yr, the map y ↦ φ(ay) is convex for every fixed a ≥ 0. Conditioning on a = γ1(Z1) yields E[φ(γ1(Z1)Yr1)] ≤ E[φ(γ1(Z1)Yr2)]. Using E[Yr1] = E[Yr2] = 1, this implies Γr1 ≤cx Γr2 and E[Γr1] = E[Γr2] = 1. (6) 4.4 Convex Order Implies Monotonicity of Lower Pa…

  21. [22]

    For the image channel, we perform ℓ2-constrained projected gradient descent steps under radius ϵ (Madry et al., 2018)

    under a discrete budget d. For the image channel, we perform ℓ2-constrained projected gradient descent steps under radius ϵ (Madry et al., 2018). At each iteration, one modality is updated while the other is held fixed, yielding a block-coordinate ascent scheme on the estimated smoothed objective. Algorithm 1 summarizes the procedure. Relation to Prior Wo...

  22. [23]

    Look how many people love you

    The resulting curves show that even small increases in image perturbation strength can rapidly erode certified coverage once prompt-injection attacks are permitted. Compared to the image-only setting (d = 0), certified accuracy under joint certification degrades substantially faster, despite identical image smoothing parameters. This behavior reflects the…