pith. machine review for the scientific record.

arxiv: 2602.00616 · v3 · submitted 2026-01-31 · 💻 cs.AI

Recognition: 2 theorem links

· Lean Theorem

SPOT: Selective Prompt Projection via Total Variation for Inference-Only Safe Text-to-Image Generation

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:05 UTC · model grok-4.3

classification 💻 cs.AI
keywords: text-to-image generation, safe generation, inference-time intervention, total variation distance, prompt projection, diffusion models, safety alignment, unsupervised safety

The pith

SPOT projects unsafe prompts onto nearby tau-safe alternatives at inference time to bound risk changes via total variation distance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that safety interventions in frozen text-to-image diffusion models can be achieved by projecting prompts toward a tau-safe set, where tau caps the reference risk score, while using total variation distance to control how much the expected risk can shift from the model's original prompt-conditioned distribution. This matters because it directly tackles the Safety-Prompt Alignment Tradeoff, allowing reductions in unsafe outputs without retraining the generator or deviating far on benign prompts. SPOT approximates the projection by having an LLM rank candidate rewrites and a VLM accept only those whose generated images stay under the tau threshold. Results across four datasets and three backbones show relative inappropriate score drops of 14.2 to 44.4 percent compared to strong baselines, with benign behavior kept close to the fixed reference.

Core claim

By defining the tau-safe set as prompts whose reference risk is at most tau and casting intervention as projection toward nearby prompts in this set, SPOT uses total variation distance on the prompt-conditioned distribution to bound the change in expected risk score, then approximates this projection at inference time with an LLM ranking rewrites and a VLM verifying generated images under the same tau, all without retraining the generator.
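The inequality this claim leans on is the standard total-variation bound for bounded functionals. In notation reconstructed from the abstract (our rendering, not the paper's equations): let r be the risk score with values in [0, 1], and let p and q be the prompt-conditioned image distributions before and after projection. Then:

```latex
% r : X -> [0,1] is the bounded risk score; p, q are the prompt-conditioned
% image distributions before and after the projection.
\left|\,\mathbb{E}_{x\sim q}[r(x)] - \mathbb{E}_{x\sim p}[r(x)]\,\right|
  \;\le\; \mathrm{TV}(p, q)
  \;=\; \sup_{A}\,\lvert p(A) - q(A)\rvert
  \;=\; \tfrac{1}{2}\,\lVert p - q \rVert_{1}.
```

Keeping the projected prompt's conditional distribution inside a small TV ball of the reference therefore caps how far any bounded expected risk, including the IP score, can move.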

What carries the argument

The tau-safe set of prompts with reference risk at most tau, with projection approximated via LLM-ranked candidate rewrites and VLM acceptance under total variation bounds on distributional deviation.
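A minimal sketch of the two-stage loop this describes. Every component here (rewrite proposal, LLM ranking, generation, VLM risk scoring) is a hypothetical stand-in passed as a callable, not the paper's implementation:

```python
# Sketch of SPOT-style selective projection. All callables are hypothetical
# stand-ins; the real system uses an LLM ranker, a frozen T2I generator,
# and a safeguard VLM.

def spot_project(prompt, tau, propose_rewrites, llm_rank, generate, vlm_risk):
    """Return the first ranked rewrite whose generated image scores <= tau."""
    candidates = propose_rewrites(prompt)          # nearby candidate rewrites
    for cand in llm_rank(prompt, candidates):      # best-first LLM ordering
        image = generate(cand)                     # frozen generator
        if vlm_risk(image) <= tau:                 # Stage-2 VLM acceptance
            return cand, image
    return None, None                              # no tau-safe rewrite found

# Toy stubs: "risk" is 1.0 iff the flagged token survives into the "image".
rewrites = lambda p: [p.replace("unsafe", "safe"), p]
rank = lambda p, cs: sorted(cs, key=lambda c: c.count("unsafe"))
gen = lambda c: c                                  # identity "generator"
risk = lambda img: 1.0 if "unsafe" in img else 0.0

cand, img = spot_project("unsafe scene", 0.1, rewrites, rank, gen, risk)
```

Note the selectivity the review emphasizes: a benign prompt is its own best-ranked candidate, passes the check immediately, and leaves the conditional distribution untouched.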

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The selective projection template could apply to safety interventions in other generative domains such as audio or video where similar distributional bounds on risk metrics exist.
  • Hybrid pipelines that apply SPOT after an initial fine-tuning stage might achieve compounded safety gains while still avoiding full retraining for every new model.
  • Edge-case testing near the tau boundary could quantify how sensitive the bound is to prompt perturbations that the LLM-VLM step might miss.
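The third extension is cheap to pilot. A toy harness (all scorers are hypothetical stubs, not the paper's models) that measures how often a small prompt perturbation flips the tau-decision near the boundary:

```python
# Toy boundary-sensitivity harness. `risk` is a hypothetical prompt-level
# risk scorer; here it is a deterministic stub so the flip count is exact.

def flip_rate(prompts, risk, perturb, tau):
    """Fraction of prompts whose tau-safety verdict flips under one perturbation."""
    flips = 0
    for p in prompts:
        before = risk(p) <= tau          # tau-safe before perturbation?
        after = risk(perturb(p)) <= tau  # and after?
        flips += before != after
    return flips / len(prompts)

# Stubs: risk = fraction of flagged tokens; perturbation appends one token.
risk = lambda p: p.split().count("bad") / max(len(p.split()), 1)
perturb = lambda p: p + " bad"

prompts = ["a calm scene", "bad bad day", "one bad word here"]
rate = flip_rate(prompts, risk, perturb, tau=0.3)
```

A real version would perturb with paraphrases or adversarial suffixes and use the reference risk scorer, concentrating samples just above and below tau.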

Load-bearing premise

Total variation distance on the prompt-conditioned distribution gives a tight enough bound on change in expected risk score, and the LLM-plus-VLM approximation preserves this bound in practice.

What would settle it

Measuring the actual change in expected inappropriate score after SPOT projection on a held-out prompt set and checking whether it exceeds the total variation bound by more than the observed approximation error would falsify the central guarantee.
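On a finite support this falsification test is mechanical to run. A toy numerical version (synthetic distributions, not the paper's data) confirms the direction of the check:

```python
import numpy as np

# Toy check: for risk scores in [0, 1], the shift in expected risk can
# never exceed the total variation distance between the two distributions.
rng = np.random.default_rng(0)

def tv(p, q):
    """Total variation distance between two discrete distributions."""
    return 0.5 * float(np.abs(np.asarray(p) - np.asarray(q)).sum())

for _ in range(1000):
    p = rng.dirichlet(np.ones(8))        # reference distribution
    q = rng.dirichlet(np.ones(8))        # post-projection distribution
    r = rng.uniform(0.0, 1.0, size=8)    # bounded per-outcome risk score
    shift = abs(float(r @ q) - float(r @ p))
    assert shift <= tv(p, q) + 1e-12     # the TV guarantee holds
```

The empirical test proposed above is the sampled analogue: estimate both sides from generated images on held-out prompts and flag any violation larger than the measured approximation error.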

Figures

Figures reproduced from arXiv: 2602.00616 by Hyekyung Yoon, Minhyuk Lee, Myungjoo Kang.

Figure 1
Figure 1. (a) τ = 0.1, (b) τ = 0.3, (c) τ = 0.5, (d) τ = 0.7, and (e) τ = 0.9. As τ increases, the levels of sexuality and violence in the generated images increase. Utility on benign prompts (FID/CLIP on COCO): on benign COCO captions, our method preserves image quality and text–image alignment close to the no-alignment reference across all backbones (e.g., SD1.5: 32.46/33.36 vs. 32.34/33.42 in FID/CLIP) …
Figure 3
Figure 3. SPAT diagnostic. IP scores versus FID-to-reference proxy on COCO-safe prompts for SD1.5/SD2.1/SDXL. Colors denote datasets (UD/I2P/CoProV2), markers denote methods, and dashed lines are per-dataset linear fits, showing a consistent negative trend. (Accompanying panels plot Linear Discriminant Directions 1 and 2 for Original, POSI, and Ours.)
Figure 4
Figure 4. Projection diagnostics. (a) COCO: our centroid shift is smaller than POSI's (cf. (11)). (b) CoProV2: centroid drift saturates beyond R = 2 and re-run fixed-point ratios increase (cf. (12)). … the clean setting; however, the total IP score rises only from 0.04 to at most 0.06 across all attacks. Across safety categories, IP scores remain low, with no category exhibiting a consistently dominant or isolated spike …
Figure 5
Figure 5. Stage-1 routing effectiveness at τ = 0.05 (diagnostic computed on original prompts; images are realized with SD1.5). We sort n paired samples by the prompt-only score P (ascending) and forward only the lowest-P fraction f to the Stage-2 verifier. The resulting subsets are strongly enriched for Stage-2 acceptance events (Q ≤ τ), showing that P is an effective routing signal that concentrates costly Stage-2 …
Figure 6
Figure 6. The IP score is largely insensitive to the log-probability threshold and local search iterations, while showing mild sensitivity to αsafety and the number of candidates. (Panels: IP (↓) versus number of parameters for LLaMA and Qwen.)
Figure 7
Figure 7. LLM scaling improves safety. Inappropriate percentage (IP; lower is better) on CoProV2, I2P, and UD as a function of LLM size. The LLM is used for both rewrite proposal and Stage-1 prompt scoring, while SD1.5 and the Stage-2 VLM verifier are fixed; all other hyperparameters match the default setting. Effect of αsafety: the IP score exhibits mild sensitivity to the safety coefficient. Lowering αsafety from …
Figure 8
Figure 8. Ablation study of per-image generation runtime under different hyperparameter settings, including the log-probability threshold, maximum escalation, number of candidates, and number of steps, with τ fixed at 0.05.
Figure 9
Figure 9. Additional qualitative evaluation (CoProV2). Categories: Hate, Harassment, Violence, Self-harm, Sexual, Shocking, Illegal; rows: SD 2.1, No Alignment, AlignGuard, LatentGuard, GuardT2I, Ours.
Figure 10
Figure 10. Additional qualitative evaluation (CoProV2).
Figure 11
Figure 11. Additional qualitative evaluation (CoProV2). Samples 1–7; rows: SD 1.5, No Alignment, AlignGuard, LatentGuard, GuardT2I, Ours.
Figure 12
Figure 12. Additional qualitative evaluation (COCO).
Figure 13
Figure 13. Additional qualitative evaluation (COCO). Samples 1–7; rows: SDXL, No Alignment, AlignGuard, LatentGuard, GuardT2I, Ours.
Figure 14
Figure 14. Additional qualitative evaluation (COCO).
original abstract

Text-to-Image (T2I) diffusion models enable high quality open ended synthesis, but practical use requires suppressing unsafe generations while preserving behavior on benign prompts. We study this tension relative to the frozen generator, using its prompt conditioned distribution as the preservation reference. Since T2I safety is commonly evaluated by bounded risk scores on generated images, total variation (TV) bounds how much expected risk can change from this reference. We call this fixed reference constraint the Safety-Prompt Alignment Tradeoff (SPAT): reducing expected unsafety requires prompt conditioned distributional deviation. To make this deviation selective and adjustable, we define the tau safe set as prompts whose reference risk is at most tau, and cast intervention as projection toward nearby prompts in this set. We propose Selective Prompt prOjecTion (SPOT), an inference time framework that approximates this projection without retraining the generator or learning a category specific rewriter. SPOT uses an LLM to rank candidate rewrites and a safeguard VLM to accept generated images under the same tau. Across four datasets and three diffusion backbones, SPOT achieves relative inappropriate (IP) score reductions from 14.2% to 44.4% over strong safety alignment baselines while keeping benign prompt behavior close to the fixed reference.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes SPOT, an inference-only framework for safe text-to-image generation. It uses total variation distance on the prompt-conditioned distribution of a frozen diffusion model to bound changes in expected risk score, formalizing this as the Safety-Prompt Alignment Tradeoff (SPAT). Intervention is cast as projection onto a tau-safe prompt set, approximated at inference time by an LLM ranking candidate rewrites followed by VLM acceptance of generated images. Across four datasets and three diffusion backbones, SPOT reports relative inappropriate (IP) score reductions of 14.2% to 44.4% versus strong safety baselines while keeping benign-prompt behavior close to the fixed reference.

Significance. If the TV bound is preserved under the LLM+VLM approximation, the work supplies a principled, training-free mechanism for controlling the safety-fidelity tradeoff in T2I models. The approach is notable for its explicit grounding in a distributional inequality rather than heuristic filtering, and for demonstrating consistent gains across multiple backbones without retraining.

major comments (3)
  1. [Abstract and SPOT method description] The central SPAT claim (TV distance bounds change in expected risk for bounded risk functions) is standard, but the manuscript provides no derivation or measurement showing that the LLM-ranked rewrites plus VLM filter keep the prompt-conditioned distribution within a small TV ball of the original reference. Without this verification, the reported IP reductions cannot be confidently attributed to the TV-motivated projection rather than to the external LLM/VLM components acting as an independent safety gate.
  2. [Experiments section] Experimental results lack ablations isolating the contribution of the projection approximation (e.g., LLM ranking alone vs. full VLM acceptance) and report no error bars or statistical significance tests on the 14.2–44.4 % IP reductions. This makes it impossible to assess whether the gains are robust or sensitive to the specific tau choice and candidate generation procedure.
  3. [SPAT and tau-safe set definition] The definition of the tau-safe set and the procedure for selecting tau across datasets are not specified with sufficient detail to reproduce the projection step or to confirm that the selected prompts remain close in TV to the reference distribution.
minor comments (2)
  1. [Preliminaries] Notation for the risk score and the TV bound should be introduced with explicit equations rather than prose only.
  2. [Related work] The manuscript should cite prior work on prompt rewriting and inference-time safety filters to clarify the incremental contribution of the TV-based selection.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment point by point below, clarifying our approach where possible and outlining specific revisions to the manuscript.

point-by-point responses
  1. Referee: The central SPAT claim (TV distance bounds change in expected risk for bounded risk functions) is standard, but the manuscript provides no derivation or measurement showing that the LLM-ranked rewrites plus VLM filter keep the prompt-conditioned distribution within a small TV ball of the original reference. Without this verification, the reported IP reductions cannot be confidently attributed to the TV-motivated projection rather than to the external LLM/VLM components acting as an independent safety gate.

    Authors: We agree that the manuscript does not provide an empirical measurement verifying that the LLM+VLM approximation maintains the prompt-conditioned distribution within a small TV ball of the reference. The SPAT bound itself follows from the standard property of total variation for bounded risk functions, but we acknowledge the gap in validating the approximation quality. We will add a new analysis section with empirical TV distance measurements between original and projected distributions on a held-out prompt set to better attribute the observed IP reductions. revision: yes

  2. Referee: Experimental results lack ablations isolating the contribution of the projection approximation (e.g., LLM ranking alone vs. full VLM acceptance) and report no error bars or statistical significance tests on the 14.2–44.4 % IP reductions. This makes it impossible to assess whether the gains are robust or sensitive to the specific tau choice and candidate generation procedure.

    Authors: We agree that the current experiments would benefit from additional ablations and statistical analysis. We will incorporate new ablation studies comparing LLM ranking alone against the full SPOT pipeline (including VLM acceptance), report error bars over multiple runs, and include statistical significance tests (such as paired t-tests) on the IP score reductions to assess robustness with respect to tau and candidate generation. revision: yes

  3. Referee: The definition of the tau-safe set and the procedure for selecting tau across datasets are not specified with sufficient detail to reproduce the projection step or to confirm that the selected prompts remain close in TV to the reference distribution.

    Authors: We will expand the method section to provide a precise definition of the tau-safe set as the collection of prompts whose expected risk score under the reference distribution is at most tau. We will also detail the tau selection procedure, which is based on the empirical risk score distribution over benign prompts per dataset, and include pseudocode for the projection step to support reproducibility and confirm proximity in TV distance. revision: yes
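One plausible instantiation of that per-dataset rule (our assumption; the rebuttal only says tau comes from the empirical benign risk distribution, not which statistic) is a quantile of benign-prompt risk scores:

```python
import numpy as np

def select_tau(benign_risk_scores, quantile=0.95):
    """Hypothetical tau selection: pick tau so that `quantile` of benign
    prompts are already tau-safe under the reference risk scorer."""
    scores = np.asarray(benign_risk_scores, dtype=float)
    return float(np.quantile(scores, quantile))

# e.g. benign scores clustered near zero with a small unsafe tail
scores = [0.01, 0.02, 0.02, 0.03, 0.05, 0.40]
tau = select_tau(scores, quantile=0.5)   # median of the benign scores
```

A quantile rule makes tau directly interpretable: it is the fraction of benign prompts guaranteed to pass the tau-safe check unmodified.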

Circularity Check

0 steps flagged

No significant circularity: standard TV inequality applied to external risk scores with empirical results

full rationale

The paper's core derivation applies the standard total variation inequality to bound changes in expected risk for any bounded risk function, which is a general mathematical fact independent of the paper's data or method. SPOT is defined as an inference-time approximation to projection onto the tau-safe set using external LLM ranking and VLM filtering, with performance measured empirically on held-out datasets against frozen reference generators. No equation or claim reduces a reported outcome to a parameter fitted on the same success metric, no self-citation chain bears the central load, and the SPAT tradeoff follows directly from the inequality rather than being smuggled in by definition or renaming. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the TV inequality bounding risk change and on the existence of an effective LLM/VLM approximation to the projection; no free parameters or new entities are introduced in the abstract.

axioms (1)
  • domain assumption Total variation distance bounds the change in expected risk score between prompt-conditioned distributions
    Invoked to define the Safety-Prompt Alignment Tradeoff (SPAT)

pith-pipeline@v0.9.0 · 5535 in / 1149 out tokens · 30507 ms · 2026-05-16T09:05:17.692663+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages · 4 internal anchors

  1. [1]

    Typology of risks of generative text-to-image models

Bird, C., Ungless, E., and Kasirzadeh, A. Typology of risks of generative text-to-image models. In Proceedings of the 2023 AAAI/ACM Conference on AI, Ethics, and Society, pp. 396–410.

  2. [2]

    Acquiescence bias in large language models

Braun, D. Acquiescence bias in large language models. In Findings of the Association for Computational Linguistics: EMNLP 2025, pp. 11341–11355.

  3. [3]

Language models are few-shot learners

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33: 1877–1901.

  4. [4]

    SimCSE: Simple Contrastive Learning of Sentence Embeddings

Gao, T., Yao, X., and Chen, D. SimCSE: Simple contrastive learning of sentence embeddings. arXiv preprint arXiv:2104.08821.

  5. [5]

    Clipscore: A reference-free evaluation metric for image captioning

Hessel, J., Holtzman, A., Forbes, M., Le Bras, R., and Choi, Y. CLIPScore: A reference-free evaluation metric for image captioning. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7514–7528.

  6. [6]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., et al. Llama Guard: LLM-based input-output safeguard for human-AI conversations. arXiv preprint arXiv:2312.06674.

  7. [7]

    Unsupervised Dense Information Retrieval with Contrastive Learning

Izacard, G., Caron, M., Hosseini, L., Riedel, S., Bojanowski, P., Joulin, A., and Grave, E. Unsupervised dense information retrieval with contrastive learning. arXiv preprint arXiv:2112.09118.

  8. [8]

Safeguarding text-to-image generation via inference-time prompt-noise optimization

    Peng, J., Tang, Z., Liu, G., Fleming, C., and Hong, M. Safeguarding text-to-image generation via inference-time prompt-noise optimization. arXiv preprint arXiv:2412.03876.

  9. [9]

SDXL: Improving latent diffusion models for high-resolution image synthesis

    Podell, D., English, Z., Lacey, K., Blattmann, A., Dockhorn, T., Müller, J., Penna, J., and Rombach, R. SDXL: Improving latent diffusion models for high-resolution image synthesis. In The Twelfth International Conference on Learning Representations. Qu, Y., Shen, X., He, X., Backes, M., Zannettou, S., and Zhang, Y. Unsafe diffusion: On the generation...

  10. [10]

Schramowski, P., Tauchmann, C., and Kersting, K. Can machines help us answering question 16 in datasheets, and in turn reflecting on inappropriate content? In Proceedings of the 2022 ACM Conference on Fairness, Accountability, and Transparency, pp. 1350–1361.

  11. [11]

Ring-A-Bell! How reliable are concept removal methods for diffusion models?

    Tsai, Y.-L., Hsu, C.-Y., Xie, C., Lin, C.-H., Chen, J.-Y., Li, B., Chen, P.-Y., Yu, C.-M., and Huang, C.-Y. Ring-A-Bell! How reliable are concept removal methods for diffusion models? arXiv preprint arXiv:2310.10012.

  12. [12]

Universal prompt optimizer for safe text-to-image generation

    Wu, Z., Gao, H., Wang, Y., Zhang, X., and Wang, S. Universal prompt optimizer for safe text-to-image generation. arXiv preprint arXiv:2402.10882.

  13. [13]

    Mma-diffusion: Multimodal attack on diffusion models

Yang, Y., Gao, R., Wang, X., Ho, T.-Y., Xu, N., and Xu, Q. MMA-Diffusion: Multimodal attack on diffusion models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 7737–7746, 2024a. Yang, Y., Gao, R., Yang, X., Zhong, J., and Xu, Q. GuardT2I: Defending text-to-image models from adversarial prompts. Advances in n...

  14. [14]

    PromptGuard: Soft Prompt-Guided Unsafe Content Moderation for Text-to-Image Models

Yuan, L., Li, X., Xu, C., Tao, G., Jia, X., Huang, Y., Dong, W., Liu, Y., and Li, B. PromptGuard: Soft prompt-guided unsafe content moderation for text-to-image models. arXiv preprint arXiv:2501.03544.

  15. [15]

Value-aligned prompt moderation via zero-shot agentic rewriting for safe image generation

    Zhao, X., Chen, X., Liu, B., Liu, Z., Zhao, Z., and Gu, X. Value-aligned prompt moderation via zero-shot agentic rewriting for safe image generation. arXiv preprint arXiv:2511.11693.

  16. [16]

We provide (i) total variation bounds for bounded functionals, (ii) existence and measurable selection for τ-safe minimal-edit projections under the angular distance, and (iii) kernelized SPAT. A.1. Basic objects: spaces, probability measures, and kernels. Let (X, B(X)) be a measurable space. We write P(X) for the set of probability measures on (X, B(X)). Markov ...

  17. [17]

Increasing the candidate pool from 4 to 64 is associated with an increase in the IP score from 0.03 to 0.04, while the overall magnitude of change remains small under low-τ settings

Effect of number of candidates. The IP score changes with the number of candidates. Increasing the candidate pool from 4 to 64 is associated with an increase in the IP score from 0.03 to 0.04, while the overall magnitude of change remains small under low-τ settings. Effect of local search iterations. Increasing the number of local search iterations results i...

  18. [18]

    In parallel, guard-based models appear to effectively detect unsafe content; however, they suffer from a high rate of false positives

    and SDXL (samples 5 and 6). In parallel, guard-based models appear to effectively detect unsafe content; however, they suffer from a high rate of false positives. As discussed in Paragraph 5.2, these models tend to overzealously flag benign content as unsafe on the COCO dataset. Consequently, both LatentGuard and GuardT2I frequently generate black images ...