pith. machine review for the scientific record.

arxiv: 2605.00699 · v3 · submitted 2026-05-01 · 💻 cs.CR

Recognition: unknown

STARE: Step-wise Temporal Alignment and Red-teaming Engine for Multi-modal Toxicity Attack

Cong Wang, Hongying Zan, Liangjie Zhao, Tao Liu, Xiang Zheng, Xutao Mao

Pith reviewed 2026-05-09 19:12 UTC · model grok-4.3

classification 💻 cs.CR
keywords red-teaming · vision-language models · toxicity attack · denoising trajectory · phase alignment · reinforcement learning · adversarial optimization · multi-modal safety

The pith

A hierarchical reinforcement learning attack on the image denoising trajectory boosts attack success by 68 percent and reveals that optimization concentrates toxicity into early conceptual and late detail phases.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces STARE, a framework that attacks vision-language models by editing prompts at a high level while fine-tuning the text-to-image generator at a low level through group relative policy optimization. This treats each step of the denoising process as a potential point of intervention rather than waiting for the final image. The result is substantially higher rates of eliciting toxic outputs compared with earlier black-box and white-box methods. The same trajectory-level attack also exposes a consistent pattern: ordinary models spread toxicity across many steps, but optimized attacks push conceptual harms forward and detail harms backward. Because these phases can be perturbed independently, different categories of toxicity become selectively controllable.
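
As a concrete picture of that two-level structure, here is a minimal sketch of one attack round. All callables (edit_prompt, sample_trajectory, vlm_continue, toxicity_score, grpo_update) are hypothetical stand-ins passed in by the caller, not the paper's actual interfaces.

```python
# Hedged sketch of the hierarchical attack loop described above (not the paper's code).
def stare_attack_round(prompt, edit_prompt, sample_trajectory, vlm_continue,
                       toxicity_score, grpo_update, n_rollouts=8):
    # High level: the prompt editor proposes a more adversarial subgoal.
    subgoal = edit_prompt(prompt)

    # Low level: each rollout is a full denoising trajectory, so every step
    # of the trajectory is a potential point of intervention.
    rollouts = [sample_trajectory(subgoal) for _ in range(n_rollouts)]

    # Reward: toxicity of the black-box VLM's continuation on the final image.
    rewards = [toxicity_score(vlm_continue(r.final_image, subgoal)) for r in rollouts]

    # Group-relative policy update over the whole trajectory (see the GRPO sketch below).
    grpo_update(rollouts, rewards)
    return subgoal, rewards
```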

Core claim

STARE couples a prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization under white-box T2I and black-box VLM access. It shows that adversarial optimization produces Optimization-Induced Phase Alignment: conceptual toxicity concentrates in early semantic phases while detail-oriented toxicity concentrates in late refinement phases, converting an otherwise diffuse process into a small number of predictable vulnerability windows.

What carries the argument

The hierarchical RL loop with GRPO that directly optimizes over the full denoising trajectory, turning the temporal sequence of generation steps into the primary attack surface.
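
For readers unfamiliar with GRPO, one standard form of the group-relative update looks roughly like the following. It assumes the low-level objective averages a clipped per-step surrogate and a KL penalty toward the reference T2I policy over the T denoising steps; the paper's exact loss may differ.

```python
import torch

def grpo_loss(logp_new, logp_old, logp_ref, rewards, clip_eps=0.2, beta=0.04):
    """Group-relative policy loss over M denoising trajectories of T steps each.

    logp_new / logp_old / logp_ref: (M, T) per-step log-probs under the current,
    behavior, and reference T2I policies; rewards: (M,) terminal toxicity scores.
    A generic GRPO sketch, not the paper's exact objective.
    """
    # Group-relative advantage: normalize terminal rewards within the rollout group.
    adv = (rewards - rewards.mean()) / (rewards.std() + 1e-8)   # (M,)
    adv = adv.unsqueeze(1)                                      # broadcast over the T steps

    # Clipped importance-weighted surrogate, applied at every denoising step.
    ratio = torch.exp(logp_new - logp_old)                      # (M, T)
    surrogate = torch.minimum(ratio * adv,
                              torch.clamp(ratio, 1.0 - clip_eps, 1.0 + clip_eps) * adv)

    # Per-step KL penalty toward the reference (pre-attack) T2I policy,
    # using the common low-variance estimator exp(d) - d - 1 with d = logp_ref - logp_new.
    d = logp_ref - logp_new
    kl = torch.exp(d) - d - 1.0

    # Maximize the reward-weighted surrogate, penalize drift from the reference model.
    return -(surrogate - beta * kl).mean()
```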

If this is right

  • Targeted noise added only to early denoising steps suppresses conceptual toxicity while leaving detail-oriented toxicity largely intact (a perturbation of this kind is sketched after this list).
  • Perturbations limited to late steps affect detail harms without strongly impacting early conceptual content.
  • Toxicity formation shifts from an unpredictable diffuse pattern to a small set of controllable temporal windows.
  • Phase-aware monitoring or intervention during generation becomes a practical route to safety mechanisms.
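
A minimal sketch of the phase-targeted perturbation behind the first two bullets, written against an assumed generic denoising loop; the step windows and noise scale are illustrative, not the paper's protocol.

```python
import torch

def generate_with_phase_noise(latents, denoise_step, num_steps=50,
                              window=(0, 10), sigma=0.1):
    """Run a denoising loop, injecting extra Gaussian noise only inside `window`.

    `denoise_step(latents, t)` is a hypothetical single-step denoiser;
    window=(0, 10) targets the early "conceptual" phase, window=(40, 50)
    the late "detail" phase. Both ranges and sigma are illustrative only.
    """
    for t in range(num_steps):
        if window[0] <= t < window[1]:
            latents = latents + sigma * torch.randn_like(latents)
        latents = denoise_step(latents, t)
    return latents

# The paper's prediction: early-window noise should suppress conceptual toxicity,
# late-window noise should suppress detail-oriented harms, each leaving the other
# phase largely intact.
```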

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the phase alignment generalizes beyond the tested models, similar temporal structures could appear in video or audio generators and offer comparable handles for attack or defense.
  • Defenders might insert lightweight classifiers at only the identified early and late steps instead of evaluating every intermediate output (sketched after this list).
  • The pattern raises the question of whether non-adversarial training already embeds weak versions of these phases that could be amplified or suppressed without full red-teaming.
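
A hedged sketch of the defensive variant raised in the second bullet: probe intermediate latents only at a handful of early and late steps rather than at every intermediate output. Every name and numeric value here is a hypothetical stand-in.

```python
def phase_aware_monitor(latents_by_step, decode_preview, toxicity_classifier,
                        early_steps=(2, 5, 8), late_steps=(42, 46, 49),
                        threshold=0.5):
    """Score intermediate latents only at the identified vulnerability windows.

    latents_by_step: mapping from denoising step index to latent tensor;
    decode_preview and toxicity_classifier are hypothetical stand-ins for a
    cheap latent decoder and an image toxicity scorer. Step indices and the
    threshold are illustrative, not values from the paper.
    """
    for step in (*early_steps, *late_steps):
        if step not in latents_by_step:
            continue
        preview = decode_preview(latents_by_step[step])
        if toxicity_classifier(preview) > threshold:
            return False, step   # flag the generation, reporting the offending window
    return True, None            # no monitored window exceeded the threshold
```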

Load-bearing premise

The observed concentration of toxicity into specific early and late phases is produced by the adversarial optimization itself rather than by the hierarchical design, the GRPO algorithm, or the particular toxicity classifiers and models used.

What would settle it

Apply the same GRPO-based optimization to a non-hierarchical single-level attack method on the identical T2I model and check whether the early-conceptual and late-detail toxicity split disappears.
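
One way such a check could be scored, sketched here with an illustrative phase-concentration metric rather than the paper's own alignment measure; the 20% window sizes are assumptions.

```python
import numpy as np

def phase_concentration(step_toxicity, early_frac=0.2, late_frac=0.2):
    """Fraction of per-step toxicity contribution falling in the early and late windows.

    step_toxicity: 1-D array of per-step toxicity contributions along a denoising
    trajectory. Window sizes are illustrative, not the paper's phase definition.
    """
    t = np.asarray(step_toxicity, dtype=float)
    total = t.sum() + 1e-12
    n = len(t)
    early = t[: int(n * early_frac)].sum() / total
    late = t[int(n * (1.0 - late_frac)):].sum() / total
    return early, late

# The proposed test: if a flat, single-level GRPO attack on the same T2I model yields
# (early, late) values close to the vanilla model's diffuse profile while the
# hierarchical attack does not, the alignment is tied to the hierarchy; if both
# attacks concentrate the mass, the alignment is optimization-induced.
```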

Figures

Figures reproduced from arXiv: 2605.00699 by Cong Wang, Hongying Zan, Liangjie Zhao, Tao Liu, Xiang Zheng, Xutao Mao.

Figure 1: Motivation for temporal alignment analysis. Existing red-teaming methods can craft toxic continuations but cannot explain how toxic semantics form during T2I generation. STARE addresses this gap by performing step-wise temporal alignment, attributing the final toxicity back to specific steps. The early panel (left) shows the conceptual seeding phase, where identity- and threat-related cues are…
Figure 2: Overview of the STARE framework. The high-level module edits an input prompt to generate a more adversarial subgoal. This subgoal is then passed to the low-level module, which uses GRPO to fine-tune the T2I model to maximize the final toxicity score while maintaining prompt alignment.
Figure 3: Temporal Alignment Analysis Framework. The diagnostic method identifies phase-specific contributions to toxicity in four steps: (1) computing net toxicity scores, (2) timestep perturbation via coarse-to-fine search, (3) multi-level Monte Carlo estimation, and (4) visualizing the T × D temporal alignment map.
Figure 5: Generative process comparison. STARE (top) performs Toxic Concept Inpaint via prompt editing at early stages, then Detail Amplification at late stages. The vanilla model (bottom) follows the standard semantic-to-detail progression but remains diffuse.
Figure 4: Temporal Alignment Heatmaps. Adversarial optimization transforms toxicity formation: (d) the vanilla model shows diffuse, weakly structured toxicity; (a) our full STARE framework exploits and amplifies this, inducing a clear temporal alignment; ablations (b, c) show the alignment emerges from the synergy of both components.
Figure 6: Qualitative Comparison. Comparison of image outputs from vanilla SD and our attack framework based on edited prompts. The visually distinct outputs from our framework provide context that enables more severe toxic continuations.
Figure 7: Noise level analysis during training (reward versus training steps for denoise steps 10, 20, and 40).
Figure 9: KL divergence analysis during training.
Figure 10: Additional visualization (1) of STARE versus baseline on the RTP dataset.
Figure 11: Additional visualization (2) of STARE versus baseline on the RTP dataset.
Figure 12: Visualization of STARE versus baseline on the PTP dataset.
read the original abstract

Red-teaming Vision-Language Models is essential for identifying vulnerabilities where adversarial image-text inputs trigger toxic outputs. Existing approaches treat image generation as a black box, returning only terminal toxicity scores and leaving open the question of when and how toxic semantics emerge during multi-step synthesis. We introduce STARE, a hierarchical reinforcement learning framework that treats the denoising trajectory itself as the attack surface, under a direct white-box T2I and query-only black-box VLM setting. By coupling a high-level prompt editor with low-level T2I fine-tuning via Group Relative Policy Optimization (GRPO), STARE attains a 68% improvement in Attack Success Rate over state-of-the-art black-box and white-box baselines. More importantly, this trajectory-level view surfaces the Optimization-Induced Phase Alignment phenomenon: vanilla models exhibit diffuse toxicity, whereas adversarial optimization concentrates conceptual harms into early semantic phases and detail-oriented harms into late refinement. Targeted perturbations of either window selectively suppress different toxicity categories, indicating that this temporal structure is a genuine causal handle rather than a side effect of the hierarchical design. The phenomenon turns toxicity formation from a chaotic process into a small set of predictable vulnerability windows, providing both a potent attack engine and a basis for phase-aware safety mechanisms. Content warning: This paper contains examples of toxic content that may be offensive or disturbing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces STARE, a hierarchical reinforcement learning framework for red-teaming vision-language models by treating the denoising trajectory of text-to-image generation as the attack surface. Under a white-box T2I and black-box VLM setting, it couples high-level prompt editing with low-level fine-tuning via Group Relative Policy Optimization (GRPO) and reports a 68% improvement in Attack Success Rate over existing baselines. The work also claims to surface an 'Optimization-Induced Phase Alignment' phenomenon in which adversarial optimization concentrates conceptual toxicity into early semantic phases and detail-oriented harms into late refinement phases, with targeted perturbations of these windows selectively suppressing different toxicity categories.

Significance. If the reported ASR gains and the causal status of the phase-alignment phenomenon hold after proper isolation, the work would provide both a stronger attack method and a new mechanistic handle on toxicity formation in multi-step generative models, potentially informing phase-aware safety interventions. The trajectory-level analysis moves beyond terminal scores, which is a useful direction for the field.

major comments (2)
  1. [Abstract and §4 (Experiments)] The central claim that the observed temporal structure constitutes 'a genuine causal handle rather than a side effect of the hierarchical design' is load-bearing for both the ASR improvement and the proposed safety mechanisms, yet no control experiments are described that decouple the hierarchy (e.g., flat GRPO without high-level prompt editing, non-hierarchical optimizers, or alternative toxicity classifiers). Without such ablations, the Optimization-Induced Phase Alignment could be an artifact of the specific GRPO implementation rather than an intrinsic property of the denoising trajectory.
  2. [§4 and Table 1 (or equivalent results table)] The headline 68% ASR improvement is presented without reported statistical tests, variance across runs, exact baseline implementations, or ablation breakdowns that isolate the contribution of phase-aware perturbations versus the hierarchical RL setup itself. This makes it impossible to assess whether the gain is robust or primarily driven by the un-isolated design choices.
minor comments (2)
  1. [Abstract] The abstract and introduction would benefit from a brief explicit statement of the precise toxicity classifiers and VLM models used, as these choices directly affect the measured phase concentrations.
  2. [Figures] Figure captions describing trajectory visualizations should include the exact denoising step ranges labeled as 'early semantic' and 'late refinement' phases for reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. The comments highlight important gaps in experimental controls and statistical reporting that we will address in revision. We respond to each major comment below.

read point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] The central claim that the observed temporal structure constitutes 'a genuine causal handle rather than a side effect of the hierarchical design' is load-bearing for both the ASR improvement and the proposed safety mechanisms, yet no control experiments are described that decouple the hierarchy (e.g., flat GRPO without high-level prompt editing, non-hierarchical optimizers, or alternative toxicity classifiers). Without such ablations, the Optimization-Induced Phase Alignment could be an artifact of the specific GRPO implementation rather than an intrinsic property of the denoising trajectory.

    Authors: We agree that additional controls are required to isolate whether the phase alignment is intrinsic to the denoising trajectory or tied to the hierarchical GRPO design. In the revised manuscript we will add a flat GRPO ablation that removes the high-level prompt editor and optimizes directly over the full trajectory, as well as results with an alternative toxicity classifier. At the same time, the existing targeted phase perturbations already demonstrate selectivity: early-window interventions suppress conceptual toxicity while late-window interventions affect detail harms. This differential effect occurs within the same hierarchical setup, providing initial evidence that the temporal structure is not solely an artifact. The new flat baseline will further strengthen this separation. revision: yes

  2. Referee: [§4 and Table 1 (or equivalent results table)] The headline 68% ASR improvement is presented without reported statistical tests, variance across runs, exact baseline implementations, or ablation breakdowns that isolate the contribution of phase-aware perturbations versus the hierarchical RL setup itself. This makes it impossible to assess whether the gain is robust or primarily driven by the un-isolated design choices.

    Authors: We concur that variance, statistical tests, and component-wise ablations are necessary for assessing robustness. The revised manuscript will update Table 1 with standard deviations across runs, include paired statistical tests for the ASR differences, and expand the experimental section with precise baseline implementation details. We will also add an ablation table that separately quantifies the contribution of the phase-aware perturbations from the hierarchical RL components, allowing readers to evaluate whether the reported gain is driven by the full design or by specific elements. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical trajectory observations independent of inputs

full rationale

The paper presents STARE as a hierarchical RL framework using GRPO to attack T2I models, with the central results being an observed 68% ASR gain over external baselines and the empirical discovery of Optimization-Induced Phase Alignment via direct inspection of denoising trajectories. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text that would reduce these observations to the experimental setup by construction. The claim that the phase structure is a 'genuine causal handle rather than a side effect' is asserted on the basis of targeted perturbation experiments, but remains an empirical interpretation rather than a definitional or self-referential derivation. The analysis therefore rests on external benchmarks and direct observation rather than on self-referential construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Review is based on abstract only; limited visibility into exact assumptions or parameters.

axioms (1)
  • domain assumption: Standard reinforcement learning assumptions (policy gradient validity, reward signal from VLM queries) hold for the GRPO procedure.
    The method relies on GRPO, which inherits typical RL theory assumptions.
invented entities (1)
  • Optimization-Induced Phase Alignment phenomenon · no independent evidence
    purpose: Explains concentration of different toxicity types into early versus late denoising steps under adversarial optimization.
    This is presented as an observed empirical pattern; no independent external evidence is mentioned in the abstract.

pith-pipeline@v0.9.0 · 5544 in / 1356 out tokens · 51672 ms · 2026-05-09T19:12:47.365886+00:00 · methodology

