
arxiv: 2605.20654 · v1 · pith:46ZH5J3Dnew · submitted 2026-05-20 · 💻 cs.LG · cs.AI

REFLECTOR: Internalizing Step-wise Reflection against Indirect Jailbreak

Pith reviewed 2026-05-21 06:12 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords jailbreak defense · self-reflection · LLM safety · reinforcement learning · supervised fine-tuning · indirect attacks · trajectory-level safety

The pith

Reflector internalizes self-reflection in LLMs to defend against indirect jailbreaks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Reflector, a two-stage method that embeds step-wise reflection directly into an LLM's generation process. First, it collects high-quality reflection examples from a teacher model and uses them for supervised fine-tuning; then, it applies reinforcement learning with outcome and validity rewards to make the reflection autonomous. The goal is to move safety from surface-level checks to internal trajectory-level behavior so that the model can resist multi-step indirect attacks. If this holds, the model would achieve strong defense rates while also improving accuracy on reasoning tasks, rather than trading capability for safety.

Core claim

Reflector is a two-stage framework that first performs teacher-guided supervised fine-tuning to establish structured reflection patterns and then applies reinforcement learning with outcome-driven and reward-validity supervision to internalize autonomous self-reflection, resulting in defense success rates above 90 percent against complex indirect jailbreaks and a 5.85 percent gain on GSM8K.
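
The reinforcement learning stage can be illustrated with a minimal sketch. It assumes, based only on the Figure 2 caption, that each sampled response receives a safety reward and a reflection-validity reward, and that GDPO (Group reward-Decoupled Normalization Policy Optimization) normalizes each reward within a group of rollouts before the two are combined. The function names, the equal weights, and the sample rewards are hypothetical; the paper's actual reward definitions and policy update are not reproduced here.

```python
from statistics import mean, pstdev


def normalize_within_group(rewards, eps=1e-6):
    """Z-score one reward component across the rollouts sampled for one prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def hybrid_advantages(safety_rewards, reflect_rewards, w_safety=1.0, w_reflect=1.0):
    """Normalize each reward separately, then sum them into one advantage per rollout.

    Assumption: the weights and the plain weighted sum are illustrative choices,
    not values or formulas taken from the paper.
    """
    if len(safety_rewards) != len(reflect_rewards):
        raise ValueError("each rollout needs both a safety and a reflection reward")
    z_safety = normalize_within_group(safety_rewards)
    z_reflect = normalize_within_group(reflect_rewards)
    return [w_safety * s + w_reflect * r for s, r in zip(z_safety, z_reflect)]


# Hypothetical group of four rollouts for a single prompt.
advantages = hybrid_advantages(
    safety_rewards=[1.0, 0.0, 1.0, 0.0],
    reflect_rewards=[1.0, 1.0, 0.0, 0.0],
)
```

In this toy group, the rollout that is both safe and reflects validly gets the largest advantage, and the one that fails on both gets the smallest. Normalizing each component on its own scale keeps one reward from dominating the other simply because its raw values vary more.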

What carries the argument

The Reflector two-stage pipeline that internalizes trajectory-level safety by turning teacher-generated reflection data into autonomous, step-wise self-correction during generation.

If this is right

  • Defense success rates exceed 90 percent on complex indirect attacks.
  • The method generalizes robustly across diverse threat scenarios.
  • Task performance improves, including a 5.85 percent gain on GSM8K and better results on knowledge benchmarks.
  • Safety is added at the trajectory level without significant computational overhead.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same internalization approach could be tested on other alignment problems such as reducing hallucination or bias.
  • Training on a mixture of reflection data from multiple teacher models might increase robustness to teacher-specific biases.
  • Measuring whether the added reflection steps remain stable under distribution shift in user prompts would be a direct next experiment.

Load-bearing premise

High-quality reflection data from a teacher model can be internalized via reinforcement learning to produce robust autonomous self-reflection that generalizes without creating new vulnerabilities or overhead.

What would settle it

A new set of indirect jailbreak prompts never seen in training that causes defense success rate to drop below 70 percent while task performance on GSM8K remains unchanged or declines.
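
The falsification test above can be written down as a small check. This is a sketch under stated assumptions: it treats the defense success rate as the fraction of attack attempts that did not succeed, and the function names, the 0.70 floor parameter, and the sample numbers are hypothetical rather than taken from the paper.

```python
def defense_success_rate(attack_succeeded):
    """Fraction of attack attempts the model defended against.

    Assumption: DSR is computed as defended attempts divided by all attempts.
    """
    if not attack_succeeded:
        raise ValueError("need at least one attack outcome")
    defended = sum(1 for succeeded in attack_succeeded if not succeeded)
    return defended / len(attack_succeeded)


def claim_is_refuted(attack_succeeded, gsm8k_before, gsm8k_after, dsr_floor=0.70):
    """True when DSR on unseen attacks is below the floor and GSM8K did not improve."""
    dsr = defense_success_rate(attack_succeeded)
    return dsr < dsr_floor and gsm8k_after <= gsm8k_before


# Hypothetical held-out evaluation: 35 of 100 unseen attacks succeed.
outcomes = [True] * 35 + [False] * 65
refuted = claim_is_refuted(outcomes, gsm8k_before=0.80, gsm8k_after=0.80)
```

Both conditions must hold together: a low defense rate alone, or flat task accuracy alone, would not meet the stated criterion.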

Figures

Figures reproduced from arXiv: 2605.20654 by Bo Zou, Chaochao Lu, Chao Yang, Jiachen Ma, Jiawen Zhang, Xiangtian Li.

Figure 1: Correlation between the position of the first occurrence of harmful tokens and the attack success rate (ASR). While direct jailbreaks (blue) manifest immediately, indirect attacks (red) exhibit a stealthy latency, with harmful content emerging only after 20 tokens. This delay enables malicious intent to bypass surface-level safety alignment, leading to significantly higher ASRs than direct attacks.
Figure 2: The framework of REFLECTOR. In Stage 1 (SFT), the model learns the “search-and-recovery” reflection pattern from teacher-guided data. In Stage 2 (RL), the model undergoes self-improvement via GDPO, guided by a hybrid reward function that jointly optimizes for final response safety (r_safety) and the validity of the reflection process (r_reflect).
Figure 3: Impact of safety data scaling. (a) Increasing safety data yields initial gains but ultimately degrades general performance due to over-alignment. (b) Higher safety ratios consistently strengthen reflective defenses against overtly harmful queries.
Abstract

While Large Language Models (LLMs) demonstrate remarkable capabilities, they remain susceptible to sophisticated, multi-step jailbreak attacks that circumvent conventional surface-level safety alignment by exploiting the internal generation process. To address these vulnerabilities, we propose Reflector, a principled two-stage framework that internalizes self-reflection within the generation trajectory. Reflector first leverages teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT), establishing structured reflection patterns. It subsequently uses Reinforcement Learning (RL) with outcome-driven and reward-validity supervision to instill robust, autonomous self-reflection capabilities. Empirical results show that Reflector achieves Defense Success Rates (DSR) exceeding 90% against complex indirect attacks while generalizing robustly across diverse threat scenarios. Notably, the framework enhances both task-specific and general utility, yielding a 5.85% gain on GSM8K alongside improved performance on knowledge-intensive benchmarks. By internalizing trajectory-level safety, Reflector overcomes the fundamental limitations of surface alignment without significant computational overhead, offering an efficient and scalable solution for the development of safe and capable LLMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 3 minor

Summary. The paper introduces REFLECTOR, a two-stage framework for internalizing step-wise self-reflection in LLMs to defend against indirect jailbreak attacks. The first stage uses teacher-guided generation to produce high-quality reflection data for supervised fine-tuning (SFT). The second stage applies reinforcement learning (RL) with outcome-driven and reward-validity supervision to enable autonomous self-reflection during generation. Empirical results claim Defense Success Rates (DSR) exceeding 90% against complex indirect attacks with robust generalization across threat scenarios, plus a 5.85% gain on GSM8K and improvements on knowledge-intensive benchmarks.

Significance. If the results hold, this work is significant for LLM safety research. It moves beyond surface-level alignment by embedding trajectory-level reflection via a practical SFT-then-RL pipeline, which could scale to other safety properties while preserving or enhancing utility as shown by the GSM8K gains. The explicit use of outcome-driven and reward-validity signals in RL is a concrete strength that supports the claim of autonomous reflection without added overhead.

minor comments (3)
  1. [Abstract and Experiments] The abstract and experimental results section would benefit from explicit mention of the number of attack instances, attack construction protocol, and statistical significance tests supporting the DSR >90% and GSM8K claims.
  2. [Method (RL stage)] Clarify in the methods how the reward-validity supervision is implemented to prevent reward hacking during RL; while the overall procedure is consistent, a short pseudocode or equation would improve reproducibility.
  3. [Figures] Figure captions and legends should more clearly distinguish between different indirect attack variants and baseline defenses to aid reader interpretation of the generalization results.
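
The first comment, on statistical support for the DSR claim, can be made concrete with a confidence interval on a success proportion. The sketch below uses the Wilson score interval, which is one common choice and not something the paper is known to report; the counts are hypothetical and only show how the number of attack instances affects the claim.

```python
import math


def wilson_interval(successes, trials, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 gives roughly 95%)."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p = successes / trials
    denom = 1 + z**2 / trials
    center = (p + z**2 / (2 * trials)) / denom
    half_width = z * math.sqrt(p * (1 - p) / trials + z**2 / (4 * trials**2)) / denom
    return center - half_width, center + half_width


# Hypothetical evaluation: 92 of 100 indirect attacks defended.
low, high = wilson_interval(successes=92, trials=100)
```

With these made-up counts the point estimate is 92 percent, yet the lower bound of the interval falls below 90 percent. That is why reporting the number of attack instances alongside the DSR matters for judging the claim.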

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of REFLECTOR and the recommendation for minor revision. We appreciate the recognition that the two-stage SFT-then-RL pipeline with outcome-driven and validity rewards represents a meaningful advance in embedding trajectory-level reflection for indirect jailbreak defense while preserving utility.

Circularity Check

0 steps flagged

No significant circularity


The paper describes an empirical two-stage training procedure (teacher-guided SFT to seed reflection patterns, followed by RL with outcome-driven and reward-validity signals) whose performance is evaluated on external benchmarks such as DSR against indirect jailbreaks and accuracy on GSM8K. No equations, self-definitional constructs, fitted inputs renamed as predictions, or load-bearing self-citations appear in the derivation chain. The central claims rest on measured generalization across threat scenarios rather than any reduction of outputs to inputs by construction, rendering the framework self-contained against external evaluation.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the transferability of teacher-generated reflection patterns and the ability of RL to produce autonomous safety behavior without side effects.

axioms (2)
  • domain assumption: Teacher-guided generation produces high-quality reflection data suitable for SFT that transfers to autonomous use.
    Invoked in the first stage of the framework to establish structured reflection patterns.
  • domain assumption: Outcome-driven and reward-validity supervision in RL can instill robust self-reflection without degrading general capabilities.
    Central to the second stage and the claim of both safety and utility gains.

pith-pipeline@v0.9.0 · 5730 in / 1328 out tokens · 34316 ms · 2026-05-21T06:12:12.821336+00:00 · methodology




    Lifelong Safety Alignment for Language Models , author=. arXiv preprint arXiv:2505.20259 , year=