pith. machine review for the scientific record.

arxiv: 2605.11882 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 1 theorem link · Lean Theorem

On-Policy Self-Evolution via Failure Trajectories for Agentic Safety Alignment


Pith reviewed 2026-05-13 06:24 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · safety alignment · on-policy learning · failure trajectories · self-evolution · Pareto optimization · tool use · agent safety

The pith

Failed trajectories supply the supervision that lets LLM agents evolve safer on-policy behavior.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Tool-using LLM agents fail inside trajectories by making unsafe tool calls or following injected instructions, yet existing alignment methods rely on sparse final-response signals and often degrade task performance. FATE turns each failure into repair candidates generated by the same policy, scores them with verifiers on security, utility, over-refusal, and validity, then filters the best ones into dense trajectory-level training data. This on-policy loop feeds Pareto-Front Policy Optimization, which combines supervised warmup with multi-objective updates so that safety gains do not come at the cost of usefulness. Experiments across AgentDojo, AgentHarm, and ATBench show the approach lowers attack success rate by 33.5% and harmful compliance by 82.6% while also improving external trajectory-safety diagnosis by 6.5%.
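A minimal sketch of one round of that loop, assuming hypothetical "policy" and "verifiers" interfaces; none of these names come from the paper:

    # One FATE-style self-evolution round (illustrative Python sketch).
    # `policy` and `verifiers` are assumed interfaces, not the paper's code.
    def evolve_one_round(policy, tasks, verifiers, k=8, keep=2):
        repair_data = []
        for task in tasks:
            traj = policy.rollout(task)
            if verifiers.passes_all(traj):
                continue  # only failed trajectories feed the loop
            # the same policy proposes k repair candidates for its own failure
            candidates = [policy.propose_repair(task, traj) for _ in range(k)]
            # re-score every candidate on the four verifier criteria
            scored = [(c, verifiers.score(c)) for c in candidates]
            # keep candidates that are secure, valid, and not over-refusing,
            # ranked by utility, as dense trajectory-level supervision
            passing = [(c, s) for c, s in scored
                       if s.security and s.valid and not s.over_refusal]
            passing.sort(key=lambda cs: cs[1].utility, reverse=True)
            repair_data.extend(c for c, _ in passing[:keep])
        policy.update(repair_data)  # the paper uses PFPO here, not plain SFT
        return policy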

Core claim

The paper claims that an on-policy self-evolving framework converts verifier-scored failure trajectories into filtered repair supervision, enabling agents to improve safety without expert demonstrations. The framework uses the same policy to propose repairs, applies multi-criteria verifier filtering, and employs Pareto-Front Policy Optimization to preserve safety-utility trade-offs during self-evolution.

What carries the argument

FATE, the on-policy loop that turns failure trajectories into repair supervision by generating candidates from the current policy, re-scoring them with verifiers, filtering across security-utility-validity criteria, and applying Pareto-Front Policy Optimization for training updates.
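The "Pareto-Front" part rests on a standard dominance test over per-candidate (safety, utility) scores. A self-contained sketch, taking higher as better on both axes; the paper's PFPO is a full policy-optimization procedure, this only shows the frontier selection it builds on:

    # Pareto-front selection over multi-objective scores (illustrative).
    def dominates(a, b):
        # a dominates b: at least as good on every objective,
        # strictly better on at least one
        return (all(x >= y for x, y in zip(a, b))
                and any(x > y for x, y in zip(a, b)))

    def pareto_front(points):
        return [p for p in points
                if not any(dominates(q, p) for q in points if q != p)]

    # (safety, utility) pairs: the dominated third point is dropped
    print(pareto_front([(0.9, 0.4), (0.8, 0.7), (0.7, 0.7)]))
    # -> [(0.9, 0.4), (0.8, 0.7)]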

Load-bearing premise

Verifier-based filtering of repair candidates produces unbiased, high-quality supervision signals that generalize without introducing new failure modes or degrading long-term agent stability.

What would settle it

Replace the verifier filters with random selection during training and check whether the reported reductions in attack success rate and harmful compliance disappear or reverse on the same benchmarks.
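In code, the proposed control is a one-line swap of the selection step; the helper names here are hypothetical:

    import random

    def verifier_filter(candidates, verifiers, keep):
        # treatment: rank repair candidates by total verifier score
        ranked = sorted(candidates, key=verifiers.score_total, reverse=True)
        return ranked[:keep]

    def random_filter(candidates, _verifiers, keep, seed=0):
        # control: same selection budget, no verifier signal
        return random.Random(seed).sample(candidates, keep)

    # Train one policy per filter, then compare attack success rate and
    # harmful compliance on AgentDojo / AgentHarm / ATBench. If the two
    # policies match, the verifiers are not what drives the reported gains.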

Figures

Figures reproduced from arXiv: 2605.11882 by Bo Yin, Qi Li, Xinchao Wang.

Figure 1. Overview of FATE. The current policy mines its own failures, proposes same-policy …
Figure 2. Quality of same-policy repair proposals. (a) Re…
Figure 3. Effect of iterative self-evolution on Qwen3-8B-Instruct. FATE progressively reduces …
read the original abstract

Tool-using LLM agents fail through trajectories rather than only final responses, as they may execute unsafe tool calls, follow injected instructions, comply with harmful requests, or over-refuse benign tasks despite producing a seemingly safe answer. Existing safety-alignment signals are largely response-level or off-policy, and often incur a safety-utility trade-off: improving agent safety comes at the cost of degraded task performance. Such sparse and single-objective rewards severely limit real-world usability. To bridge this gap, we propose FATE, an on-policy self-evolving framework that transforms verifier-scored failures into repair supervision without expert demonstrations. For each failure, the same policy proposes repair candidates, which are then re-scored by verifiers and filtered across security, utility, over-refusal control, and trajectory validity. This dense trajectory-level information is then used as a supervision signal for agent self-evolution. During this process, we further introduce Pareto-Front Policy Optimization (PFPO), combining supervised warmup with Pareto-aware policy optimization to preserve safety-utility trade-offs. Experiments on AgentDojo, AgentHarm, and ATBench show that FATE improves safety across different models and scales while preserving useful behavior. Compared with strong baselines, FATE reduces attack success rate by 33.5%, harmful compliance by 82.6%, and improves external trajectory-safety diagnosis by 6.5%. These results suggest that failed trajectories can provide structured repair supervision for safer self-evolving agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes FATE, an on-policy self-evolving framework for safety alignment of tool-using LLM agents. It transforms verifier-scored failure trajectories into repair supervision by having the policy propose candidates that are filtered on security, utility, over-refusal, and validity criteria. This is combined with Pareto-Front Policy Optimization (PFPO) to balance safety and utility. Experiments on AgentDojo, AgentHarm, and ATBench demonstrate reductions in attack success rate by 33.5%, harmful compliance by 82.6%, and improved trajectory-safety diagnosis by 6.5% compared to baselines.

Significance. If the verifier filtering provides unbiased high-quality signals that generalize, FATE could offer a scalable way to improve agent safety using self-generated data without expert demonstrations, addressing the safety-utility trade-off in agentic systems. The on-policy nature and trajectory-level supervision are promising for real-world deployment.

major comments (2)
  1. [Experiments] The reported quantitative gains (33.5% ASR reduction, 82.6% harmful compliance reduction) lack details on baseline implementations, statistical significance testing, data splits, and how post-hoc filtering choices influence the results. This leaves the central empirical claim only partially supported.
  2. [Method] The verifier-based filtering of repair candidates is central to generating supervision signals, but the manuscript provides insufficient details on verifier construction, training data, independence from the evaluation benchmarks (AgentDojo, AgentHarm, ATBench), and validation against human judgments. This raises concerns about potential biases inflating the safety gains.
minor comments (2)
  1. [Abstract] The abstract mentions 'strong baselines' but does not name them; consider specifying in the abstract or early introduction.
  2. [Notation] Ensure consistent use of terms like 'trajectory validity' across sections.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We agree that the manuscript would benefit from expanded details on the experimental setup and verifier methodology to strengthen the empirical claims. We will revise accordingly.

read point-by-point responses
  1. Referee: [Experiments] The reported quantitative gains (33.5% ASR reduction, 82.6% harmful compliance reduction) lack details on baseline implementations, statistical significance testing, data splits, and how post-hoc filtering choices influence the results. This leaves the central empirical claim only partially supported.

    Authors: We acknowledge that the current version provides limited explicit details on these aspects. In the revision, we will expand the Experiments section and add an appendix with: full baseline implementation descriptions and reproduction steps (including any hyperparameter choices), statistical significance testing (e.g., standard deviations across runs and p-values from appropriate tests), explicit data split information for each benchmark, and an ablation analysis showing the impact of different post-hoc filtering thresholds on the reported gains. This will provide stronger support for the central claims. revision: yes

  2. Referee: [Method] The verifier-based filtering of repair candidates is central to generating supervision signals, but the manuscript provides insufficient details on verifier construction, training data, independence from the evaluation benchmarks (AgentDojo, AgentHarm, ATBench), and validation against human judgments. This raises concerns about potential biases inflating the safety gains.

    Authors: We agree that greater transparency is needed here. The revised manuscript will include an expanded Methods subsection detailing verifier construction (architecture and prompting), training data sources and curation process, explicit confirmation of independence from the three evaluation benchmarks, and results from human validation (including agreement rates and any discrepancy analysis). We will also add a brief discussion of potential biases and mitigation steps. These additions directly address the concern about inflated gains. revision: yes
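The significance testing promised in the first response could be as simple as a paired bootstrap over per-seed metric deltas. A sketch with placeholder numbers, not results from the paper:

    import random

    def paired_bootstrap_p(deltas, n=10_000, seed=0):
        # deltas[i] = baseline_asr[i] - fate_asr[i] for training seed i;
        # one-sided p-value for "FATE lowers attack success rate"
        rng = random.Random(seed)
        wins = sum(
            sum(rng.choice(deltas) for _ in deltas) > 0
            for _ in range(n)
        )
        return 1.0 - wins / n

    print(paired_bootstrap_p([0.31, 0.35, 0.29, 0.38, 0.34]))  # ~0.0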

Circularity Check

0 steps flagged

No circularity: empirical on-policy framework with external benchmark validation

full rationale

The paper describes an empirical self-evolution method (FATE) that generates repair candidates from failures, filters them via verifiers on security/utility/validity criteria, and applies Pareto-Front Policy Optimization for training. All reported gains (ASR reduction, compliance drop, diagnosis improvement) are measured against external benchmarks (AgentDojo, AgentHarm, ATBench) and strong baselines. No equations, fitted parameters, or self-citations are used to derive the core claims; the supervision signal is produced procedurally from the policy itself but evaluated independently. The derivation chain is therefore self-contained and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that external verifiers can reliably score security, utility, over-refusal, and validity without systematic bias, plus the premise that Pareto optimization can maintain the safety-utility frontier during self-evolution; no free parameters or invented physical entities are specified.

axioms (1)
  • domain assumption: Verifier models can accurately and consistently score agent trajectories for security, utility, over-refusal control, and trajectory validity.
    The filtering step that produces the supervision signal depends entirely on these verifiers being trustworthy.
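That assumption is testable: score a sample of trajectories with both the verifier and human annotators, then measure chance-corrected agreement. A minimal Cohen's-kappa check, with made-up labels:

    from collections import Counter

    def cohens_kappa(a, b):
        n = len(a)
        observed = sum(x == y for x, y in zip(a, b)) / n
        ca, cb = Counter(a), Counter(b)
        expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
        return (observed - expected) / (1 - expected)

    verifier = ["unsafe", "safe", "safe", "unsafe", "safe"]
    human    = ["unsafe", "safe", "unsafe", "unsafe", "safe"]
    print(round(cohens_kappa(verifier, human), 3))  # 0.615; 1.0 = perfect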

pith-pipeline@v0.9.0 · 5561 in / 1348 out tokens · 43352 ms · 2026-05-13T06:24:57.778564+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · 23 internal anchors

  1. [1]

    Phi-4 Technical Report

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Harrison, Russell J Hewett, Mojan Javaheripi, Piero Kauffmann, et al. Phi-4 technical report.arXiv preprint arXiv:2412.08905, 2024

  2. [2]

    Constrained policy optimization

    Joshua Achiam, David Held, Aviv Tamar, and Pieter Abbeel. Constrained policy optimization. In International Conference on Machine Learning, pages 22–31. PMLR, 2017

  3. [3]

    AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents

    Maksym Andriushchenko, Alexandra Souly, Mateusz Dziemian, Derek Duenas, Maxwell Lin, Justin Wang, Dan Hendrycks, Andy Zou, Zico Kolter, Matt Fredrikson, et al. Agentharm: A benchmark for measuring harmfulness of llm agents.arXiv preprint arXiv:2410.09024, 2024

  4. [4]

    Training a Helpful and Harmless Assistant with Reinforcement Learning from Human Feedback

    Yuntao Bai, Andy Jones, Kamal Ndousse, Amanda Askell, Anna Chen, Nova DasSarma, Dawn Drain, Stanislav Fort, Deep Ganguli, Tom Henighan, et al. Training a helpful and harmless assistant with reinforcement learning from human feedback.arXiv preprint arXiv:2204.05862, 2022

  5. [5]

    Safety-Tuned LLaMAs: Lessons from Improving the Safety of Large Language Models that Follow Instructions

    Federico Bianchi, Mirac Suzgun, Giuseppe Attanasio, Paul Röttger, Dan Jurafsky, Tatsunori Hashimoto, and James Zou. Safety-tuned llamas: Lessons from improving the safety of large language models that follow instructions.arXiv preprint arXiv:2309.07875, 2023

  6. [6]

    Shieldagent: Shielding agents via verifiable safety policy reasoning

    Zhaorun Chen, Mintong Kang, and Bo Li. Shieldagent: Shielding agents via verifiable safety policy reasoning.arXiv preprint arXiv:2503.22738, 2025

  7. [7]

    Training Verifiers to Solve Math Word Problems

    Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, Mark Chen, Heewoo Jun, Lukasz Kaiser, Matthias Plappert, Jerry Tworek, Jacob Hilton, Reiichiro Nakano, et al. Training verifiers to solve math word problems.arXiv preprint arXiv:2110.14168, 2021

  8. [8]

    A Fast and Elitist Multiobjective Genetic Algorithm: NSGA-II

    Kalyanmoy Deb, Amrit Pratap, Sameer Agarwal, and TAMT Meyarivan. A fast and elitist multiobjective genetic algorithm: Nsga-ii.IEEE transactions on evolutionary computation, 6 (2):182–197, 2002

  9. [9]

    AgentDojo: A Dynamic Environment to Evaluate Prompt Injection Attacks and Defenses for LLM Agents

    Edoardo Debenedetti, Jie Zhang, Mislav Balunovic, Luca Beurer-Kellner, Marc Fischer, and Florian Tramèr. Agentdojo: A dynamic environment to evaluate prompt injection attacks and defenses for llm agents.Advances in Neural Information Processing Systems, 37:82895–82920, 2024

  10. [10]

    WorkArena: How Capable Are Web Agents at Solving Common Knowledge Work Tasks?

    Alexandre Drouin, Maxime Gasse, Massimo Caccia, Issam H Laradji, Manuel Del Verme, Tom Marty, Léo Boisvert, Megh Thakkar, Quentin Cappart, David Vazquez, et al. Workarena: How capable are web agents at solving common knowledge work tasks?arXiv preprint arXiv:2403.07718, 2024

  11. [11]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783, 2024

  12. [12]

    Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection

    Kai Greshake, Sahar Abdelnabi, Shailesh Mishra, Christoph Endres, Thorsten Holz, and Mario Fritz. Not what you’ve signed up for: Compromising real-world llm-integrated applications with indirect prompt injection. In Proceedings of the 16th ACM Workshop on Artificial Intelligence and Security, pages 79–90, 2023

  13. [13]

    WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs

    Seungju Han, Kavel Rao, Allyson Ettinger, Liwei Jiang, Bill Yuchen Lin, Nathan Lambert, Yejin Choi, and Nouha Dziri. Wildguard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of llms.Advances in neural information processing systems, 37:8093–8131, 2024

  14. [14]

    A Practical Guide to Multi-Objective Reinforcement Learning and Planning

    Conor F Hayes, Roxana Rădulescu, Eugenio Bargiacchi, Johan Källström, Matthew Macfarlane, Mathieu Reymond, Timothy Verstraeten, Luisa M Zintgraf, Richard Dazeley, Fredrik Heintz, et al. A practical guide to multi-objective reinforcement learning and planning. Autonomous Agents and Multi-Agent Systems, 36(1):26, 2022

  15. [15]

    LoRA: Low-Rank Adaptation of Large Language Models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. LoRA: Low-rank adaptation of large language models. ICLR, 1(2):3, 2022

  16. [16]

    Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

    Hakan Inan, Kartikeya Upasani, Jianfeng Chi, Rashi Rungta, Krithika Iyer, Yuning Mao, Michael Tontchev, Qing Hu, Brian Fuller, Davide Testuggine, et al. Llama guard: Llm-based input-output safeguard for human-ai conversations.arXiv preprint arXiv:2312.06674, 2023

  17. [17]

    Co-Evolving Agents: Learning from Failures as Hard Negatives

    Yeonsung Jung, Trilok Padhi, Sina Shaham, Dipika Khullar, Joonhyun Jeong, Ninareh Mehrabi, and Eunho Yang. Co-evolving agents: Learning from failures as hard negatives.arXiv preprint arXiv:2511.22254, 2025

  18. [18]

    CoLA: A Choice Leakage Attack Framework to Expose Privacy Risks in Subset Training

    Qi Li, Cheng-Long Wang, Yinzhi Cao, and Di Wang. Cola: A choice leakage attack framework to expose privacy risks in subset training.arXiv preprint arXiv:2604.12342, 2026

  19. [19]

    ATBench: A Diverse and Realistic Agent Trajectory Benchmark for Safety Evaluation and Diagnosis

    Yu Li, Haoyu Luo, Yuejin Xie, Yuqian Fu, Zhonghao Yang, Shuai Shao, Qihan Ren, Wanying Qu, Yanwei Fu, Yujiu Yang, et al. Atbench: A diverse and realistic trajectory benchmark for long-horizon agent safety.arXiv preprint arXiv:2604.02022, 2026

  20. [20]

    Let’s verify step by step

    Hunter Lightman, Vineet Kosaraju, Yuri Burda, Harrison Edwards, Bowen Baker, Teddy Lee, Jan Leike, John Schulman, Ilya Sutskever, and Karl Cobbe. Let’s verify step by step. In The Twelfth International Conference on Learning Representations, 2023

  21. [21]

    Ministral 3

    Alexander H Liu, Kartik Khandelwal, Sandeep Subramanian, Victor Jouault, Abhinav Rastogi, Adrien Sadé, Alan Jeffares, Albert Jiang, Alexandre Cahill, Alexandre Gavaudan, et al. Ministral 3.arXiv preprint arXiv:2601.08584, 2026

  22. [22]

    AgentDoG: A Diagnostic Guardrail Framework for AI Agent Safety and Security

    Dongrui Liu, Qihan Ren, Chen Qian, Shuai Shao, Yuejin Xie, Yu Li, Zhonghao Yang, Haoyu Luo, Peng Wang, Qingyu Liu, et al. Agentdog: A diagnostic guardrail framework for ai agent safety and security.arXiv preprint arXiv:2601.18491, 2026

  23. [23]

    AgentBench: Evaluating LLMs as Agents

    Xiao Liu, Hao Yu, Hanchen Zhang, Yifan Xu, Xuanyu Lei, Hanyu Lai, Yu Gu, Hangliang Ding, Kaiwen Men, Kejuan Yang, et al. Agentbench: Evaluating llms as agents.arXiv preprint arXiv:2308.03688, 2023

  24. [24]

    Formalizing and benchmarking prompt injection attacks and defenses

    Yupei Liu, Yuqi Jia, Runpeng Geng, Jinyuan Jia, and Neil Zhenqiang Gong. Formalizing and benchmarking prompt injection attacks and defenses. In 33rd USENIX Security Symposium (USENIX Security 24), pages 1831–1847, 2024

  25. [25]

    Self-Refine: Iterative Refinement with Self-Feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, Skyler Hallinan, Luyu Gao, Sarah Wiegreffe, Uri Alon, Nouha Dziri, Shrimai Prabhumoye, Yiming Yang, et al. Self-refine: Iterative refinement with self-feedback.Advances in neural information processing systems, 36:46534–46594, 2023

  26. [26]

    HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

    Mantas Mazeika, Long Phan, Xuwang Yin, Andy Zou, Zifan Wang, Norman Mu, Elham Sakhaee, Nathaniel Li, Steven Basart, Bo Li, et al. Harmbench: A standardized evaluation framework for automated red teaming and robust refusal.arXiv preprint arXiv:2402.04249, 2024

  27. [27]

    Nonlinear Multiobjective Optimization

    Kaisa Miettinen. Nonlinear multiobjective optimization, volume 12. Springer Science & Business Media, 1999

  28. [28]

    Training Language Models to Follow Instructions with Human Feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback.Advances in neural information processing systems, 35:27730–27744, 2022

  29. [29]

    ToolLLM: Facilitating Large Language Models to Master 16000+ Real-world APIs

    Yujia Qin, Shihao Liang, Yining Ye, Kunlun Zhu, Lan Yan, Yaxi Lu, Yankai Lin, Xin Cong, Xiangru Tang, Bill Qian, et al. Toolllm: Facilitating large language models to master 16000+ real-world apis.arXiv preprint arXiv:2307.16789, 2023

  30. [30]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in neural information processing systems, 36:53728–53741, 2023

  31. [31]

    A reduction of imitation learning and structured prediction to no-regret online learning

    Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In Proceedings of the Fourteenth International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR Workshop and Conference Proceedings, 2011

  32. [32]

    Xstest: A test suite for identifying exaggerated safety behaviours in large language models

    Paul Röttger, Hannah Kirk, Bertie Vidgen, Giuseppe Attanasio, Federico Bianchi, and Dirk Hovy. Xstest: A test suite for identifying exaggerated safety behaviours in large language models. In Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),...

  33. [33]

    Identifying the Risks of LM Agents with an LM-Emulated Sandbox

    Yangjun Ruan, Honghua Dong, Andrew Wang, Silviu Pitis, Yongchao Zhou, Jimmy Ba, Yann Dubois, Chris J Maddison, and Tatsunori Hashimoto. Identifying the risks of lm agents with an lm-emulated sandbox.arXiv preprint arXiv:2309.15817, 2023

  34. [34]

    Toolformer: Language Models Can Teach Themselves to Use Tools

    Timo Schick, Jane Dwivedi-Yu, Roberto Dessì, Roberta Raileanu, Maria Lomeli, Eric Hambro, Luke Zettlemoyer, Nicola Cancedda, and Thomas Scialom. Toolformer: Language models can teach themselves to use tools.Advances in neural information processing systems, 36: 68539–68551, 2023

  35. [35]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

  36. [36]

    Your Agent May Misevolve: Emergent Risks in Self-Evolving LLM Agents

    Shuai Shao, Qihan Ren, Chen Qian, Boyi Wei, Dadi Guo, Jingyi Yang, Xinhao Song, Linfeng Zhang, Weinan Zhang, Dongrui Liu, et al. Your agent may misevolve: Emergent risks in self-evolving llm agents.arXiv preprint arXiv:2509.26354, 2025

  37. [37]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, YK Li, Yang Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  38. [38]

    Reflexion: Language Agents with Verbal Reinforcement Learning

    Noah Shinn, Federico Cassano, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning.Advances in neural information processing systems, 36:8634–8652, 2023

  39. [39]

    Towards lifecycle unlearning commitment management: Measuring sample-level unlearning completeness

    Cheng-Long Wang, Qi Li, Zihang Xiang, Yinzhi Cao, and Di Wang. Towards lifecycle unlearning commitment management: Measuring sample-level unlearning completeness. In 34th USENIX Security Symposium (USENIX Security 25), pages 6481–6500, 2025

  40. [40]

    Learning from Failure: Integrating Negative Examples when Fine-Tuning Large Language Models as Agents

    Renxi Wang, Haonan Li, Xudong Han, Yixuan Zhang, and Timothy Baldwin. Learning from failure: Integrating negative examples when fine-tuning large language models as agents.arXiv preprint arXiv:2402.11651, 2024

  41. [41]

    Jailbroken: How Does LLM Safety Training Fail?

    Alexander Wei, Nika Haghtalab, and Jacob Steinhardt. Jailbroken: How does llm safety training fail?Advances in neural information processing systems, 36:80079–80110, 2023

  42. [42]

    OSWorld: Benchmarking Multimodal Agents for Open-Ended Tasks in Real Computer Environments

    Tianbao Xie, Danyang Zhang, Jixuan Chen, Xiaochuan Li, Siheng Zhao, Ruisheng Cao, Toh J Hua, Zhoujun Cheng, Dongchan Shin, Fangyu Lei, et al. Osworld: Benchmarking multimodal agents for open-ended tasks in real computer environments.Advances in Neural Information Processing Systems, 37:52040–52094, 2024

  43. [43]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report.arXiv preprint arXiv:2505.09388, 2025

  44. [44]

    ReAct: Synergizing Reasoning and Acting in Language Models

    Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. React: Synergizing reasoning and acting in language models.arXiv preprint arXiv:2210.03629, 2022

  45. [45]

    Refinement Provenance Inference: Detecting LLM-Refined Training Prompts from Model Behavior

    Bo Yin, Qi Li, Runpeng Yu, and Xinchao Wang. Refinement provenance inference: Detecting llm-refined training prompts from model behavior.arXiv preprint arXiv:2601.01966, 2026

  46. [46]

    Discrete Diffusion in Large Language and Multimodal Models: A Survey

    Runpeng Yu, Qi Li, and Xinchao Wang. Discrete diffusion in large language and multimodal models: A survey.arXiv preprint arXiv:2506.13759, 2025

  47. [47]

    STaR: Bootstrapping Reasoning with Reasoning

    Eric Zelikman, Yuhuai Wu, Jesse Mu, and Noah Goodman. STaR: Bootstrapping reasoning with reasoning. Advances in Neural Information Processing Systems, 35:15476–15488, 2022

  48. [48]

    Qwen3Guard Technical Report

    Haiquan Zhao, Chenhan Yuan, Fei Huang, Xiaomeng Hu, Yichang Zhang, An Yang, Bowen Yu, Dayiheng Liu, Jingren Zhou, Junyang Lin, et al. Qwen3guard technical report.arXiv preprint arXiv:2510.14276, 2025

  49. [49]

    On Safety Risks in Experience-Driven Self-Evolving Agents

    Weixiang Zhao, Yichen Zhang, Yingshuo Wang, Yang Deng, Yanyan Zhao, Xuda Zhi, Yongbo Huang, Wanxiang Che, Bing Qin, Ting Liu, et al. On safety risks in experience-driven self-evolving agents. arXiv preprint arXiv:2604.16968, 2026

  50. [50]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

  51. [51]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models. arXiv preprint arXiv:2307.15043, 2023