pith. machine review for the scientific record.

arxiv: 2605.06992 · v1 · submitted 2026-05-07 · 💻 cs.LG · stat.ML

Recognition: 2 Lean theorem links

Why Does Agentic Safety Fail to Generalize Across Tasks?

Nadav Cohen, Tomer Slor, Yoav Nagel, Yonatan Slutzky, Yotam Alexander

Pith reviewed 2026-05-11 01:51 UTC · model grok-4.3

classification 💻 cs.LG stat.ML
keywords: agentic safety · generalization · Lipschitz constant · H-infinity control · linear-quadratic control · AI agents

The pith

Safety requirements increase the sensitivity of optimal controllers to task specifications compared to non-safe execution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that failures of safety to generalize across tasks in AI agents stem from an inherent property of safety itself, not merely from training shortcomings. It proves this by showing that the mapping from task specification to optimal controller has a higher Lipschitz constant when an H∞-robustness safety requirement is included than when it is not. The analysis is carried out in linear-quadratic control and backed by experiments in which a neural-network agent navigates a simulated quadcopter and an LLM agent handles customer relationship management (CRM) tasks; safety degrades on unseen tasks while execution holds. A sympathetic reader cares because it indicates that scaling up current training will not fix safe generalization in multi-task agents.

Core claim

In linear-quadratic control with H∞-robustness, the mapping from task specification to an optimal controller has a higher Lipschitz constant with safety requirements than without, yielding a Lipschitz bound of independent interest. This shows that the relationship between a task and its safe execution is more complex than between a task and its execution alone.
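
In notation assumed here for concreteness (the paper's exact statement may differ), with θ the task specification and K⋆(θ) the synthesized controller, the comparison is

$$
L_{\mathrm{safe}} \;=\; \sup_{\theta \neq \theta'} \frac{\lVert K^{\star}_{\mathrm{safe}}(\theta) - K^{\star}_{\mathrm{safe}}(\theta') \rVert}{\lVert \theta - \theta' \rVert}
\;>\;
\sup_{\theta \neq \theta'} \frac{\lVert K^{\star}_{\mathrm{nom}}(\theta) - K^{\star}_{\mathrm{nom}}(\theta') \rVert}{\lVert \theta - \theta' \rVert}
\;=\; L_{\mathrm{nom}},
$$

where "safe" denotes LQ synthesis under the H∞ constraint and "nom" the nominal LQ synthesis.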

What carries the argument

The Lipschitz constant of the mapping from task specification to optimal controller, which is proven larger when H∞-robustness is added to the linear-quadratic objective.
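
A minimal numerical sketch of this mechanism in the simplest possible instance, a scalar LQ problem; the dynamics, weights, and attenuation level are illustrative choices, not the paper's:

```python
# Toy probe of the separation claim: scalar continuous-time LQ control,
# with and without an H-infinity disturbance-attenuation term. The "task
# specification" is the state-cost weight q; we estimate the local
# sensitivity |dK/dq| of the map q -> optimal feedback gain by central
# finite differences. All constants are illustrative, not the paper's.
import numpy as np

A, B, R = -1.0, 1.0, 1.0   # dynamics x' = A x + B u + w, control weight R
GAMMA = 1.2                # attenuation level, chosen so B**2/R - 1/GAMMA**2 > 0

def lqr_gain(q):
    # Scalar nominal ARE: 2*A*p - p**2 * B**2 / R + q = 0; stabilizing root p > 0.
    p = (A * R + np.sqrt(A**2 * R**2 + q * R * B**2)) / B**2
    return B * p / R       # K = R^{-1} B^T P

def hinf_gain(q, gamma=GAMMA):
    # Scalar H-infinity ARE: 2*A*p + q - p**2 * (B**2/R - 1/gamma**2) = 0.
    c = B**2 / R - 1.0 / gamma**2   # effective control authority, shrunk by safety
    p = (A + np.sqrt(A**2 + q * c)) / c
    return B * p / R

def local_sensitivity(gain_fn, q, eps=1e-6):
    return abs(gain_fn(q + eps) - gain_fn(q - eps)) / (2 * eps)

for q in (0.5, 1.0, 2.0):
    print(f"q={q:3.1f}  |dK/dq| nominal={local_sensitivity(lqr_gain, q):.3f}"
          f"  H-infinity={local_sensitivity(hinf_gain, q):.3f}")
```

In this scalar instance the H-infinity sensitivity exceeds the nominal one for every q > 0, because the attenuation term shrinks the effective control authority; this is only a toy echo of the separation the paper proves, not a substitute for its bound.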

If this is right

  • Safety generalization requires approaches different from those that improve task execution alone.
  • Empirical safety failures on unseen tasks reflect structural complexity rather than insufficient training.
  • The larger Lipschitz constant means the safe controller can vary more sharply with the task: small task changes may demand large controller changes when safety is enforced.
  • Current efforts to enhance agentic safety are likely insufficient for multi-task deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Safety fine-tuning techniques may inherit similar sensitivity issues when applied to tasks outside the training distribution.
  • Task-independent safety layers or constraints could reduce dependence on the sensitive mapping.
  • The theoretical bound might be tested by measuring controller variation directly in more complex simulated agents.

Load-bearing premise

Linear-quadratic control with H∞-robustness sufficiently captures the essential structure of safety in general agentic settings with neural networks or LLMs.

What would settle it

Finding that safety performance generalizes to new tasks at the same rate as task execution in a neural network or LLM agent, without new mechanisms, would contradict the claim.

Figures

Figures reproduced from arXiv: 2605.06992 by Nadav Cohen, Tomer Slor, Yoav Nagel, Yonatan Slutzky, Yotam Alexander.

Figure 1. In the theoretically analyzed setting (linear-quadratic control with …) [caption truncated at source]
Figure 2. Demonstration of the separation of Lipschitz constants derived in Theorem 1: the Lipschitz constant of … [caption truncated at source]
Figure 3. In the theoretically analyzed setting (linear-quadratic control with …) [caption truncated at source]
Figure 4. In the theoretically analyzed setting (linear-quadratic control with …) [caption truncated at source]
read the original abstract

AI agents are increasingly deployed in multi-task settings, where the task to perform is specified at test time, and the agent must generalize to unseen tasks. A major concern in such settings is safety: often, an agent must not only execute unseen tasks, but do so while avoiding risks and handling ones that materialize. Empirical evidence suggests that even when the ability to execute generalizes to unseen tasks, the ability to do so safely frequently does not. This paper provides theory and experiments indicating that failures of agentic safety to generalize across tasks are not merely due to limitations of training methods, but reflect an inherent property of safety itself: the relationship between a task and its safe execution is more complex than the relationship between a task and its execution alone. Theoretically, we analyze linear-quadratic control with $H_{\infty}$-robustness, and prove that the mapping from task specification to an optimal controller has higher Lipschitz constant with safety requirements than without, yielding a Lipschitz bound of independent interest. Empirically, we demonstrate our conclusions in simulated quadcopter navigation with a neural network agent and in CRM with an LLM agent. Our findings suggest that current efforts to enhance agentic safety may be insufficient, and point to a need for fundamentally different approaches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that failures of agentic safety to generalize across tasks stem from an inherent structural property rather than training limitations: the mapping from task specification to an optimal controller has a strictly higher Lipschitz constant when H∞ safety constraints are imposed than in the nominal case. This is proven for linear-quadratic control, yielding a Lipschitz bound of independent interest, and illustrated empirically via neural-network policies on quadcopter navigation and LLM agents on CRM tasks.

Significance. If the higher-Lipschitz property under safety constraints extends beyond the LQ setting, the result would establish that safety generalization gaps are fundamental, shifting research from incremental training improvements toward new paradigms for safe multi-task agents. The control-theoretic derivation of the Lipschitz inflation and the bound itself constitute a clear technical contribution. The empirical sections usefully demonstrate the phenomenon in two distinct domains but do not yet confirm the proposed mechanism.

major comments (2)
  1. [§3] §3 (Theoretical Analysis): The proof that safety requirements strictly increase the Lipschitz constant of the task-to-controller map is derived exclusively for linear dynamics, quadratic costs, and H∞-robust synthesis. No argument is given showing that this inflation persists for the nonlinear neural-network or LLM policies used in the experiments, nor that it dominates other factors such as non-convexity or prompt sensitivity.
  2. [§4] §4 (Empirical Evaluation): The quadcopter and CRM experiments report generalization gaps under safety constraints but contain no direct measurement or bound on the Lipschitz constant of the learned policy map, no ablation isolating the effect of the safety term, and no error bars or statistical tests. Consequently the empirical results illustrate the problem without testing whether the theoretical Lipschitz mechanism is operative or causal.
minor comments (2)
  1. [Abstract] The abstract states that the analysis 'yields a Lipschitz bound of independent interest' yet neither states the explicit form of the bound nor compares its tightness to existing H∞ results.
  2. [§2] Notation for the task-to-controller map and the safe versus nominal Lipschitz constants is introduced without a consolidated table or diagram, making cross-references between the proof and the empirical claims harder to follow.
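
A note on the referee's second major comment: computing the exact Lipschitz constant of a neural policy map is intractable, but a sampled lower bound is cheap. A minimal sketch, with `policy`, `sample_task`, and `probe_obs` as hypothetical stand-ins for the paper's agents:

```python
# Sampled lower bound on the Lipschitz constant of the task -> policy map.
# `policy` and `sample_task` are hypothetical stand-ins for the paper's agents;
# any finite sample only certifies a LOWER bound on the true constant.
import numpy as np

def empirical_lipschitz_lb(policy, sample_task, probe_obs, n_pairs=1000, rng=None):
    rng = rng or np.random.default_rng(0)
    best = 0.0
    for _ in range(n_pairs):
        t1, t2 = sample_task(rng), sample_task(rng)
        d_task = np.linalg.norm(t1 - t2)
        if d_task < 1e-9:
            continue
        # Distance between the induced policies, measured on fixed probe states.
        d_pol = max(np.linalg.norm(policy(t1, o) - policy(t2, o)) for o in probe_obs)
        best = max(best, d_pol / d_task)
    return best
```

Running this with identical task pairs for safety-trained and nominally trained agents would turn the referee's objection into a measurable comparison, at the cost of only certifying a lower bound.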

Simulated Author's Rebuttal

2 responses · 2 unresolved

We are grateful to the referee for the careful reading and valuable suggestions. The comments help clarify the scope of our theoretical results and the strength of the empirical evidence. We provide point-by-point responses and describe the changes we will make in the revised manuscript.

read point-by-point responses
  1. Referee: [§3] §3 (Theoretical Analysis): The proof that safety requirements strictly increase the Lipschitz constant of the task-to-controller map is derived exclusively for linear dynamics, quadratic costs, and H∞-robust synthesis. No argument is given showing that this inflation persists for the nonlinear neural-network or LLM policies used in the experiments, nor that it dominates other factors such as non-convexity or prompt sensitivity.

    Authors: We concur that our theoretical analysis is confined to the linear-quadratic regulator with H∞ robustness. This setting was selected because it permits a precise derivation of the Lipschitz constant and yields a bound that may be of independent interest in control theory. We present the neural-network and LLM experiments as empirical illustrations of the safety generalization gap in more complex domains, rather than as direct validations of the Lipschitz mechanism. In the revision, we will add a paragraph in §3 discussing potential reasons why the Lipschitz inflation could extend to nonlinear settings, such as the increased sensitivity required to maintain robustness margins under safety constraints. We will also note that factors like non-convex optimization and prompt sensitivity may interact with or amplify this effect, and clarify that our claim is that the structural property contributes to the observed failures, not that it is the sole cause. revision: partial

  2. Referee: [§4] §4 (Empirical Evaluation): The quadcopter and CRM experiments report generalization gaps under safety constraints but contain no direct measurement or bound on the Lipschitz constant of the learned policy map, no ablation isolating the effect of the safety term, and no error bars or statistical tests. Consequently the empirical results illustrate the problem without testing whether the theoretical Lipschitz mechanism is operative or causal.

    Authors: The referee correctly identifies several shortcomings in the empirical section. Computing or bounding the Lipschitz constant of high-dimensional neural network policies is generally intractable, which precluded direct measurement. Similarly, the original experiments did not include ablations or statistical analysis. In the revised version, we will incorporate error bars based on multiple random seeds and perform statistical tests (e.g., t-tests) to assess the significance of the generalization gaps. We will also add ablation experiments that train agents with and without the safety constraints to isolate their contribution to the observed gaps. These changes will provide stronger empirical support, although a direct causal link via Lipschitz measurement in the nonlinear case remains difficult to establish empirically. revision: yes
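
The seed-level analysis the rebuttal proposes is routine; a minimal sketch, with the per-seed generalization gaps as hypothetical placeholder numbers (Welch's t-test, which does not assume equal variances):

```python
# Hypothetical per-seed safety-generalization gaps for two training setups;
# the numbers below are placeholders, not results from the paper.
import numpy as np
from scipy import stats

gaps_safe_constrained = np.array([0.31, 0.28, 0.35, 0.30, 0.33])
gaps_unconstrained    = np.array([0.12, 0.09, 0.14, 0.11, 0.10])

# Welch's t-test: does the safety constraint change the gap significantly?
t, p = stats.ttest_ind(gaps_safe_constrained, gaps_unconstrained, equal_var=False)
print(f"Welch t = {t:.2f}, p = {p:.4f}")
```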

standing simulated objections not resolved
  • Rigorous proof that the Lipschitz constant inflation persists for nonlinear neural-network and LLM policies
  • Direct measurement or bound on the Lipschitz constant of the learned policies in the quadcopter and CRM experiments

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's central theoretical result is a mathematical proof in linear-quadratic control with H∞-robustness showing that the task-to-optimal-controller mapping has a strictly higher Lipschitz constant when safety constraints are imposed. This is derived from first-principles analysis of the control problem and does not reduce to any fitted parameters, self-definitional constructs, or load-bearing self-citations. The Lipschitz bound is explicitly presented as a result of independent interest. Empirical sections on neural quadcopter policies and LLM-based CRM agents are described as demonstrations and illustrations rather than the load-bearing evidence for the generalization claim. No steps match the enumerated circularity patterns; the derivation chain remains self-contained against external control-theoretic benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper rests on the domain assumption that LQ control with H∞ robustness models agentic safety; no free parameters are introduced in the abstract and no new entities are postulated.

axioms (1)
  • domain assumption: Linear-quadratic control with H∞-robustness captures the core relationship between task specification and safe controller selection.
    Invoked as the setting in which the Lipschitz bound is derived and proven.

pith-pipeline@v0.9.0 · 5530 in / 1192 out tokens · 41839 ms · 2026-05-11T01:51:55.471875+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.
