
arxiv: 2512.13788 · v3 · pith:LD3QPJPGnew · submitted 2025-12-15 · 💻 cs.LG · cs.RO

Constrained Policy Optimization via Sampling-Based Weight-Space Projection

Pith reviewed 2026-05-21 16:25 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords constrained policy optimization · safe reinforcement learning · weight-space projection · sampling-based constraints · QCQP projection · safe-by-induction · rollout safety evaluation

The pith

SCPO keeps every policy safe during training by projecting each gradient step into a local safe region built from rollouts and smoothness bounds.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to perform constrained policy optimization when safety constraints can be checked only through rollouts and cannot be differentiated. SCPO builds a local safe region around the current parameters by combining sampled safety evaluations with bounds that limit how much safety can change under small parameter moves, then solves a convex QCQP to project the proposed update into that region. This construction yields a safe-by-induction guarantee: any safe starting policy produces only safe policies afterward, provided each projection remains feasible. In control tasks that supply a stabilizing backup policy, the same mechanism also preserves closed-loop stability while permitting performance gains beyond the backup.

Core claim

SCPO constructs a local safe region by combining rollout-based safety evaluations with smoothness bounds relating parameter perturbations to changes in safety metrics, and projects each gradient update via a convex QCQP. Starting from any safe initialization, every subsequent policy remains safe whenever the projections stay feasible. In the presence of a stabilizing backup policy the method additionally guarantees closed-loop stability while still allowing objective improvement.

What carries the argument

The sampling-based local safe region in weight space, defined by rollout evaluations plus smoothness bounds on safety-metric changes, inside which each gradient step is projected by solving a convex QCQP.
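
A minimal sketch of the projection step, under simplifying assumptions that are not taken from the paper: a single safety metric g with g(θ) ≤ 0 meaning "safe", and a Lipschitz-style bound |g(θ') − g(θ)| ≤ L‖θ' − θ‖. Under those assumptions the local safe region is a Euclidean ball of radius −g(θ)/L around the current parameters, and the QCQP reduces to a closed-form shrink of the proposed step. The names `safe_radius` and `project_step` are hypothetical.

```python
import math

def safe_radius(g_current, lipschitz):
    """Radius of the ball on which g_current + lipschitz * dist <= 0 holds."""
    if g_current > 0:
        raise ValueError("current parameters are not safe (g > 0)")
    return -g_current / lipschitz

def project_step(theta, step, g_current, lipschitz):
    """Project theta + step onto the ball of radius safe_radius around theta.

    Solves  min ||theta_new - (theta + step)||^2
            s.t. ||theta_new - theta||^2 <= r^2,
    which for a single ball constraint amounts to shrinking the step.
    """
    r = safe_radius(g_current, lipschitz)
    norm = math.sqrt(sum(s * s for s in step))
    scale = 1.0 if norm <= r else r / norm
    return [t + scale * s for t, s in zip(theta, step)]

theta = [0.0, 0.0]
step = [3.0, 4.0]  # proposed update with norm 5
new_theta = project_step(theta, step, g_current=-1.0, lipschitz=2.0)
print(new_theta)   # the step is shrunk to norm 0.5, direction unchanged
```

With several safety metrics or a different bound, the feasible set is no longer a single ball and a general convex solver would be needed instead of this closed form.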

If this is right

  • From any safe starting policy, every iterate stays inside the safe set whenever the QCQP projection succeeds.
  • In control problems equipped with a stabilizing backup, closed-loop stability holds throughout adaptation.
  • Unsafe gradient directions are automatically rejected, so feasibility is preserved for the entire training trajectory.
  • Objective improvement remains possible while respecting the rollout safety constraints.
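
The induction behind the first bullet can be sketched on a toy problem. Everything here is an illustrative assumption rather than the paper's setup: the safety metric `g` is a cheap stand-in for a rollout evaluation (safe inside the unit disc), the objective `f` pulls the parameters toward a point outside the safe set, and the assumed constant L = 2 is deliberately larger than the true Lipschitz constant of `g`, which is 1.

```python
import math

def g(theta):
    """Toy safety metric: safe iff theta lies in the unit disc (g <= 0)."""
    return math.hypot(theta[0], theta[1]) - 1.0

def f(theta):
    """Toy objective with its minimum at (2, 0), outside the safe set."""
    return (theta[0] - 2.0) ** 2 + theta[1] ** 2

def grad_f(theta):
    return [2.0 * (theta[0] - 2.0), 2.0 * theta[1]]

def projected_update(theta, lr, lipschitz):
    """Gradient step, shrunk to stay inside the ball of radius -g(theta)/L."""
    step = [-lr * d for d in grad_f(theta)]
    radius = -g(theta) / lipschitz
    norm = math.hypot(step[0], step[1])
    scale = 1.0 if norm <= radius else radius / norm
    return [t + scale * s for t, s in zip(theta, step)]

theta = [0.0, 0.0]          # safe initialization
history = [theta]
for _ in range(25):
    theta = projected_update(theta, lr=0.1, lipschitz=2.0)
    history.append(theta)

all_safe = all(g(t) <= 0.0 for t in history)
print(all_safe, f(history[0]), f(history[-1]))
```

Each iterate is safe because the previous one was safe and the step stayed inside its ball, which is the inductive argument in miniature; the objective still decreases as the iterates approach the boundary.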

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same projection technique could be used in any gradient-based optimizer whose constraints are cheap to evaluate by sampling but expensive to differentiate.
  • A larger, more conservative smoothness constant would shrink the safe region and therefore produce more cautious steps; a smaller constant would enlarge it, and one that underestimates the true sensitivity would risk safety violations.
  • The method separates the safety mechanism from the choice of optimizer, so it can be wrapped around existing policy-gradient or actor-critic algorithms without altering their update rules.
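
How the choice of smoothness constant trades conservatism against safety can be seen in a toy 1-D case. The metric `g`, the helper `clipped_update`, and the three assumed constants are all hypothetical; the true Lipschitz constant of this `g` is 2.

```python
def g(theta):
    """Toy 1-D safety metric: safe iff g <= 0, i.e. theta <= 0.5."""
    return 2.0 * theta - 1.0

def clipped_update(theta, proposed_step, assumed_lipschitz):
    """Clip the step to the interval implied by the assumed constant."""
    radius = -g(theta) / assumed_lipschitz
    clipped = max(-radius, min(radius, proposed_step))
    return theta + clipped

results = {}
for assumed in (4.0, 2.0, 1.0):
    new_theta = clipped_update(0.0, 5.0, assumed)
    results[assumed] = (new_theta, g(new_theta))
    print(assumed, new_theta, g(new_theta))
# 4.0 0.25 -0.5   overestimate: safe, but the step is smaller than necessary
# 2.0 0.5 0.0     exact constant: the step reaches the boundary
# 1.0 1.0 1.0     underestimate: declared feasible, yet g > 0 (unsafe)
```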

Load-bearing premise

Smoothness bounds exist that accurately relate small changes in policy parameters to changes in the rollout-evaluated safety metrics without needing gradients of the constraint functions.

What would settle it

A training run in which a projected update produces a policy that violates the rollout safety constraint even though the projection was declared feasible and the smoothness bounds were applied.

Figures

Figures reproduced from arXiv: 2512.13788 by Eunhyek Joa, Francesco Borrelli, Shengfan Cao.

Figure 1
Architecture of πθ. The certified safe controller πsafe is a non-parametric baseline, and the neural network ϕθ operates as a residual to refine performance. The final linear layers (shown in red) are initialized with zero weights and biases, ensuring that the combined policy initially reproduces πsafe. The gray block indicates a repeated module, which appears six additional times. Existing work on safe ad…
Figure 3
Regression with soft constraint (λ = 1), without proposed projection. B. Double Integrator Imitation Learning. We next study safe learning on a constrained control task to empirically validate that SCPO keeps πθ stabilizing under harmful supervision from a malicious expert. Consider the discrete-time double integrator with the following dynamics and constraints: $x_{k+1} = \begin{bmatrix} 1 & \Delta t \\ 0 & 1 \end{bmatrix} x_k + \begin{bmatrix} \tfrac{1}{2}\Delta t^2 \\ \Delta t \end{bmatrix} u_k$, −…
Figure 2
Constrained regression with proposed projection. The loss in (28) is …
Figure 4
Closed-loop trajectories from an example initial state for the double …
Figure 5
Comparison between the backward reachable set to …
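
The residual architecture described in the Figure 1 caption can be sketched as follows. This is a reduced illustration under stated assumptions, not the paper's network: the backup controller `pi_safe` is a hypothetical linear state feedback with made-up gains, and the residual has a single hidden layer instead of the repeated module shown in the figure. The one property carried over from the caption is that the final linear layer starts at zero, so the combined policy initially reproduces the backup.

```python
import math
import random

def pi_safe(x):
    """Hypothetical stabilizing backup: linear state feedback u = -K x."""
    k = (1.0, 1.5)  # illustrative gains, not from the paper
    return -(k[0] * x[0] + k[1] * x[1])

def init_residual(hidden=8, seed=0):
    """Random hidden layer, zero-initialized final linear layer."""
    rng = random.Random(seed)
    return {
        "w1": [[rng.gauss(0.0, 1.0) for _ in range(2)] for _ in range(hidden)],
        "b1": [rng.gauss(0.0, 1.0) for _ in range(hidden)],
        "w2": [0.0] * hidden,  # zero weights in the final layer
        "b2": 0.0,             # zero bias in the final layer
    }

def residual(params, x):
    hidden = [math.tanh(w[0] * x[0] + w[1] * x[1] + b)
              for w, b in zip(params["w1"], params["b1"])]
    return sum(w * h for w, h in zip(params["w2"], hidden)) + params["b2"]

def policy(params, x):
    """Combined policy: backup action plus learned residual."""
    return pi_safe(x) + residual(params, x)

params = init_residual()
x = [0.3, -0.2]
print(policy(params, x) == pi_safe(x))  # True at initialization
```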
Original abstract

Safety-critical learning requires policies that improve performance without leaving the safe operating regime. We study constrained policy learning where model parameters must satisfy rollout-based safety constraints that can be evaluated but not differentiated analytically. We propose SCPO, a sampling-based weight-space projection method that enforces safety directly in parameter space without requiring gradient access to the constraint functions. SCPO constructs a local safe region by combining rollout-based safety evaluations with smoothness bounds relating parameter perturbations to changes in safety metrics, and projects each gradient update via a convex QCQP. We establish a safe-by-induction guarantee: starting from any safe initialization, all intermediate policies remain safe given feasible projections. In constrained control settings with a stabilizing backup policy, SCPO further ensures closed-loop stability while enabling safe adaptation beyond the conservative backup. Experiments on constrained regression with harmful supervision and double-integrator imitation with a malicious expert show that SCPO rejects unsafe updates, maintains feasibility throughout training, and achieves meaningful objective improvement.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes SCPO, a sampling-based weight-space projection method for constrained policy optimization in settings where safety constraints can be evaluated via rollouts but lack analytic gradients. SCPO constructs a local safe region around the current parameters by combining a rollout-based safety evaluation with an additive smoothness bound on how the safety metric changes under parameter perturbations, then solves a convex QCQP to project each gradient update into this region. The central claim is a safe-by-induction guarantee: starting from any safe initialization, feasible projections ensure all subsequent policies remain safe. The paper also claims closed-loop stability when a stabilizing backup policy is available and reports empirical results on constrained regression with harmful supervision and double-integrator imitation learning with a malicious expert, showing maintained feasibility and objective improvement.

Significance. If the smoothness bounds rigorously ensure the constructed local region is contained in the true safe set, the method would offer a practical way to enforce rollout-based safety constraints without differentiability, enabling safe adaptation beyond conservative backups in control and RL settings. The induction-based guarantee and QCQP projection could be useful for safety-critical applications where constraint gradients are unavailable.

major comments (2)
  1. [Section 3, Theorem 1] Proof of the safe-by-induction guarantee: The induction step requires that any parameter vector satisfying the QCQP (i.e., inside the local safe region defined by the rollout evaluation plus the additive smoothness term) is guaranteed to satisfy the true rollout-based safety constraint. The manuscript must explicitly state and justify the conditions under which the smoothness bound (obtained via sampling or an a-priori constant without constraint gradients) is a valid upper bound on the worst-case change in the safety metric over all possible trajectories; if finite sampling can miss adversarial trajectories or if the chosen Lipschitz-style constant is not globally valid, a feasible projection can produce an unsafe policy and break the induction.
  2. [Section 4.2, Eq. (8)–(10)] Definition of the local safe region and QCQP: The construction combines a single rollout evaluation at the current parameters with an additive term derived from the smoothness bound. It is unclear whether the paper provides a formal proof that this region is always a subset of the true safe set, or only an empirical/heuristic containment; without the former, the safety guarantee is unsupported.
minor comments (2)
  1. [Section 2] Notation for the safety metric and the QCQP objective/constraints should be introduced with explicit symbols and dimensions in the main text rather than deferred to the appendix.
  2. [Figure 2] Figure 2 (projection illustration): Add explicit labels for the local safe region boundary and the true constraint surface to clarify the relationship between the QCQP feasible set and the true safe set.
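
The finite-sampling concern in major comment 1 can be made concrete with a small sketch. The linear metric `g`, the helper `sampled_lipschitz`, and the multiplicative safety factor are illustrative assumptions, not the paper's estimator. The point is only that the largest observed change over sampled perturbations can never exceed the true constant, so with few samples it tends to underestimate it.

```python
import math
import random

A = (2.0, 0.5)                    # gradient of the toy linear metric
TRUE_LIPSCHITZ = math.hypot(*A)   # exact constant for this toy g

def g(theta):
    """Toy safety metric standing in for a rollout evaluation."""
    return A[0] * theta[0] + A[1] * theta[1] - 1.0

def sampled_lipschitz(theta, n_samples, radius, rng):
    """Largest observed |change in g| / radius over random directions."""
    g0 = g(theta)
    best = 0.0
    for _ in range(n_samples):
        angle = rng.uniform(0.0, 2.0 * math.pi)
        moved = (theta[0] + radius * math.cos(angle),
                 theta[1] + radius * math.sin(angle))
        best = max(best, abs(g(moved) - g0) / radius)
    return best

rng = random.Random(0)
few = sampled_lipschitz((0.0, 0.0), 3, 0.1, rng)
many = sampled_lipschitz((0.0, 0.0), 2000, 0.1, rng)
inflated = 1.5 * few              # hypothetical safety factor
print(few, many, inflated, TRUE_LIPSCHITZ)
```

Inflating a sampled estimate makes it more conservative but does not by itself prove it is a valid bound; that still rests on an assumption about how far the samples can be from the worst case.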

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our manuscript. We address each major comment point by point below, providing clarifications on the assumptions underlying the safety guarantees. We will incorporate revisions to make the conditions and proofs more explicit.

Point-by-point responses
  1. Referee: [Section 3, Theorem 1] Proof of the safe-by-induction guarantee: The induction step requires that any parameter vector satisfying the QCQP (i.e., inside the local safe region defined by the rollout evaluation plus the additive smoothness term) is guaranteed to satisfy the true rollout-based safety constraint. The manuscript must explicitly state and justify the conditions under which the smoothness bound (obtained via sampling or an a-priori constant without constraint gradients) is a valid upper bound on the worst-case change in the safety metric over all possible trajectories; if finite sampling can miss adversarial trajectories or if the chosen Lipschitz-style constant is not globally valid, a feasible projection can produce an unsafe policy and break the induction.

    Authors: We appreciate the referee pointing out the need for explicit conditions. Theorem 1 establishes the safe-by-induction guarantee under the assumption that the smoothness bound is a valid upper bound on the worst-case change in the safety metric. The manuscript considers two sources for this bound: an a-priori global Lipschitz constant (when available from domain knowledge) or a conservative sampling-based estimate obtained by evaluating the maximum observed change over sampled trajectories and adding a margin. We will revise the statement of Theorem 1 and the surrounding text to explicitly list these assumptions and note that finite sampling requires a sufficiently conservative estimate (e.g., via over-sampling or an additive safety factor) to ensure the bound holds. Under these conditions, any QCQP-feasible point lies inside the true safe set, preserving the induction. This clarification does not change the result but improves rigor. revision: yes

  2. Referee: [Section 4.2, Eq. (8)–(10)] Definition of the local safe region and QCQP: The construction combines a single rollout evaluation at the current parameters with an additive term derived from the smoothness bound. It is unclear whether the paper provides a formal proof that this region is always a subset of the true safe set, or only an empirical/heuristic containment; without the former, the safety guarantee is unsupported.

    Authors: The containment is formally proven rather than heuristic. The local safe region is defined so that the QCQP enforces the constraint safety(current) + smoothness_bound(delta) <= safety_threshold. By the definition of the smoothness bound, this implies that the true safety at the projected parameters cannot exceed the threshold. The proof of this containment appears in the supporting argument for Theorem 1. We will add an explicit lemma or remark in Section 4.2 that isolates this containment argument, with a direct reference to the equations and the induction proof. This makes the formal subset relationship load-bearing and transparent. revision: yes
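
The containment argument described in response 2 can be written out in a few lines. The symbols are illustrative and not the paper's notation: $g$ is the safety metric, $c$ the safety threshold, $\delta$ the parameter update, and $L$ a constant for which the smoothness bound is assumed to hold on the local region.

```latex
\begin{align*}
\text{Assumed bound:}\quad
  & g(\theta + \delta) - g(\theta) \le L\,\|\delta\|
    \quad \text{for all admissible } \delta, \\
\text{QCQP constraint:}\quad
  & g(\theta) + L\,\|\delta\| \le c
    \;\Longleftrightarrow\;
    \|\delta\|^2 \le \left(\frac{c - g(\theta)}{L}\right)^{2}
    \quad \text{when } g(\theta) \le c, \\
\text{Containment:}\quad
  & g(\theta + \delta) \le g(\theta) + L\,\|\delta\| \le c .
\end{align*}
```

The last line holds only as long as the first one does, which is why the validity of the smoothness bound carries the whole guarantee.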

Circularity Check

0 steps flagged

No circularity: safe-by-induction guarantee follows from initialization and feasible projections without reducing to fitted inputs or self-definitions

Full rationale

The paper's central derivation constructs a local safe region from rollout evaluations at current parameters plus additive smoothness bounds on parameter perturbations, then projects gradient steps via QCQP. The safe-by-induction claim is stated as holding whenever the initialization is safe and each projection remains feasible inside that constructed region. This logic does not equate any prediction to its own inputs by construction, nor does it rely on load-bearing self-citations, imported uniqueness theorems, or ansatzes smuggled from prior work. The smoothness bounds are presented as an external modeling assumption rather than a quantity fitted to the target safety metric itself. The derivation therefore remains self-contained against the stated assumptions and does not exhibit any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the existence of usable smoothness bounds between parameter changes and safety metrics plus the assumption that the resulting QCQP remains feasible at each step. No free parameters or invented entities are mentioned in the abstract.

axioms (2)
  • domain assumption: Smoothness bounds exist that relate small parameter perturbations to bounded changes in rollout-evaluated safety metrics.
    Invoked to construct the local safe region without analytic gradients.
  • domain assumption: The convex QCQP projection is feasible whenever the current policy is safe.
    Required for the safe-by-induction property to hold throughout training.

pith-pipeline@v0.9.0 · 5691 in / 1371 out tokens · 36301 ms · 2026-05-21T16:25:30.173162+00:00 · methodology

