pith. machine review for the scientific record.

arxiv: 2605.01195 · v2 · submitted 2026-05-02 · 💻 cs.RO

Recognition: no theorem link

TAIL-Safe: Task-Agnostic Safety Monitoring for Imitation Learning Policies


Pith reviewed 2026-05-11 01:42 UTC · model grok-4.3

classification 💻 cs.RO
keywords: imitation learning · safety monitoring · flow-matching policies · robot manipulation · control invariant sets · Gaussian splatting · recovery mechanisms · task-agnostic criteria

The pith

TAIL-Safe defines a safe operating region for perturbation-sensitive imitation learning policies using three short-term visual and grasp criteria.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Imitation learning policies such as flow-matching models learn complex robot tasks from demonstrations yet often fail at runtime when small perturbations cause compounding drift. TAIL-Safe addresses this by learning a Lipschitz-continuous Q-function whose zero-superlevel set marks state-action pairs from which the policy reliably completes the task. The function is trained on a Gaussian Splatting digital twin using only three task-agnostic criteria: visibility, recognizability, and graspability. When the nominal policy proposes an action outside this set, a gradient-ascent recovery step steers the trajectory back inside the set. Experiments on a Franka Emika robot show that the guided policies complete tasks consistently where the unguided versions fail.

Core claim

The zero-superlevel set of a Q-function trained to predict long-term success from short-term visibility, recognizability, and graspability criteria forms an empirical control invariant set for a trained imitation learning policy. When the policy's proposed action leaves this set, a recovery action obtained by gradient ascent on the Q-function returns the system to the set, in keeping with Nagumo's theorem. This mechanism is learned entirely in a high-fidelity digital twin and then transferred to the physical robot, enabling flow-matching policies to maintain task success under runtime perturbations.
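In symbols (a minimal formalization supplied by this review, not quoted from the paper; the normalized-step form of the recovery comes from the paper's Proposition 1):

```latex
% Empirical safe set: zero-superlevel set of the learned Q-function
S = \{\, (s, a) : Q(s, a) \ge 0 \,\}

% Recovery when the nominal action exits S: normalized gradient ascent
% with step size \eta, repeated until Q(s, a) \ge 0 again
a \leftarrow a + \eta \, \frac{\nabla_a Q(s, a)}{\lVert \nabla_a Q(s, a) \rVert_2}
```

Nagumo's theorem says a closed set is invariant exactly when the dynamics never point strictly outward along its boundary; ascending Q is the learned surrogate for pointing back inward.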

What carries the argument

A Lipschitz-continuous Q-value function that scores state-action pairs by predicted long-term task success using the three short-term criteria; its zero-superlevel set serves as the empirical safe set, and gradient ascent on the function supplies the recovery actions.
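A minimal sketch of that machinery, assuming a differentiable Q-network. `QNet`, its layer sizes, and the flat state input are placeholders for the paper's Q-ValueNet, which consumes visual and proprioceptive inputs abstracted away here; the normalized step Δa = η∇aQ/∥∇aQ∥ with η = 0.05 and a few iterations at 20 Hz follows the paper's Proposition 1 and Figure 5 caption. Spectral normalization [18] is cited by the paper and is one standard way to impose the Lipschitz constraint, but its use here is an assumption.

```python
import torch
import torch.nn as nn
from torch.nn.utils.parametrizations import spectral_norm

class QNet(nn.Module):
    """Hypothetical stand-in for the paper's Q-ValueNet. Spectral
    normalization bounds each layer's Lipschitz constant, so the
    composed network is Lipschitz-continuous in (s, a)."""
    def __init__(self, state_dim: int, action_dim: int, hidden: int = 256):
        super().__init__()
        self.net = nn.Sequential(
            spectral_norm(nn.Linear(state_dim + action_dim, hidden)), nn.ReLU(),
            spectral_norm(nn.Linear(hidden, hidden)), nn.ReLU(),
            spectral_norm(nn.Linear(hidden, 1)),
        )

    def forward(self, s: torch.Tensor, a: torch.Tensor) -> torch.Tensor:
        return self.net(torch.cat([s, a], dim=-1)).squeeze(-1)

def recover(q: QNet, s: torch.Tensor, a: torch.Tensor,
            eta: float = 0.05, max_iters: int = 5) -> torch.Tensor:
    """Normalized gradient ascent on Q until the action re-enters the
    zero-superlevel set {Q >= 0}. eta = 0.05 and ~3-5 iterations are the
    values the paper reports; max_iters caps latency for 20 Hz control."""
    a = a.detach().clone()
    for _ in range(max_iters):
        a.requires_grad_(True)
        q_val = q(s, a)
        if q_val.item() >= 0.0:          # already inside the safe set
            break
        (grad,) = torch.autograd.grad(q_val, a)
        with torch.no_grad():
            a = a + eta * grad / (grad.norm() + 1e-8)  # step of length eta
    return a.detach()
```

Normalizing the gradient makes every recovery step exactly length η, which is what lets the paper bound the update (Proposition 1) rather than trusting raw gradient magnitudes.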

If this is right

  • Flow-matching policies that previously failed under runtime perturbations now achieve consistent task success when their actions are filtered by the learned safe set.
  • The safety monitor operates without retraining the original imitation policy and uses only task-agnostic short-term observations.
  • High-fidelity digital twins built with Gaussian Splatting allow systematic collection of failure data without risking hardware.
  • The recovery mechanism keeps trajectories inside the control-invariant set by local gradient steps rather than global replanning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same short-term criteria and recovery logic could be applied to other sensitive imitation methods such as diffusion policies without task-specific redesign.
  • If the digital-twin to real transfer holds, the approach offers a route to safer deployment of imitation policies in unstructured settings where perturbations are common.
  • Separating the safety layer from the task policy may allow independent improvement or replacement of either component over time.

Load-bearing premise

The Q-function trained inside the digital twin transfers to the physical robot and the three short-term criteria suffice to predict whether the policy will complete the task from a given state-action pair.

What would settle it

Run the same flow-matching policy with and without TAIL-Safe on the physical robot under identical perturbations; if success rates remain near zero even with TAIL-Safe, or if real-robot performance collapses despite accurate digital-twin predictions, the central claim is refuted.
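A sketch of that protocol as paired trials; the function names and trial count are hypothetical scaffolding, while the perturbation ranges (±5 cm, ±30°) are the ones the paper reports.

```python
import random

def paired_trials(run_episode, n_trials: int = 50, seed: int = 0):
    """Run the same policy with and without the safety monitor under
    identical perturbations. run_episode(perturbation, guarded) -> bool
    is assumed to execute one rollout and report task success."""
    rng = random.Random(seed)
    results = {"unguarded": 0, "guarded": 0}
    for _ in range(n_trials):
        # Identical perturbation for both arms of the pair
        # (paper perturbs object pose by up to +/-5 cm and +/-30 deg).
        perturbation = (rng.uniform(-0.05, 0.05),   # x offset, meters
                        rng.uniform(-0.05, 0.05),   # y offset, meters
                        rng.uniform(-30.0, 30.0))   # yaw offset, degrees
        results["unguarded"] += run_episode(perturbation, guarded=False)
        results["guarded"] += run_episode(perturbation, guarded=True)
    return {k: v / n_trials for k, v in results.items()}
```

Pairing the perturbations removes draw-to-draw variance, so any gap between the two success rates is attributable to the monitor rather than to luck in sampling.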

Figures

Figures reproduced from arXiv: 2605.01195 by Momotaz Begum, Riad Ahmed.

Figure 1. Overview of TAIL-Safe. (Top-left) A Gaussian Splatting pipeline constructs a digital twin (∼20 min: 5 min capture + 15 min reconstruction). (Top-right) The simulator generates safe/unsafe trajectories under perturbations for training WeightNet (score fusion) and Q-ValueNet (success prediction). (Middle) At deployment, TAIL-Safe monitors Q(s, a) in real time, remaining inactive while Q(s, a) > 0. (Bottom) W… view at source ↗
Figure 2. Q-Function Landscape. Multi-view 2D projections show the bounded safe set (green, Q ≥ 0) in action space. White arrows indicate ∇aQ pointing inward, enabling gradient-based recovery. (Panels: energy landscape at steps 0, 10, 20, 30.) view at source ↗
Figure 3. Q-Function as Bounded Hill. The Q-function peaks at expert actions and decays smoothly, ensuring ∇aQ points toward safe configurations for gradient-based recovery. view at source ↗
Figure 4. TAIL-Safe Recovery During Teleoperation. (Left) Safe actions with Q > 0. (Middle) Teleoperator pushes the robot out of the safe set. (Right) TAIL-Safe locks the device and activates recovery, steering back to safety. Bottom plots show the Q-value trajectory recovering to positive values. (Tasks: Pick Red Candy; Pick and Place Red Candy.) view at source ↗
Figure 5. Experimental Tasks. Real robot setup for (left) Candy Picking and (right) Pick-and-Place under object perturbations (±5 cm, ±30°). view at source ↗
Figure 6. Failure Modes in Vanilla Flow-Matching Policy. Ensemble disagreement is near chance (AUROC 0.525): when an in-distribution policy fails, the seeds tend to agree on the same wrong action, so action variance does not flag the failure; using the ensemble mean as a recovery signal is actively harmful (ΔQ = −0.86) and inference is roughly 5× slower. view at source ↗
Figure 7. Trajectory Distribution. (a) Safe initializations produce clustered paths reaching the target. (b) Unsafe initializations produce scattered, erratic paths. (Annotations contrast a without-recovery rollout that exits the safe set and fails with a with-recovery rollout that is corrected and succeeds.) view at source ↗
Figure 8. Safe Set and Q-Value Propagation. (a) Without TAIL-Safe: the Q-value remains negative and the task fails. (b) With TAIL-Safe: the recovery controller steers the robot back to safety upon detecting Q < 0. view at source ↗
Figure 9. Perturbations Beyond TAIL-Safe's Capability. (a) Extreme proprioceptive deviation: the robot configuration falls outside the training distribution. (b) Extreme object displacement: objects placed far from nominal regions cannot be recognized. view at source ↗
Figure 10. Effect of Imposing Lipschitz and Energy Shaping. Left: no constraints, flat Q-landscape. Middle: +Lipschitz, slightly improved. Right: +energy shaping, bounded hill with strong gradients at the safe-set boundary. view at source ↗
Figure 11. 2D Energy Landscapes Along Trajectories. Two representative trajectories from the evaluation set. Left: spatial paths in the X-Y plane. Right: Q-function contours at Step 0 and Step 10 showing action perturbations Δa0, Δa1 (cm) around the expert action (⋆). Green/yellow: Q > 0 (safe); red: Q < 0 (unsafe). The bounded "hill" around expert actions supports the control-invariant formulation. view at source ↗
Figure 12. WeightNet Learned Weights Across Trajectory Phases. Dynamic weight distribution across five timesteps of a pick-and-place task. Top: wrist-camera images showing the approach to grasp. Bottom: weights for visibility (blue), recognizability (orange), and graspability (purple). During approach, visibility dominates; near contact, graspability increases sharply; at completion, weights balance. view at source ↗

Body text recovered from the figure extractions:
  • Near Figure 3: Q(s, a) ≥ 0 marks actions from which maintaining all task criteria until task completion is likely achievable, while Q(s, a) < 0 signals likely failure; Appendix VII-C details Q-function training.
  • Near Figure 5: η = 0.05 was set by grid search; convergence typically requires 3–5 iterations (mean 2.3), enabling 20 Hz operation (hyperparameter sensitivity in Appendix VII-F). Proposition 1 (Bounded Recovery Step) bounds the recovery update Δa = η∇aQ/∥∇aQ∥ under the Lipschitz constraint L_Q.
  • Near Figure 8: the learned-CBF baseline detects unsafe states well (AUROC 0.987), but 97.4% of its gradients in the unsafe region are near zero (< 10⁻²), so gradient-ascent recovery succeeds only 6.9% of the time when judged against an oracle Q.
  • Near Figure 10, Table V (Q-Function Calibration Metrics): AUROC 99.3% (excellent discrimination), AUPRC 99.7% (high precision-recall), false safe rate 0.84% (critical safety metric), false unsafe ra… (truncated at source).
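Figure 12 suggests a compact reading of the score fusion: WeightNet emits state-dependent convex weights over the three criterion scores. A minimal sketch under that assumption (the softmax head, feature input, and layer sizes are guesses, not the paper's architecture):

```python
import torch
import torch.nn as nn

class WeightNet(nn.Module):
    """Hypothetical reading of the paper's WeightNet: map a state
    feature to softmax weights over (visibility, recognizability,
    graspability), then fuse the three scores into one safety score."""
    def __init__(self, feat_dim: int, hidden: int = 128):
        super().__init__()
        self.head = nn.Sequential(
            nn.Linear(feat_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, 3),
        )

    def forward(self, feat: torch.Tensor, scores: torch.Tensor) -> torch.Tensor:
        # scores: (..., 3) = [s_fov, s_rec, s_grasp]
        w = torch.softmax(self.head(feat), dim=-1)  # convex weights, sum to 1
        return (w * scores).sum(dim=-1)             # fused safety score
```

The softmax keeps the fusion a convex combination, consistent with the figure's weights shifting across task phases while remaining a weighted average of the three criteria.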
Original abstract

Recent imitation learning (IL) algorithms such as flow-matching and diffusion policies demonstrate remarkable performance in learning complex manipulation tasks. However, these policies often fail even when operating within their training distribution due to extreme sensitivity to initial conditions and irreducible approximation errors that lead to compounding drift. This makes it unsafe to deploy IL policies in the field where out-of-distribution scenarios are prevalent. A prerequisite for safe deployment is enabling the policy to determine whether it can execute a task the way it was learned from demonstrations. This paper presents TAIL-Safe, a principled approach to identify, for a trained IL policy, a safe set from where the policy empirically succeeds in completing the learned task. We propose a Lipschitz-continuous Q-value function that maps state-action pairs to a long-term safety score based on three short-term task-agnostic criteria: visibility, recognizability, and graspability. The zero-superlevel set of this function characterizes an empirical control invariant set over state-action pairs. When the nominal policy proposes an action outside this set, we apply a recovery mechanism inspired by Nagumo's theorem that uses gradient ascent to the Q-function to steer the policy back to safety. To learn this Q-function, we construct a high-fidelity digital twin using Gaussian Splatting that enables systematic collection of failure data without risk to physical hardware. Experiments with a Franka Emika robot demonstrate that flow-matching policies, which fail under run-time perturbations, achieve consistent task success when guided by the proposed TAIL-Safe.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper introduces TAIL-Safe, a task-agnostic safety monitor for imitation learning policies (e.g., flow-matching) that learns a Lipschitz-continuous Q-function from three short-term criteria (visibility, recognizability, graspability) collected in a Gaussian Splatting digital twin. The zero-superlevel set of this Q is treated as an empirical control invariant; when the nominal policy exits this set, a Nagumo-inspired gradient-ascent recovery steers actions back inside. The abstract asserts that this guidance enables flow-matching policies to achieve consistent task success on a Franka Emika robot under run-time perturbations where the unguided policy fails.

Significance. If the transfer of the learned Q from the digital twin to the physical robot holds and the three short-term criteria reliably predict long-term task success, the approach would offer a practical, policy-agnostic layer for safe deployment of sensitive IL policies in unstructured environments. The use of a high-fidelity Gaussian Splatting twin for systematic, risk-free failure-data collection is a concrete strength that could be adopted more broadly.

major comments (3)
  1. [Abstract] Abstract: the central claim that 'flow-matching policies... achieve consistent task success when guided by the proposed TAIL-Safe' is stated without any quantitative metrics (success rates, number of trials, perturbation magnitudes, or baseline comparisons). This absence directly prevents assessment of whether the recovery mechanism actually enlarges the region of reliable execution.
  2. [Method and Experiments] Q-function training and transfer (implied in the method and experiments description): the manuscript asserts that the Lipschitz Q trained on twin-generated labels transfers to the physical Franka without providing any domain-gap quantification, real-world Q-value correlation with observed outcomes, or ablation on the three criteria. Because the recovery step relies on the zero-superlevel set being a reliable empirical control invariant, this unverified transfer is load-bearing for the safety guarantee.
  3. [Method] Sufficiency of short-term criteria: no analysis is supplied showing that visibility/recognizability/graspability labels collected in the twin correlate with long-term task completion under perturbation; without such evidence or an ablation removing one criterion, it remains unclear whether the Q-function actually captures compounding-error dynamics.
minor comments (1)
  1. [Abstract] The abstract refers to 'a principled approach' yet the safety set is defined empirically from a learned Q; a brief clarification of what 'principled' denotes (e.g., Lipschitz continuity or Nagumo inspiration) would improve precision.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback, which helps us clarify the contributions and strengthen the presentation of TAIL-Safe. We address each major comment below and commit to revisions that directly respond to the concerns raised.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the central claim that 'flow-matching policies... achieve consistent task success when guided by the proposed TAIL-Safe' is stated without any quantitative metrics (success rates, number of trials, perturbation magnitudes, or baseline comparisons). This absence directly prevents assessment of whether the recovery mechanism actually enlarges the region of reliable execution.

    Authors: We agree that the abstract should include quantitative support for the central claim to enable immediate assessment. The experiments section reports success rates, trial counts, perturbation magnitudes, and baseline comparisons, but these were not condensed into the abstract. In the revised manuscript we will update the abstract to explicitly state key metrics (e.g., success rate improvement, number of trials, and perturbation ranges) while preserving the word limit. revision: yes

  2. Referee: [Method and Experiments] Q-function training and transfer (implied in the method and experiments description): the manuscript asserts that the Lipschitz Q trained on twin-generated labels transfers to the physical Franka without providing any domain-gap quantification, real-world Q-value correlation with observed outcomes, or ablation on the three criteria. Because the recovery step relies on the zero-superlevel set being a reliable empirical control invariant, this unverified transfer is load-bearing for the safety guarantee.

    Authors: We acknowledge that explicit domain-gap quantification, Q-value correlation with real-world outcomes, and criterion ablation are currently absent and would strengthen the transfer claim. The real-robot experiments demonstrate that the transferred Q enables recovery where the nominal policy fails, providing empirical evidence of transfer. In the revision we will add (i) a quantitative domain-gap analysis between twin and real observations, (ii) correlation plots of predicted Q-values against observed success/failure, and (iii) an ablation on the three criteria, all placed in a new subsection of the experiments. revision: yes

  3. Referee: [Method] Sufficiency of short-term criteria: no analysis is supplied showing that visibility/recognizability/graspability labels collected in the twin correlate with long-term task completion under perturbation; without such evidence or an ablation removing one criterion, it remains unclear whether the Q-function actually captures compounding-error dynamics.

    Authors: The three criteria were chosen as task-agnostic, short-horizon proxies for common manipulation failure modes that compound over time. We agree that direct evidence of their correlation with long-term success and an ablation study are needed to confirm they capture compounding-error dynamics. In the revised paper we will add a correlation analysis between the collected labels and long-term task completion under perturbation, together with an ablation that removes each criterion in turn and reports the resulting change in Q-function predictive accuracy and recovery performance. revision: yes
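The promised transfer and correlation analysis reduces to standard discrimination metrics; a minimal sketch, assuming per-trial logs of the predicted Q-value and the observed binary outcome (function and field names here are illustrative, not the authors'). AUROC and false-safe rate are the calibration numbers the extracted Table V already reports for the twin.

```python
from sklearn.metrics import roc_auc_score
from scipy.stats import pointbiserialr

def transfer_report(q_values, successes):
    """q_values: predicted Q(s, a) at a fixed checkpoint of each real
    rollout; successes: 0/1 task outcomes. Returns the correlation and
    discrimination numbers the rebuttal commits to adding."""
    auroc = roc_auc_score(successes, q_values)    # Q as a success score
    r, p = pointbiserialr(successes, q_values)    # binary-continuous corr.
    false_safe = sum(1 for q, y in zip(q_values, successes)
                     if q >= 0 and y == 0) / max(1, len(q_values))
    return {"auroc": auroc, "pointbiserial_r": r, "p_value": p,
            "false_safe_rate": false_safe}
```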

Circularity Check

0 steps flagged

Safety score learned from independent short-term criteria; no reduction of central claim to fitted inputs by construction

full rationale

The derivation defines the Q-function as a learned mapping from state-action pairs to a safety score using labels derived from three explicit short-term criteria (visibility, recognizability, graspability) collected in a separate Gaussian Splatting digital twin. The zero-superlevel set and gradient-ascent recovery are then applied to the nominal flow-matching policy, but the policy itself is not used to define or fit the Q-function. Experiments on the physical Franka provide external validation of task success rather than deriving the success metric from the policy's own outputs. No self-citation chains, ansatz smuggling, or renaming of known results appear in the provided derivation steps; the approach remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 1 invented entity

The central claim rests on transfer from simulation to reality and on the three criteria being adequate proxies for long-term success; these are not independently verified in the abstract.

free parameters (1)
  • Q-function neural network parameters
    Weights of the Q-network are fitted to data collected in the digital twin.
axioms (2)
  • domain assumption Nagumo's theorem applies to the gradient-ascent recovery step for maintaining control invariance
    Recovery mechanism is explicitly inspired by Nagumo's theorem.
  • domain assumption The three short-term criteria suffice to determine whether the policy will succeed long-term
    Used to construct the Q-value function that defines the safe set.
invented entities (1)
  • Empirical control invariant set as zero-superlevel set of the learned Q-function (no independent evidence)
    purpose: Characterizes states from which the IL policy empirically succeeds
    Core definition of the safe operating region.

pith-pipeline@v0.9.0 · 5569 in / 1405 out tokens · 48059 ms · 2026-05-11T01:42:26.427338+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

  1. [1] Aaron D Ames, Xiangru Xu, Jessy W Grizzle, and Paulo Tabuada. Control barrier function based quadratic programs for safety critical systems. IEEE Transactions on Automatic Control, 62(8):3861–3876, 2017.
  2. [2] Somil Bansal, Mo Chen, Sylvia Herbert, and Claire J Tomlin. Hamilton-Jacobi reachability: A brief overview and recent advances. In 2017 IEEE 56th Annual Conference on Decision and Control (CDC), pages 2242–2253. IEEE, 2017.
  3. [3] Kevin Black, Noah Brown, Danny Driess, Adnan Esber, Michael Suber, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control. arXiv preprint arXiv:2410.24164, 2024.
  4. [4] Franco Blanchini. Set invariance in control. Automatica, 35(11):1747–1767, 1999.
  5. [5] Franco Blanchini and Stefano Miani. Set-Theoretic Methods in Control. Springer, 2008.
  6. [6] Cheng Chi, Siyuan Feng, Yilun Du, Zhenjia Xu, Eric Cousineau, Benjamin Burchfiel, and Shuran Song. Diffusion policy: Visuomotor policy learning via action diffusion. In Proceedings of Robotics: Science and Systems (RSS), 2023.
  7. [7] Felipe Codevilla, Eder Santana, Antonio M. López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. IEEE International Conference on Computer Vision (ICCV), pages 9329–9338, 2019.
  8. [8] Charles Dawson, Sicun Gao, and Chuchu Fan. Safe control with learned certificates: A survey of neural Lyapunov, barrier, and contraction methods for robotics and control. IEEE Transactions on Robotics, 39(3):1749–1767, 2023.
  9. [9] Jaime F Fisac, Mo Chen, Claire J Tomlin, and S Shankar Sastry. Reach-avoid problems with time-varying dynamics, targets and constraints. In International Conference on Hybrid Systems: Computation and Control, pages 11– (truncated at source).
  10. [10] Ryan Hoque, Ashwin Balakrishna, Ellen Novoseller, Albert Wilcox, Daniel S Brown, and Ken Goldberg. ThriftyDAgger: Budget-aware novelty and risk gating for interactive imitation learning. In Conference on Robot Learning, pages 598–608. PMLR, 2021.
  11. [11] Ryan Hoque, Ashwin Balakrishna, Carl Putterman, Michael Luo, Daniel S Brown, Daniel Seita, Brijen Thananjeyan, Ellen Novoseller, and Ken Goldberg. LazyDAgger: Reducing context switching in interactive imitation learning. In IEEE International Conference on Automation Science and Engineering (CASE), pages 502–509. IEEE, 2021.
  12. [12] Kai-Chieh Hsu, Duy Nguyen, and Jaime F Fisac. ISAACS: Iterative soft adversarial actor-critic for safety. arXiv preprint arXiv:2212.03228, 2023.
  13. [13] Michael Kelly, Chelsea Sidrane, Katherine Driggs-Campbell, and Mykel J Kochenderfer. HG-DAgger: Interactive imitation learning with human experts. In IEEE International Conference on Robotics and Automation (ICRA), pages 8077–8083. IEEE, 2019.
  14. [14] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. In ACM Transactions on Graphics, volume 42, 2023.
  15. [15] Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Advances in Neural Information Processing Systems, volume 30, 2017.
  16. [16] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matthew Le. Flow matching for generative modeling. In International Conference on Learning Representations, 2023.
  17. [17] Zuxin Liu, Zhepeng Cen, Vladislav Isenber, Wei Liu, Zhiwei Steven Wu, Bo Li, and Ding Zhao. Constrained variational policy optimization for safe reinforcement learning. In International Conference on Machine Learning, pages 13644–13668. PMLR, 2022.
  18. [18] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.
  19. [19] Mitio Nagumo. Über die Lage der Integralkurven gewöhnlicher Differentialgleichungen. Proceedings of the Physico-Mathematical Society of Japan, 24:551–559, 1942.
  20. [20] Kensuke Nakamura, Lasse Peters, and Andrea Bajcsy. Generalizing safety beyond collision-avoidance via latent-space reachability analysis. In Proceedings of Robotics: Science and Systems (RSS), 2025.
  21. [21] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
  22. [22] Nikhila Ravi, Valentin Gabeur, Yuan-Ting Hu, Ronghang Hu, Chaitanya Ryali, Tengyu Ma, Haitham Khedr, Roman Rädle, Chloe Rolber, Laura Gustafson, et al. SAM 2: Segment anything in images and videos. arXiv preprint arXiv:2408.00714, 2024.
  23. [23] Alexander Robey, Haimin Hu, Lars Lindemann, Hanwen Zhang, Dimos V Dimarogonas, Stephen Tu, and Nikolai Matni. Learning control barrier functions from expert demonstrations. In IEEE Conference on Decision and Control (CDC), pages 3717–3724. IEEE, 2020.
  24. [24] Stéphane Ross, Geoffrey Gordon, and Drew Bagnell. A reduction of imitation learning and structured prediction to no-regret online learning. In International Conference on Artificial Intelligence and Statistics, pages 627–635. JMLR, 2011.
  25. [25] Oswin So and Chuchu Fan. Solving stabilize-avoid optimal control via epigraph form and deep reinforcement learning. In Robotics: Science and Systems, 2023.
  26. [26] Andreas Ten Pas, Marcus Gualtieri, Kate Saenko, and Robert Platt. Grasp pose detection in point clouds. In The International Journal of Robotics Research, volume 36, pages 1455–1473, 2017.
  27. [27] Wei Xiao, Tsun-Hsuan Wang, Ramin Hasani, Mathias Lechner, Alexander Amini, and Daniela Rus. BarrierNet: Differentiable control barrier functions for learning of safe robot control. In IEEE Transactions on Robotics, volume 39, pages 2289–2307. IEEE, 2023.
  28. [28] Yanjie Ze, Gu Yan, Yunshuang Wu, Annabella Macaluso, Yuying Ge, Jianglong Ye, Nicklas Hansen, Li Erran Li, and Xiaolong Wang. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. arXiv preprint arXiv:2403.03954, 2024.
  29. [29] Yanjie Ze, Gu Yan, Yuping Wu, Jianyu Xu, Qinwen Lu, Qiuyuan Chen, Shuo Li, Yi Ma, Deepak Pathak, and Adam Kortylewski. 3D diffusion policy: Generalizable visuomotor policy learning via simple 3D representations. In Robotics: Science and Systems (RSS), 2024.
  30. [30] Tony Z Zhao, Vikash Kumar, Sergey Levine, and Chelsea Finn. Learning fine-grained bimanual manipulation with low-cost hardware. arXiv preprint arXiv:2304.13705, 2023.

Internal anchors (appendix text extracted into the reference list):

  31. [30–31] Appendix A, Proof of Proposition 1. The first claim follows directly from the normalization: ∥Δa∥₂ = η∥∇aQ∥₂ / ∥∇aQ∥₂ = η. For the second claim, by the Lipschitz continuity of Q and substituting Δa = η∇aQ/∥∇aQ∥₂: Q(s, a+Δa) − Q(s, a) ≥ ∇aQᵀΔa − L_Q η²/2 = η∥∇aQ∥₂ − L_Q η²/2. When ∇aQ ≠ 0, let g = ∥∇aQ∥₂ > 0. For step sizes η < 2g/L_Q, the improvement is positive. Setting c = g − L_Q η/2 > 0 for sufficiently small η completes the proof.
  32. [32] Appendix B, Visibility Score s_fov. Ensures the target object remains within the sensor's field of view throughout execution: the object's position is projected into the camera frame and a geometric score is computed from the density of visible points and their distance from the image center. This prevents the robot from moving the object into blind spots wh… (truncated at source).
  33. [33] Appendix B, Recognizability Score s_rec. Evaluates how well the current visual observation aligns with the training distribution. Rather than training a separate out-of-distribution detector, feature embeddings are extracted directly from the pre-trained policy's visual encoder; specifically, the flow-matching policy's internal visual backbone is used to extr… (truncated at source).
  34. [34] Appendix B, Graspability Score s_grasp. Evaluates the geometric quality of potential contact with the target object: semantic segmentation with SAM2 [22] isolates the object's point cloud, and antipodal grasp candidates are sampled using established grasp quality metrics [26]. The score reflects the alignment between the current end-effector pose an… (truncated at source).
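The recovered proof of Proposition 1 invites a quick numerical sanity check. The toy quadratic Q below is ours, not the paper's; it just needs a gradient whose Lipschitz modulus is at most L_Q for the descent-lemma bound in the proof to apply.

```python
import numpy as np

# Toy concave Q peaking at the "expert action" a = 0. Its gradient's true
# Lipschitz modulus is 2 (Hessian = -2I); L_Q = 4 is a looser valid constant,
# so the proposition's bound should hold with slack.
L_Q = 4.0
Q = lambda a: -np.dot(a, a)
grad_Q = lambda a: -2.0 * a

a = np.array([1.0, -0.5])              # start away from the peak
eta = 0.05                             # paper's reported step size
g = np.linalg.norm(grad_Q(a))
assert eta < 2 * g / L_Q               # Proposition 1's step-size condition

step = eta * grad_Q(a) / g             # normalized ascent step
assert np.isclose(np.linalg.norm(step), eta)     # first claim: ||Δa|| = η
improvement = Q(a + step) - Q(a)
bound = eta * g - L_Q * eta**2 / 2
assert improvement >= bound            # second claim: positive improvement
print(f"improvement {improvement:.4f} >= bound {bound:.4f}")
```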