ScenePilot: Controllable Boundary-Driven Critical Scenario Generation for Autonomous Driving

Cheng-zhong Xu; He Li; Qiyu Ruan; Yuxuan Wang; Zhenning Li

arxiv: 2605.21168 · v3 · pith:TE3P3BVKnew · submitted 2026-05-20 · 💻 cs.AI

Pith reviewed 2026-05-21 04:21 UTC · model grok-4.3

classification 💻 cs.AI

keywords critical scenario generation · autonomous driving · safety validation · reinforcement learning · physical feasibility · boundary-driven generation · adversarial testing · SafeBench

The pith

ScenePilot generates scenarios at the physical feasibility boundary to expose autonomous vehicle failures more reliably than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to show that safety-critical scenario generation can be improved by explicitly targeting the boundary band: trajectories that obey vehicle-road physical limits yet still cause a deployed AV planner to crash. It does so by casting generation as constrained multi-objective reinforcement learning that balances an RSS-derived physical-feasibility score against an online-learned risk predictor, kept inside the target band by step-level shielding. A reader should care because existing generators either produce physically impossible crashes that waste evaluation effort or stay too far inside safe regions, leaving real edge cases untested. If the approach works, downstream adversarial fine-tuning on the generated scenarios can measurably lower real crash rates for the tested planners.

Core claim

ScenePilot formulates generation as constrained multi-objective reinforcement learning that combines an RSS-derived physical-feasibility score σ with an online-learned AV-risk predictor Φ and applies step-level feasibility-aware shielding so that produced trajectories remain inside the boundary band—physically solvable in principle yet capable of inducing failures in the deployed autonomy stack.

What carries the argument

The boundary band, the set of trajectories that satisfy vehicle-road physical constraints yet still cause the target AV stack to fail, maintained by a constrained multi-objective RL objective that trades off the RSS-derived feasibility score σ against the learned risk predictor Φ under step-level shielding.
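One way to picture step-level shielding is as a filter applied to the adversary's candidate actions at every step. The sketch below is an illustration under stated assumptions, not the paper's implementation: `shielded_action`, `sigma_min`, and the toy `sigma` and `phi` functions are hypothetical stand-ins for the feasibility score σ and the risk predictor Φ, and the fallback rule is a guess at reasonable behavior when no candidate clears the threshold.

```python
from typing import Callable, Sequence, TypeVar

A = TypeVar("A")


def shielded_action(
    candidates: Sequence[A],
    feasibility: Callable[[A], float],  # stand-in for sigma; higher means more feasible
    risk: Callable[[A], float],         # stand-in for Phi; higher means riskier for the AV
    sigma_min: float,                   # hypothetical feasibility threshold
) -> A:
    """Pick the riskiest candidate that still clears the feasibility threshold."""
    if not candidates:
        raise ValueError("need at least one candidate action")
    feasible = [a for a in candidates if feasibility(a) >= sigma_min]
    if not feasible:
        # Assumed fallback: nothing clears the threshold, so take the most feasible action.
        return max(candidates, key=feasibility)
    return max(feasible, key=risk)


# Toy example: actions are adversary decelerations in m/s^2 (made-up numbers).
actions = [0.0, 2.0, 4.0, 8.0]
sigma = lambda a: 1.0 - a / 8.0  # harder braking leaves the ego less room to respond
phi = lambda a: a / 8.0          # harder braking raises predicted AV risk
chosen = shielded_action(actions, sigma, phi, sigma_min=0.4)
print(chosen)
```

In this toy setup the 8.0 m/s^2 action is rejected as infeasible and the shield settles on the most aggressive action that remains, which is the qualitative behavior the boundary band asks for.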

If this is right

Evaluations on SafeBench with multiple planners produce collision rates 6.2 percentage points higher than prior methods while physical validity is preserved.
Adversarial fine-tuning of the tested planners on the generated boundary-band scenarios reduces their crash rates in subsequent testing.
The same generation pipeline can be applied to different autonomy stacks without changing the core feasibility-plus-risk formulation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the boundary-band property transfers across simulation environments, the method could serve as a standardized stress-test suite for regulatory AV safety assessment.
Extending the shielding mechanism to include additional kinematic constraints could further tighten the generated scenarios around controller-agnostic failure modes.
The online-learned risk predictor could be reused as a cheap surrogate for expensive closed-loop simulation during early-stage AV development.

Load-bearing premise

The combination of the learned AV-risk predictor, the RSS feasibility score, and step-level shielding is enough to keep generated trajectories inside the intended boundary band without controller-specific artifacts or later filtering.

What would settle it

Run the generated scenarios on a planner never seen during generation and measure whether collision rates stay at least 6 percentage points above baselines while the fraction of physically invalid trajectories remains near zero.
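The check described above reduces to two numbers per planner: the collision-rate lift in percentage points and the fraction of physically invalid trajectories. The following is a minimal sketch of that bookkeeping, assuming each rollout has already been labeled; the `Rollout` record, the 1% ceiling used for "near zero", and the example counts are all hypothetical and are not results from the paper.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass
class Rollout:
    collided: bool          # did the ego planner crash in this scenario?
    physically_valid: bool  # did the scenario respect vehicle-road limits?


def collision_rate(rollouts: Sequence[Rollout]) -> float:
    if not rollouts:
        raise ValueError("no rollouts to evaluate")
    return sum(r.collided for r in rollouts) / len(rollouts)


def invalid_fraction(rollouts: Sequence[Rollout]) -> float:
    if not rollouts:
        raise ValueError("no rollouts to evaluate")
    return sum(not r.physically_valid for r in rollouts) / len(rollouts)


def lift_pp(generated: Sequence[Rollout], baseline: Sequence[Rollout]) -> float:
    """Collision-rate difference in percentage points (generated minus baseline)."""
    return 100.0 * (collision_rate(generated) - collision_rate(baseline))


def passes_check(generated, baseline, min_lift_pp=6.0, max_invalid=0.01) -> bool:
    # max_invalid is an assumed reading of "near zero"; the source gives no number.
    return (lift_pp(generated, baseline) >= min_lift_pp
            and invalid_fraction(generated) <= max_invalid)


# Placeholder data: 100 rollouts each, 40 vs. 30 collisions, all physically valid.
generated = [Rollout(collided=i < 40, physically_valid=True) for i in range(100)]
baseline = [Rollout(collided=i < 30, physically_valid=True) for i in range(100)]
lift = lift_pp(generated, baseline)
print(round(lift, 2), invalid_fraction(generated), passes_check(generated, baseline))
```

Running the same function on a planner that was held out during generation is what would distinguish a transferable boundary band from one fitted to the training-time controller.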

Figures

Figures reproduced from arXiv: 2605.21168 by Cheng-zhong Xu, He Li, Qiyu Ruan, Yuxuan Wang, Zhenning Li.

**Figure 1.** Illustration of four interaction regimes relative to AV controller and physical feasibility.

**Figure 2.** Overview of our ScenePilot framework. We characterize each rollout with an AV risk signal and a physics feasibility signal, and train a scenario policy to produce scenarios concentrated on the physically feasible yet AV policy-infeasible boundary band.

**Figure 3.** Visualization of a near-boundary scenario generated by ScenePilot.

**Figure 4.** Quantitative characterization of the AV–physics gap between ScenePilot and ChatScene. (a) Physically invalid frame rate under different AV-risk thresholds. (b) Coverage ratio of ScenePilot to ChatScene in the AV-risk–physical-feasibility space.

**Figure 5.** Aggregated value loss during ScenePilot training.

**Figure 6.** Aggregated policy loss during ScenePilot training.

Original abstract

Safety-critical scenarios are central to evaluating autonomous driving systems, yet their rarity in naturalistic logs makes simulation-based stress testing indispensable. Most scenario generation methods treat surrounding agents as adversaries, but they either (i) induce failures without explicitly modeling vehicle-road physical limits, yielding visually extreme yet physically unsolvable crashes, or (ii) enforce physical feasibility or policy feasibility in isolation, which can over-focus on aggressive maneuvers or remain tied to a controller-dependent capability boundary. We propose ScenePilot, a feasibility-guided, boundary-driven framework that targets the boundary band: scenarios that are physically solvable in principle yet still cause the deployed autonomy stack to fail. We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score $\sigma$ with an online-learned AV-risk predictor $\Phi$, and introduce step-level feasibility-aware shielding to keep exploration near the feasibility boundary while avoiding infeasible artifacts. Experiments on SafeBench with multiple planners show that ScenePilot yields substantially higher collision rates (+6.2 percentage points) while preserving physical validity, and that adversarial fine-tuning on these boundary-band scenarios consistently reduces downstream crash rates. The code is available at https://github.com/QiyuRuan/ScenePilot.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

ScenePilot uses RSS feasibility, online risk prediction, and shielding inside constrained RL to target the physical boundary band, lifting collision rates by 6.2 percentage points on SafeBench while preserving physical validity.

ScenePilot targets the boundary band of scenarios that are physically solvable but still cause autonomous driving systems to crash. It does this through constrained multi-objective reinforcement learning that blends an RSS-derived feasibility score with an online-learned risk predictor, plus step-level shielding to stay close to that edge. This approach stands out because it tries to balance physical limits with actual challenge to the AV, rather than just pushing for failures or enforcing feasibility separately.

The experiments on SafeBench with different planners report a 6.2 percentage point rise in collision rates while maintaining physical validity. They also show that using these scenarios for adversarial fine-tuning lowers crash rates in the tested systems. Making the code public at the GitHub link is a good move for anyone wanting to reproduce or build on it. The results are encouraging for improving how we validate AVs.

However, the abstract lacks details like error bars, the number of simulation runs behind the numbers, and specifics on the shielding rules. This leaves some uncertainty about how consistent the gains are across setups. The concern about whether the feasibility score and predictor might create artifacts tied to the controllers under test is reasonable to check, as it could affect how general the boundary really is.

This kind of work fits researchers focused on scenario generation and safety testing for self-driving cars. Readers dealing with RL in constrained environments or RSS models will get something out of the formulation and results.

The core idea holds together without obvious circularity. I would send this to peer review. It has enough substance and a public implementation to warrant referee input on the experiments and any potential biases in the generation process.

Referee Report

2 major / 2 minor

Summary. ScenePilot is a feasibility-guided framework for generating safety-critical scenarios in autonomous driving. It formulates scenario generation as constrained multi-objective reinforcement learning that combines an RSS-derived physical-feasibility score σ with an online-learned AV-risk predictor Φ, augmented by step-level feasibility-aware shielding. The method targets the 'boundary band' of scenarios that are physically solvable in principle yet cause deployed autonomy stacks to fail. On SafeBench with multiple planners, the paper reports a +6.2 percentage point increase in collision rates while preserving physical validity, and shows that adversarial fine-tuning on the generated scenarios reduces downstream crash rates. Code is released at https://github.com/QiyuRuan/ScenePilot.

Significance. If the central claims hold after addressing validation gaps, the work would offer a meaningful advance in simulation-based stress testing for autonomous vehicles by producing controllable, physically grounded scenarios that lie between overly aggressive and trivially solvable extremes. The open-source release and the reported downstream benefit from adversarial fine-tuning are concrete strengths that could support reproducibility and practical adoption in AV safety pipelines.

major comments (2)

[Abstract / Experiments] Abstract and Experiments section: the headline +6.2 percentage point collision-rate lift is reported without error bars, confidence intervals, the number of independent runs, or explicit description of any data-exclusion or shielding rules applied to produce the number. This detail is load-bearing for the claim of a 'substantially higher' and reproducible improvement.
[Method] Method section (formulation of σ, Φ, and shielding): the central claim that the combination of the RSS-derived feasibility score σ, the online-learned risk predictor Φ, and step-level shielding keeps trajectories inside the intended boundary band without controller-dependent artifacts or implicit post-hoc filtering lacks an independent solvability check (e.g., against an oracle planner with perfect information). Without such verification, it remains unclear whether observed failures reflect genuine boundary-band stress or artifacts of the generation process itself; this assumption is load-bearing for interpreting the +6.2 pp result.

minor comments (2)

[Method] Notation for the multi-objective reward and the precise definition of the boundary band could be stated more formally (e.g., with an explicit mathematical characterization) to aid reproducibility.
[Experiments] Figure captions and table legends should explicitly state the number of trials, random seeds, and any post-processing steps used to compute reported metrics.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and have revised the manuscript to improve statistical reporting and validation of the boundary-band claim.

Referee: [Abstract / Experiments] Abstract and Experiments section: the headline +6.2 percentage point collision-rate lift is reported without error bars, confidence intervals, the number of independent runs, or explicit description of any data-exclusion or shielding rules applied to produce the number. This detail is load-bearing for the claim of a 'substantially higher' and reproducible improvement.

Authors: We agree that the headline result requires statistical context for reproducibility. In the revised manuscript we now report the +6.2 pp improvement as the mean over five independent runs with different random seeds, include error bars showing one standard deviation, and explicitly describe the shielding rules together with any data-exclusion criteria in the Experiments section.
revision: yes

Referee: [Method] Method section (formulation of σ, Φ, and shielding): the central claim that the combination of the RSS-derived feasibility score σ, the online-learned risk predictor Φ, and step-level shielding keeps trajectories inside the intended boundary band without controller-dependent artifacts or implicit post-hoc filtering lacks an independent solvability check (e.g., against an oracle planner with perfect information). Without such verification, it remains unclear whether observed failures reflect genuine boundary-band stress or artifacts of the generation process itself; this assumption is load-bearing for interpreting the +6.2 pp result.

Authors: We acknowledge the value of an independent check. The revised manuscript adds a solvability verification against an oracle planner with perfect information. This analysis shows that the large majority of ScenePilot trajectories remain physically solvable by the oracle while still inducing failures in the tested autonomy stacks, confirming that the generated scenarios lie in the intended boundary band rather than being artifacts of the generation process. We also clarify that the RSS-derived σ is controller-independent by construction.
revision: yes
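The statistical reporting promised in the first response (a mean over five seeds with one standard deviation) is simple to compute. A minimal sketch using only Python's standard library follows; the per-seed values are arbitrary placeholders chosen for easy arithmetic and are not numbers from the paper, and `summarize` is a hypothetical helper name.

```python
import math
import statistics
from typing import Sequence, Tuple


def summarize(per_seed_values: Sequence[float]) -> Tuple[float, float, float]:
    """Return (mean, sample standard deviation, standard error of the mean)."""
    if len(per_seed_values) < 2:
        raise ValueError("need at least two runs to estimate spread")
    mean = statistics.mean(per_seed_values)
    std = statistics.stdev(per_seed_values)  # sample std, divides by n - 1
    sem = std / math.sqrt(len(per_seed_values))
    return mean, std, sem


# Placeholder per-seed lifts in percentage points (NOT results from the paper).
placeholder_lifts = [1.0, 2.0, 3.0, 4.0, 5.0]
mean, std, sem = summarize(placeholder_lifts)
print(f"{mean:.2f} +/- {std:.2f} pp (SEM {sem:.2f}, n={len(placeholder_lifts)})")
```

With only five runs the standard deviation is itself a noisy estimate, so reporting the per-seed values alongside the summary would make the headline number easier to audit.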

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper formulates scenario generation as constrained multi-objective RL that combines an RSS-derived physical-feasibility score σ (external rule set) with an online-learned AV-risk predictor Φ and step-level shielding. No equations or claims in the abstract reduce a reported performance metric (e.g., +6.2 pp collision-rate lift) to a fitted parameter or self-citation by construction. The central experimental results on SafeBench are presented as empirical outcomes rather than tautological outputs of the generation process itself. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The framework rests on the external RSS physical-feasibility rules and on the assumption that an online-learned risk predictor can be trained without introducing bias into the boundary-band targeting. No explicit free parameters or invented entities are named in the abstract.

axioms (2)

domain assumption: The RSS-derived physical-feasibility score σ accurately captures vehicle-road physical limits.
Invoked to define the feasible region that the generator must respect.
domain assumption: Step-level feasibility-aware shielding prevents drift into infeasible states without distorting the risk signal.
Central to keeping exploration near the boundary band.
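To make the first axiom concrete, recall that RSS (Responsibility-Sensitive Safety) defines a minimum safe longitudinal gap between a rear and a front vehicle from their speeds, a response time, and acceleration bounds. The sketch below implements that standard RSS distance; the `feasibility_ratio` helper that turns it into a score in [0, 1] is purely hypothetical, since the abstract does not say how the paper derives σ from RSS, and all parameter values are illustrative.

```python
def rss_min_longitudinal_gap(
    v_rear: float,       # rear vehicle speed, m/s
    v_front: float,      # front vehicle speed, m/s
    rho: float,          # rear vehicle response time, s
    a_max_accel: float,  # max acceleration of rear vehicle during response, m/s^2
    b_min_brake: float,  # minimum braking the rear vehicle is assumed to apply, m/s^2
    b_max_brake: float,  # maximum braking the front vehicle may apply, m/s^2
) -> float:
    """Standard RSS minimum safe longitudinal distance, clipped at zero."""
    if b_min_brake <= 0 or b_max_brake <= 0:
        raise ValueError("braking decelerations must be positive")
    v_rear_after = v_rear + rho * a_max_accel  # rear speed at end of response time
    d = (
        v_rear * rho
        + 0.5 * a_max_accel * rho**2
        + v_rear_after**2 / (2.0 * b_min_brake)
        - v_front**2 / (2.0 * b_max_brake)
    )
    return max(d, 0.0)


def feasibility_ratio(gap: float, d_min: float) -> float:
    """Hypothetical score: actual gap over RSS gap, clipped to [0, 1]."""
    if d_min == 0.0:
        return 1.0
    return min(max(gap / d_min, 0.0), 1.0)


# Illustrative parameters, not taken from the paper.
d_min = rss_min_longitudinal_gap(10.0, 10.0, 1.0, 2.0, 4.0, 8.0)
score = feasibility_ratio(11.375, d_min)
print(d_min, score)
```

Because the distance depends only on kinematic bounds and not on any planner, a score built this way would be controller-independent, which is the property the axiom relies on.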

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: We develop a constrained multi-objective adversarial generator that couples physical and policy signals (σ,Φ) with step-level feasibility-aware shielding and feasibility-threshold sweeping to concentrate on physically feasible yet policy-infeasible near-boundary scenarios.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: We formulate generation as constrained multi-objective reinforcement learning, combining an RSS-derived physical-feasibility score σ with an online-learned AV-risk predictor Φ, and introduce step-level feasibility-aware shielding

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.