pith. machine review for the scientific record.

arxiv: 2605.10034 · v1 · submitted 2026-05-11 · 💻 cs.RO

Recognition: no theorem link

Beyond Self-Play and Scale: A Behavior Benchmark for Generalization in Autonomous Driving

Andreas Look, Anna Rothenhäusler, Aron Distelzweig, Daniel Jost, Daphne Cornelisse, Eugene Vinitsky, Faris Janjoš, Joschka Boedecker, Oliver Scheel, Raghu Rajan

Pith reviewed 2026-05-12 03:16 UTC · model grok-4.3

classification 💻 cs.RO
keywords autonomous driving · reinforcement learning · self-play · generalization · benchmark · traffic agents · nuPlan · hybrid planner

The pith

Reinforcement learning policies for autonomous driving trained through pure self-play overfit to their opponents and fail to generalize to other traffic behaviors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces BehaviorBench to evaluate large-scale RL driving policies both on established benchmarks such as nuPlan and directly inside the PufferDrive simulator. It extracts interaction-rich scenarios from real data where simple lane following fails and replaces the single rule-based traffic model with a diverse set of interactive agents. The central finding is that policies which learn emergent interactions during self-play still overfit to the specific opponents seen in training. The authors respond by proposing a hybrid planner that combines the PPO policy with a rule-based component.

Core claim

Policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors, even though interactive behaviors emerge during training. BehaviorBench makes this visible by connecting PufferDrive to nuPlan, using an interaction-rich split extracted from the Waymo Open Motion Dataset, and testing against heterogeneous traffic models instead of only the Intelligent Driver Model.

What carries the argument

BehaviorBench, a test suite that connects RL-trained policies to nuPlan, uses interaction-rich WOMD scenarios, and evaluates against a diverse suite of interactive traffic agents.
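
To make the machinery concrete, here is a minimal sketch of that cross-evaluation, with hypothetical interfaces rather than the actual BehaviorBench or nuPlan API: every planner is rolled out against every traffic-agent model on the same scenario split and scored on shared per-scenario metrics, which is the matrix the radar plots in Figure 3 summarize.

```python
# A minimal sketch of the cross-evaluation behind Figure 3 (hypothetical
# interfaces, not the BehaviorBench / nuPlan API): every planner is rolled out
# against every traffic-agent model on the same scenarios and scored with the
# same per-scenario metrics.
from dataclasses import dataclass
from typing import Callable, Dict, List, Tuple

@dataclass
class EpisodeResult:
    at_fault_collision: bool
    off_road: bool
    goal_completed: bool
    comfort: float  # e.g. fraction of steps within jerk/acceleration bounds

def cross_evaluate(
    planners: Dict[str, object],
    traffic_models: Dict[str, object],
    scenarios: List[object],
    run_scenario: Callable[[object, object, object], EpisodeResult],
) -> Dict[Tuple[str, str], Dict[str, float]]:
    """Aggregate metrics for every (planner, traffic model) pair.

    `run_scenario` stands in for a closed-loop rollout (PufferDrive or nuPlan);
    `scenarios` is assumed non-empty (e.g. the Interactive1k split).
    """
    table: Dict[Tuple[str, str], Dict[str, float]] = {}
    for p_name, planner in planners.items():
        for t_name, traffic in traffic_models.items():
            results = [run_scenario(planner, traffic, s) for s in scenarios]
            n = len(results)
            table[(p_name, t_name)] = {
                "at_fault_collision_rate": sum(r.at_fault_collision for r in results) / n,
                "off_road_rate": sum(r.off_road for r in results) / n,
                "goal_completion": sum(r.goal_completed for r in results) / n,
                "comfort": sum(r.comfort for r in results) / n,
            }
    return table
```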

Load-bearing premise

The performance drop against diverse traffic agents is caused by overfitting to self-play opponents rather than differences in simulation fidelity, insufficient training scale, or other factors.

What would settle it

Retraining the same policy architecture in self-play with the diverse traffic agents mixed into the opponent pool would settle it: if the performance drop on the benchmark disappears, overfitting to the training opponents is confirmed as the cause; if the drop persists despite the broader opponent pool, the overfitting claim is falsified and other factors such as simulation fidelity are implicated.
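
A minimal sketch of that experiment, assuming hypothetical rollout and update interfaces rather than the paper's actual training stack: the only change from pure self-play is that each episode's surrounding traffic is sampled from a pool that also contains the benchmark's diverse agents.

```python
# A minimal sketch of the settling experiment (hypothetical interfaces): each
# training episode draws the surrounding traffic from a pool that mixes
# self-play copies of the learner with the benchmark's diverse agents.
import random
from typing import Callable, List, Sequence

def sample_opponents(
    current_policy: object,
    diverse_agents: Sequence[object],  # e.g. IDM, SMART, conditioned PPO agents
    n_agents: int,
    p_self_play: float = 0.5,
) -> List[object]:
    """Per-episode opponent assignment: self-play copy or a diverse agent."""
    return [
        current_policy if random.random() < p_self_play else random.choice(list(diverse_agents))
        for _ in range(n_agents)
    ]

def train_with_mixed_opponents(
    policy: object,
    diverse_agents: Sequence[object],
    make_env: Callable[..., object],          # hypothetical: builds an episode with given opponents
    collect_rollout: Callable[[object, object], object],
    ppo_update: Callable[[object, object], None],
    n_iters: int = 1000,
    n_agents: int = 32,
) -> None:
    # If the resulting policy shows no drop against held-out behaviors, the
    # original drop was opponent overfitting; if the drop persists, it was not.
    for _ in range(n_iters):
        env = make_env(opponents=sample_opponents(policy, diverse_agents, n_agents))
        ppo_update(policy, collect_rollout(env, policy))
```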

Figures

Figures reproduced from arXiv: 2605.10034 by Andreas Look, Anna Rothenhäusler, Aron Distelzweig, Daniel Jost, Daphne Cornelisse, Eugene Vinitsky, Faris Janjoš, Joschka Boedecker, Oliver Scheel, Raghu Rajan.

Figure 1: Three limitations of current planner benchmarks.
Figure 2: Qualitative comparison of our Interactive1k and Random1k splits. Four representative scenarios from each split, showing the ego vehicle (red), surrounding agents (blue), the ego goal (green), and the expert trajectory (red). Interactive1k (top row) is dominated by dense urban situations while Random1k (bottom row) reflects simpler cases.
Figure 3: Planner performance across traffic agents and benchmark splits. Each radar plot shows the six core metrics (At-Fault Collision, Off-Road, Goal Completion, Comfort, Center Alignment, and Lane Alignment) for a single (planner, traffic agent) combination. Rows correspond to the four planners and columns to the eight traffic agents. All axes are oriented such that the outer ring corresponds to the best observe…
Figure 4: Collision, comfort, and goal-completion scores on Interactive1k for the three reward-conditioned policies π_θ^aggr, π_θ^norm, and π_θ^caut, with the ego controlled by the conditioned policy and surrounding agents by IDM.
Figure 5: At-fault collision rate (%) on the Interactive1k split for PPO planners trained via self-play under a simple (top) and complex (bottom) reward formulation, evaluated at intermediate checkpoints up to 10^11 agent steps against PPO (solid) and IDM (dashed) traffic. Both reward formulations yield low collision rates against their own behavior but fail to generalize to IDM behavior. Scaling RL alone is not e…
Figure 6: Score of the PDM+PPO hybrid planner as a function of top-K on Interactive1k (left) and …
Figure 7: Per-step runtime of the PDM+PPO hybrid planner as a function of top-K on Interactive1k …
Figure 8: Score of the PDM+PPO hybrid planner as a function of the planning horizon …
Figure 9: Per-step runtime of the PDM+PPO hybrid planner as a function of the planning horizon on Interactive1k (left) and Random1k (right).
Figure 10: Example scenarios from our Interactive1k split.
Figure 11: Example scenarios from our Random1k split.
Figures 12–21: Qualitative comparison of traffic agent models. Each figure visualizes rollouts of IDM, PPO, SMART, expert, and the conditioned PPO traffic agents (top to bottom) on the same scenario at four time steps over T = 3 s (left to right); agents are shown with their past trajectories.
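
Figure 4 above compares three reward-conditioned traffic policies (aggressive, normal, cautious), which also appear as the "conditioned PPO traffic agents" in Figures 12 to 21. A minimal, hypothetical sketch of the underlying idea, not the paper's architecture or reward terms: a single policy receives a behavior code alongside its observation, and the same code switches the reward weights used during training.

```python
# Hypothetical sketch of behavior-code conditioning (not the paper's design):
# one network realises the aggressive / normal / cautious modes of Figure 4.
import numpy as np

BEHAVIOR_CODES = {
    "aggressive": np.array([1.0, 0.0, 0.0]),
    "normal":     np.array([0.0, 1.0, 0.0]),
    "cautious":   np.array([0.0, 0.0, 1.0]),
}

def conditioned_observation(base_obs: np.ndarray, behavior: str) -> np.ndarray:
    """Append a one-hot behavior code to the flat observation vector."""
    return np.concatenate([base_obs, BEHAVIOR_CODES[behavior].astype(base_obs.dtype)])

def conditioned_reward(progress: float, headway_violation: float, behavior: str) -> float:
    """Illustrative reward weights only; the paper's actual reward terms are not specified here."""
    speed_w, safety_w = {
        "aggressive": (1.5, 0.5),
        "normal":     (1.0, 1.0),
        "cautious":   (0.5, 2.0),
    }[behavior]
    return speed_w * progress - safety_w * headway_violation
```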
read the original abstract

Recent Autonomous Driving (AD) works such as GigaFlow and PufferDrive have unlocked Reinforcement Learning (RL) at scale as a training strategy for driving policies. Yet such policies remain disconnected from established benchmarks, leaving the performance of large-scale RL for driving on standardized evaluations unknown. We present BehaviorBench -- a comprehensive test suite that closes this gap along three axes: Evaluation, Complexity, and Behavior Diversity. In terms of Evaluation, we provide an interface connecting PufferDrive to nuPlan, which, for the first time, enables policies trained via RL at scale to be evaluated on an established planning benchmark for autonomous driving. Complementarily, we offer an evaluation framework that allows planners to be benchmarked directly inside the PufferDrive simulation, at a fraction of the time. Regarding Complexity, we observe that today's standardized benchmarks are so simple that near-perfect scores are achievable by straight lane following with collision checking. We extract a meaningful, interaction-rich split from the Waymo Open Motion Dataset (WOMD) on which strong performance is impossible without multi-agent reasoning. Lastly, we address Behavior Diversity. Existing benchmarks commonly evaluate planners against a single rule-based traffic model, the Intelligent Driver Model (IDM). We provide a diverse suite of interactive traffic agents to stress-test policies under heterogeneous behaviors, beyond just using IDM. Overall, our benchmarking analysis uncovers the following insight: despite learning interactive behaviors in an emergent manner, policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors. Building on this observation, we propose a hybrid planner that combines a PPO policy with a rule-based planner.
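
The abstract's last sentence proposes a hybrid of a PPO policy and a rule-based planner; Figures 6 to 9 sweep its top-K and planning-horizon parameters. As a reading aid, here is a minimal sketch of one plausible coupling, with hypothetical interfaces; the paper's actual PDM+PPO design may differ.

```python
# One plausible rule-based + learned hybrid (hypothetical interfaces, not the
# paper's PDM+PPO implementation): the rule-based side generates and scores
# candidate trajectories over a planning horizon, and the learned policy
# re-ranks only the top-K of them. This is why K and the horizon are the knobs
# swept in Figures 6 to 9.
from typing import Callable, List, Optional, Tuple

def hybrid_plan(
    observation: object,
    propose: Callable[[object, int], List[Tuple[object, float]]],  # rule-based: (trajectory, score) pairs
    policy_score: Callable[[object, object], float],               # learned: preference for a trajectory in context
    top_k: int = 5,
    horizon: int = 10,
) -> Optional[object]:
    """Return the trajectory preferred by the learned policy among the rule-based top-K."""
    candidates = propose(observation, horizon)           # rule-based proposals with their scores
    candidates.sort(key=lambda ts: ts[1], reverse=True)  # rank by the rule-based score
    shortlist = candidates[:top_k]                       # top_k == 1 degenerates to the pure rule-based planner
    if not shortlist:
        return None
    return max(shortlist, key=lambda ts: policy_score(observation, ts[0]))[0]
```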

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper introduces BehaviorBench, a test suite for autonomous driving policies with three components: an interface linking PufferDrive-trained RL policies to the nuPlan benchmark for standardized evaluation, an interaction-rich scenario split extracted from the Waymo Open Motion Dataset (WOMD) where simple lane-following is insufficient, and a diverse suite of interactive traffic agents extending beyond the standard Intelligent Driver Model (IDM). The central empirical insight is that policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to heterogeneous traffic behaviors, motivating a proposed hybrid planner that combines a PPO policy with a rule-based planner.

Significance. If the reported results and controls hold, the work would be significant for establishing the first direct connection between large-scale self-play RL training and established AD benchmarks like nuPlan, while providing evaluation tools that stress multi-agent complexity and behavior diversity. This could guide future research away from pure self-play toward more robust generalization strategies and offer a reproducible framework for testing interactive driving policies at lower computational cost than full nuPlan runs.

major comments (1)
  1. [Abstract] The central claim that 'policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors' is stated without any quantitative results, ablation studies, error bars, training details, reward specifications, or description of how the diverse traffic agents were constructed. This absence makes it impossible to assess whether the performance drop is caused by overfitting (as claimed) or by unmeasured factors such as simulation fidelity differences between PufferDrive and nuPlan, rendering the key insight unverifiable from the provided manuscript.
minor comments (1)
  1. The abstract asserts that the nuPlan interface is provided 'for the first time'; the full manuscript should include citations to any prior attempts at connecting RL policies to nuPlan to substantiate this novelty claim.

Simulated Author's Rebuttal

1 response · 1 unresolved

We thank the referee for their constructive feedback. We address the single major comment below.

read point-by-point responses
  1. Referee: The central claim that 'policies trained via pure self-play under standard reward functions overfit to their training opponents and fail to generalize to other traffic agent behaviors' is stated without any quantitative results, ablation studies, error bars, training details, reward specifications, or description of how the diverse traffic agents were constructed. This absence makes it impossible to assess whether the performance drop is caused by overfitting (as claimed) or by unmeasured factors such as simulation fidelity differences between PufferDrive and nuPlan, rendering the key insight unverifiable from the provided manuscript.

    Authors: We agree that the abstract states the central claim without quantitative support, ablations, error bars, reward details, or agent construction information, which prevents readers from verifying the overfitting interpretation versus other factors such as simulator differences. Because only the abstract is available here, we cannot reference or quote specific results from the body. We will revise the abstract to include a brief quantitative summary of the observed performance degradation, high-level training and reward information, and a description of the diverse traffic agents, plus a short note on simulation fidelity controls. This will make the key insight verifiable at the abstract level while preserving brevity. revision: yes

standing simulated objections not resolved
  • Specific quantitative results, ablation studies, error bars, reward specifications, and exact descriptions of traffic agent construction are absent from the provided abstract, so we cannot supply or cite them in this response.

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation

full rationale

The paper introduces BehaviorBench as an evaluation interface and test suite linking large-scale RL policies (from PufferDrive) to external benchmarks (nuPlan, WOMD) and a new suite of diverse traffic agents. The central insight—that pure self-play policies overfit to training opponents—is presented as an observed performance pattern from these comparisons, not as a derived result from any equations, fitted parameters, or self-referential definitions. No self-citations, uniqueness theorems, ansatzes, or renamings of known results appear in the provided abstract. The work is self-contained as a standard empirical benchmarking contribution with independent external validation points.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the work is framed as empirical benchmarking without new theoretical constructs or fitted constants.

pith-pipeline@v0.9.0 · 5616 in / 1096 out tokens · 43058 ms · 2026-05-12T03:16:27.054017+00:00 · methodology

discussion (0)

