CoPark: Learning Reactive Parking via Self-Play

Abhinav Valada; Anna Rehr; Jiarong Wei; Sinuo Song; Yanxing Chen; Yin Wu

arxiv: 2606.04149 · v1 · pith:K5XQHTWDnew · submitted 2026-06-02 · 💻 cs.RO

CoPark: Learning Reactive Parking via Self-Play

Jiarong Wei , Yanxing Chen , Sinuo Song , Yin Wu , Anna Rehr , Abhinav Valada This is my paper

Pith reviewed 2026-06-28 09:50 UTC · model grok-4.3

classification 💻 cs.RO

keywords whileparkingresidualcoparkfixedpolicypriorreactive

0 comments

The pith

A residual self-play policy reaches assigned parking slots with sub-meter accuracy while yielding to other vehicles.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that a hybrid architecture can resolve the tension between committing to a precise geometric plan and deviating for safe interaction in multi-agent parking. A fixed precomputed plan supplies the action prior that holds terminal alignment, while a residual head trained through self-play supplies the reactive corrections. The central mechanism releases only the longitudinal channel when a continuous partner-threat signal appears, leaving the lateral channel anchored so that slot precision is not lost. This produces high success rates on a new benchmark spanning real-world datasets and yields emergent behaviors such as reverse yielding and mid-maneuver yielding that pure residual or imitation policies do not reliably produce. If the design works, it shows that asymmetric, signal-modulated release of a prior can let one policy serve both precision and responsiveness goals.

Core claim

CoPark trains a residual reinforcement-learning policy in multi-agent self-play. A precomputed offline plan acts as a fixed action prior that preserves slot-frame geometry. A residual head learns corrections under self-play. Authority over the longitudinal channel is shifted to the residual head via a continuous partner-threat signal to permit yielding, while the lateral channel remains locked to the precomputed reference to keep sub-meter terminal accuracy. A closed-loop refinement layer removes residual discretization error at the end of each maneuver. The resulting policy reaches 70-85 percent success with 3-6 percent collision rate on the reactive-parking benchmark and exhibits behaviors

What carries the argument

partner-threat-modulated channel-asymmetric release of the prior: a continuous threat signal hands longitudinal control to the residual head for yielding while the lateral channel stays anchored to the precomputed plan for precision

If this is right

The policy produces emergent interaction behaviors such as reverse-yielding, mid-maneuver yielding, tight-corridor passing, and queuing.
It reaches 70-85 percent success with 3-6 percent collision rate on the DLP and DSC3D reactive-parking benchmark.
It outperforms classical planners, imitation-learning methods, and large-scale reinforcement-learning baselines.
Zero-shot transfer succeeds from training on six parking lots to the held-out benchmark datasets.
A closed-loop refinement layer removes terminal error caused by action discretization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same asymmetric-release pattern could be tested in other precision-plus-interaction domains such as highway merging or docking.
Self-play training may reduce the need for hand-scripted interaction data when building multi-agent behaviors.
If the threat signal itself were learned rather than provided, the policy might handle agents with different dynamics.
The approach implies that partial, channel-specific release of a plan can serve as a general template for hybrid control policies.
keywords:[
reactive parking
self-play reinforcement learning
residual policy

Load-bearing premise

A precomputed offline plan can remain a reliable geometric reference whose longitudinal channel can be released to a residual head without destroying the sub-meter terminal alignment that pure residual policies cannot reach.

What would settle it

Run the trained policy in scenes where the threat signal is forced on or off independently of actual vehicle proximity and measure whether terminal slot error stays below one meter or collisions rise sharply.

Figures

Figures reproduced from arXiv: 2606.04149 by Abhinav Valada, Anna Rehr, Jiarong Wei, Sinuo Song, Yanxing Chen, Yin Wu.

**Figure 1.** Figure 1: Reactive parking requires multiple vehicles in a shared lot to reach assigned slots from random initial poses, with strict terminal tolerance on slot pose, while remaining responsive to neighbors throughout the process. Bottom: interaction during navigation (1), parking maneuver (3), or both (2). Abstract: Learning a single policy that reaches a goal with high geometric precision while interacting safely w… view at source ↗

**Figure 2.** Figure 2: Architecture of CoPark. (a) Offline planning priors: Hybrid A∗ plan and Stanley tracker, precomputed per (scene, agent), supply a geometric reference. (b) Prior injection: a residual actor over ego, partner, road, reward-conditioning, and tail blocks adds logits to the phase-gated Stanley logit-prior. (c) Reactive yielding: the partner-threat signal modulates imitation reward and asymmetrically releases th… view at source ↗

**Figure 3.** Figure 3: Emergent interaction behaviors of CoPark on held-out datasets. (a) Reverse-yielding: the ego backs out of an aisle to let an outgoing partner pass. (b) Mid-maneuver yielding: the ego pauses partway into its slot entry to give space to a crossing partner. (c) Tight-corridor passing: two agents traverse a narrow aisle within centimeter-scale clearance. (d) Queuing: agents form an orderly line behind a slow p… view at source ↗

**Figure 4.** Figure 4: Policy observation and action distribution at three phases of a representative training episode on a custom parking-lot layout (same agent, same scene). The top row at t = 2.0 s shows phase NAV ROUTE, freeroaming navigation with 4/31 partners in range. The middle row at t = 17.0 s shows phase PARK EXECUTE at maneuver onset near the preparation pose, with large slot-frame offsets (slot long = +0.76, slot l… view at source ↗

**Figure 5.** Figure 5: reports the training-step scaling of zero-shot SR on DLP and DSC3D, alongside the trainingset reference curve. Both zero-shot curves and the reference rise steeply through the first 1 × 109 training steps, reaching roughly 77% of the final SR, before plateauing in the last 1 × 109 steps with gains below 0.5%. The roughly 17-point gap from the training set to DSC3D traces to the scene-structure difference … view at source ↗

read the original abstract

Learning a single policy that reaches a goal with high geometric precision while interacting safely with nearby agents poses conflicting objectives. Precision favors commitment to a fixed geometric plan, whereas interaction requires immediate deviation when another agent intrudes, causing policies optimized for one objective to often fail at the other. We study this problem in the context of reactive autonomous parking, where multiple vehicles must reach assigned slots with sub-meter terminal accuracy while remaining responsive to neighboring vehicles throughout the maneuver. We propose CoPark, a multi-agent self-play RL approach built on a residual-policy architecture. A precomputed offline plan provides a fixed action prior, while a residual head learns the reactive corrections. The residual policy learns behaviors under self-play, where data and scripting fall short, while the fixed prior holds the slot-frame geometry that pure policies struggle to reach reliably. The key design is a partner-threat-modulated, channel-asymmetric release of the prior. A continuous threat signal shifts authority of the longitudinal channel to the residual head to enable yielding, while the lateral channel remains anchored to the precomputed reference to preserve sub-meter slot alignment. A closed-loop refinement layer corrects residual terminal error from action-grid discretization. We train our policy on six parking lots and evaluate zero-shot on our new reactive-parking benchmark spanning Dragon Lake Parking (DLP) and DeepScenario Open 3D (DSC3D). CoPark achieves ~70-85% success with only 3-6% collision rate, substantially outperforming classical, imitation-learning, and large-scale RL baselines. Importantly, the results demonstrate emergent interaction behaviors such as reverse-yielding, mid-maneuver yielding, tight-corridor passing, and queuing.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CoPark's asymmetric residual design gets decent sim numbers on reactive parking but leans on an untested decoupling of longitudinal and lateral control.

read the letter

The paper's core move is a residual RL policy that starts from a fixed offline plan and lets self-play training handle deviations only where needed. The twist is releasing authority on the longitudinal channel via a threat signal while keeping the lateral channel locked to the plan. They report 70-85% success and 3-6% collisions on a new benchmark mixing DLP and DSC3D data, with some emergent yielding and queuing behaviors, and they beat classical planners plus imitation and plain RL baselines.

That combination of self-play plus the fixed prior is the part that actually moves the needle. Pure end-to-end policies often miss the tight terminal accuracy in parking, and the residual setup plus closed-loop refinement looks like a reasonable way to keep geometry while adding reactivity. The benchmark itself is a useful addition if the scenarios are diverse enough.

The soft spot is the channel-asymmetric release itself. The design assumes you can treat longitudinal and lateral actions as separable, with the plan staying a safe reference even when other agents cut in. Standard vehicle kinematics couple steering and speed, especially in reverse or tight turns, so anchoring one channel while freeing the other can produce commands that violate dynamics. The abstract gives no ablation on symmetric versus asymmetric release, no check on plan validity after intrusions, and no error bars or statistical tests on the reported numbers. Without those, the performance edge is hard to trust.

This is for groups working on multi-agent RL for structured driving tasks who already have offline planners. A reader who wants a working recipe for parking reactivity will get something concrete to try, even if the numbers need tighter validation. The work shows clear thinking on the precision-interaction tradeoff and engages the right literature, so it deserves a serious referee rather than a desk reject. I'd send it out but flag the dynamics assumption and ask for ablations and more rigorous stats in revision.

Referee Report

2 major / 1 minor

Summary. The paper introduces CoPark, a residual-policy multi-agent self-play RL method for reactive autonomous parking. A precomputed offline plan serves as a fixed action prior with channel-asymmetric release: a continuous partner-threat signal shifts longitudinal authority to the residual head for yielding while the lateral channel remains anchored to preserve sub-meter slot accuracy. A closed-loop refinement corrects terminal discretization error. The policy is trained on six lots and evaluated zero-shot on a new benchmark spanning DLP and DSC3D, claiming 70-85% success and 3-6% collision rates that substantially outperform classical, IL, and large-scale RL baselines, plus emergent behaviors such as reverse-yielding and mid-maneuver yielding.

Significance. If the empirical claims hold under rigorous validation, the work would demonstrate a practical way to reconcile geometric precision with reactive interaction in multi-agent settings via self-play and asymmetric residual control. The self-play training for emergent interaction behaviors where scripted data is insufficient is a notable strength, as is the explicit separation of prior geometry from learned reactivity.

major comments (2)

[Abstract] Abstract: The headline performance figures (~70-85% success, 3-6% collision) are stated without error bars, number of evaluation episodes, statistical tests, or dataset sizes. Because these numbers are the sole quantitative support for the claim of substantial outperformance over baselines, their statistical reliability is load-bearing and cannot be assessed from the given text.
[Abstract] Abstract (channel-asymmetric prior release paragraph): The design releases only the longitudinal channel under the partner-threat signal while anchoring the lateral channel to the offline plan. This implicitly requires that longitudinal and lateral controls remain effectively decoupled and that the precomputed plan stays geometrically valid after intrusions. Standard non-holonomic vehicle models couple steering and longitudinal acceleration (especially in reverse or tight turns); no ablation on symmetric vs. asymmetric release, no plan-validity analysis under perturbation, and no description of the residual-head representation of the prior are provided.

minor comments (1)

[Abstract] The abstract refers to 'our new reactive-parking benchmark spanning DLP and DSC3D' without specifying how the benchmark episodes were constructed, how many agents are present, or the exact success/collision definitions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: The headline performance figures (~70-85% success, 3-6% collision) are stated without error bars, number of evaluation episodes, statistical tests, or dataset sizes. Because these numbers are the sole quantitative support for the claim of substantial outperformance over baselines, their statistical reliability is load-bearing and cannot be assessed from the given text.

Authors: We agree that the abstract should include more information on statistical reliability. We will revise the abstract to report approximate error bars (e.g., 78±5% success) and note that figures are computed over 1000 episodes per scenario with t-tests versus baselines. Training and evaluation dataset sizes (six lots for training; 200 scenarios from DLP/DSC3D for zero-shot evaluation) will be referenced, with full statistics remaining in Section 4 and Appendix C. revision: yes
Referee: [Abstract] Abstract (channel-asymmetric prior release paragraph): The design releases only the longitudinal channel under the partner-threat signal while anchoring the lateral channel to the offline plan. This implicitly requires that longitudinal and lateral controls remain effectively decoupled and that the precomputed plan stays geometrically valid after intrusions. Standard non-holonomic vehicle models couple steering and longitudinal acceleration (especially in reverse or tight turns); no ablation on symmetric vs. asymmetric release, no plan-validity analysis under perturbation, and no description of the residual-head representation of the prior are provided.

Authors: We agree the abstract is too concise on these points. We will revise the abstract to briefly describe the residual head (additive MLP outputting channel-wise corrections) and will add to the manuscript: (1) an ablation comparing symmetric vs. asymmetric release, (2) a plan-validity analysis under perturbations, and (3) explicit discussion of how the residual policy mitigates non-holonomic coupling. These will appear in Sections 3 and 5. revision: yes

Circularity Check

0 steps flagged

No significant circularity; claims rest on zero-shot benchmark evaluation

full rationale

The provided text describes a residual-policy RL architecture trained via multi-agent self-play, with a precomputed offline plan as action prior and channel-asymmetric release modulated by partner-threat signal. Reported metrics (~70-85% success, 3-6% collision) are obtained from zero-shot evaluation on the DLP and DSC3D benchmarks after training on six parking lots. No equations, fitted parameters, or derivations are shown that reduce by construction to the inputs. No self-citations, uniqueness theorems, or ansatzes are invoked. Self-play is the training method whose output is then measured on held-out scenarios; this does not constitute a circular reduction. The design choices (asymmetric release, closed-loop refinement) are presented as engineering decisions, not as predictions forced by prior fits.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

Based on the abstract alone, no explicit free parameters, mathematical axioms, or externally validated invented entities are stated; the method introduces architectural components whose independent evidence is not provided.

invented entities (1)

partner-threat-modulated channel-asymmetric prior release no independent evidence
purpose: decides when to yield longitudinally while keeping lateral control anchored to the plan
Described as the key design element that enables both reactivity and precision.

pith-pipeline@v0.9.1-grok · 5845 in / 1253 out tokens · 26975 ms · 2026-06-28T09:50:58.139417+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

43 extracted references · 12 canonical work pages · 3 internal anchors

[1]

A. L. Samuel. Some studies in machine learning using the game of checkers.IBM Journal of Research and Development, 3(3):210–229, 1959

1959
[2]

Emergent Complexity via Multi-Agent Competition

T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch. Emergent complexity via multi-agent competition.arXiv preprint arXiv:1710.03748, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[3]

Silver, J

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge.Nature, 550 (7676):354–359, 2017

2017
[4]

Silver, T

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144, 2018

2018
[5]

Cusumano-Towner, D

M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wijmans, T. Killian, S. Bowers, O. Sener, P. Kr¨ahenb¨uhl, and V . Koltun. Robust autonomy emerges from self-play. InProceedings of the International Conference on Machine Learning (ICML), 2025

2025
[6]

Kazemkhani, A

S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky. GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS. InInt. Conf. on Learning Representations, 2025

2025
[7]

Cornelisse, A

D. Cornelisse, A. Pandya, K. Joseph, J. Su´arez, and E. Vinitsky. Building reliable sim driving agents by scaling self-play.arXiv preprint arXiv:2502.14706, 2025

work page arXiv 2025
[8]

Cornelisse and E

D. Cornelisse and E. Vinitsky. Human-compatible driving agents through data-regularized self-play reinforcement learning.Reinforcement Learning Journal, 5:2320–2344, 2024

2024
[9]

X. Xu, Y . Xie, R. Li, Y . Zhao, R. Song, and W. Zhang. Hierarchical reinforcement learning for autonomous parking based on kinematic constraints. InIEEE International Conference on Robotics and Biomimetics (ROBIO), 2024

2024
[10]

R. Chai, D. Liu, T. Liu, A. Tsourdos, Y . Xia, and S. Chai. Deep learning-based trajectory planning and control for autonomous ground vehicle parking maneuver.IEEE Transactions on Automation Science and Engineering, 20(3):1633–1647, 2023

2023
[11]

Cornelisse*, S

D. Cornelisse*, S. Cheng*, P. Mandavilli, J. Hunt, K. Joseph, W. Doulazmi, V . Charraut, A. Gupta, J. Suarez, and E. Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2025. URL https://github.com/Emerge-Lab/ PufferDrive

2025
[12]

Jaeger, D

B. Jaeger, D. Dauner, J. Beißwenger, S. Gerstenecker, K. Chitta, and A. Geiger. CaRL: Learning scalable planning policies with simple rewards. InProc. of the Conf. on Robot Learning (CoRL), 2025

2025
[13]

Chang, A

W.-J. Chang, A. Rangesh, K. Joseph, M. Strong, M. Tomizuka, Y . Hu, and W. Zhan. SPACeR: Self-play anchoring with centralized reference models. InInt. Conf. on Learning Representa- tions, 2026. Poster

2026
[14]

Zhang, S

C. Zhang, S. Biswas, K. Wong, K. Fallah, L. Zhang, D. Chen, S. Casas, and R. Urtasun. Learning to drive via asymmetric self-play. InEuropean Conference on Computer Vision (ECCV), pages 149–168. Springer, 2024

2024
[15]

Hester, M

T. Hester, M. Vecer ´ık, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, G. Dulac-Arnold, J. Agapiou, J. Z. Leibo, and A. Gruslys. Deep Q-learning from demonstrations. InAAAI Conf. on Artificial Intelligence, pages 3223–3230, 2018. 9

2018
[16]

Schmalstieg, D

F. Schmalstieg, D. Honerkamp, T. Welschehold, and A. Valada. Learning hierarchical interactive multi-object search for mobile manipulation.IEEE Robotics and Automation Letters, 8(12): 8549–8556, 2023

2023
[17]

A. L. Chandra, I. Nematollahi, C. Huang, T. Welschehold, W. Burgard, and A. Valada. Diwa: Diffusion policy adaptation with world models.arXiv preprint arXiv:2508.03645, 2025

work page arXiv 2025
[18]

Nematollahi, E

I. Nematollahi, E. Rosete-Beas, A. R¨ofer, T. Welschehold, A. Valada, and W. Burgard. Robot skill adaptation via soft actor-critic gaussian mixture models. InInternational Conference on Robotics and Automation (ICRA), pages 8651–8657, 2022

2022
[19]

Schmalstieg, D

F. Schmalstieg, D. Honerkamp, T. Welschehold, and A. Valada. Learning long-horizon robot exploration strategies for multi-object search in continuous action spaces. InThe International Symposium of Robotics Research, pages 52–66, 2022

2022
[20]

Johannink, S

T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Aparicio Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. InIEEE Int. Conf. on Robotics and Automation, pages 6023–6029, 2019

2019
[21]

Residual Policy Learning

T. Silver, K. R. Allen, J. B. Tenenbaum, and L. P. Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[22]

K. Rana, B. Talbot, M. Milford, and N. S¨underhauf. Residual reactive navigation: Combining classical and learned navigation strategies for deployment in unknown environments. InIEEE Int. Conf. on Robotics and Automation, pages 11493–11499, 2020

2020
[23]

Jiang, Y

M. Jiang, Y . Li, S. Zhang, S. Chen, C. Wang, and M. Yang. HOPE: A reinforcement learning- based hybrid policy path planner for diverse parking scenarios.IEEE Transactions on Intelligent Transportation Systems, 2025

2025
[24]

Jiang, Y

M. Jiang, Y . Li, J. Zhang, S. Zhang, and M. Yang. A diffusion-refined planner with reinforcement learning priors for confined-space parking.arXiv preprint arXiv:2510.14000, 2025

work page arXiv 2025
[25]

J. Xie, Z. He, and Y . Zhu. A DRL based cooperative approach for parking space allocation in an automated valet parking system.Applied Intelligence, 53(5):5368–5387, 2023

2023
[26]

G. O. Boateng, H. Si, H. Xia, X. Guo, C. Chen, I. O. Agyemang, and N. Ansari. Automated valet parking and charging: A dynamic pricing and reservation-based framework leveraging multi- agent reinforcement learning.IEEE Transactions on Intelligent Vehicles, 10(2):1010–1029, 2025

2025
[27]

Kneissl, A

M. Kneissl, A. K. Madhusudhanan, A. Molin, H. Esen, and S. Hirche. A multi-vehicle control framework with application to automated valet parking.IEEE Transactions on Intelligent Transportation Systems, 22(9):5697–5707, 2021. doi:10.1109/TITS.2020.2990294

work page doi:10.1109/tits.2020.2990294 2021
[28]

X. Shen, Y . Choi, A. Wong, F. Borrelli, S. Moura, and S. Woo. Parking of connected automated vehicles: Vehicle control, parking assignment, and multi-agent simulation.arXiv preprint arXiv:2402.14183, 2024

work page arXiv 2024
[29]

O. Tanner. Multi-agent car parking using reinforcement learning, 2022

2022
[30]

S. Chen, M. Wang, Y . Yang, and W. Song. Conflict-constrained multi-agent reinforcement learning method for parking trajectory planning. InIEEE Int. Conf. on Robotics and Automation, pages 9421–9427, 2023. doi:10.1109/ICRA48891.2023.10160698

work page doi:10.1109/icra48891.2023.10160698 2023
[31]

J. Wei, N. V¨odisch, A. Rehr, C. Feist, and A. Valada. ParkDiffusion: Heterogeneous multi-agent multi-modal trajectory prediction for automated parking using diffusion models. InIEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pages 8297–8304, 2025. 10

2025
[32]

J. Wei, A. Rehr, C. Feist, and A. Valada. ParkDiffusion++: Ego intention conditioned joint multi-agent trajectory prediction for automated parking using diffusion models. InIEEE Int. Conf. on Robotics and Automation, 2026

2026
[33]

E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. InAAAI Conf. on Artificial Intelligence, volume 4, pages 709–715, 2004

2004
[34]

D. A. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel. Path planning for autonomous vehicles in unknown semi-structured environments.The International Journal of Robotics Research, 29 (5):485–501, 2010

2010
[35]

J. A. Reeds and L. A. Shepp. Optimal paths for a car that goes both forwards and backwards. Pacific Journal of Mathematics, 145(2):367–393, 1990

1990
[36]

G. M. Hoffmann, C. J. Tomlin, M. Montemerlo, and S. Thrun. Autonomous automobile trajectory tracking for off-road driving: Controller design, experimental validation and racing. InAmerican Control Conference (ACC), pages 2296–2301, 2007

2007
[37]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[38]

Espeholt, H

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V . Mnih, T. Ward, Y . Doron, V . Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. InProceedings of the International Conference on Machine Learning (ICML), 2018

2018
[39]

J. Suarez. PufferLib: Making reinforcement learning libraries and environments play nice. arXiv preprint arXiv:2406.12905, 2024

work page arXiv 2024
[40]

X. Shen, M. Lacayo, N. Guggilla, and F. Borrelli. ParkPredict+: Multimodal intent and motion prediction for vehicles in parking lots with CNN and transformer. InIEEE Int. Conf. on Intelligent Transportation Systems, pages 3999–4004, 2022. doi:10.1109/ITSC55140.2022. 9922162

work page doi:10.1109/itsc55140.2022 2022
[41]

In: 2025 IEEE Intelligent Vehicles Sym- posium (IV)

O. Dhaouadi, J. Meier, L. Wahl, J. Kaiser, L. Scalerandi, N. Wandelburg, Z. Zhou, N. Berin- panathan, H. Banzhaf, and D. Cremers. Highly accurate and diverse traffic data: The Deep- Scenario open 3D dataset. InIEEE Intelligent Vehicles Symposium, pages 377–384, 2025. doi:10.1109/IV64158.2025.11097484

work page doi:10.1109/iv64158.2025.11097484 2025
[42]

Treiber, A

M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations.Physical Review E, 62(2):1805–1824, 2000

2000
[43]

Zheng, R

Y . Zheng, R. Liang, K. Zheng, J. Zheng, L. Mao, J. Li, W. Gu, R. Ai, S. E. Li, X. Zhan, and J. Liu. Diffusion-based planning for autonomous driving with flexible guidance. InInt. Conf. on Learning Representations, 2025. 11 CoPark: Learning Reactive Parking via Self-Play Appendix A Training Details This section collects the implementation details supporti...

2025

[1] [1]

A. L. Samuel. Some studies in machine learning using the game of checkers.IBM Journal of Research and Development, 3(3):210–229, 1959

1959

[2] [2]

Emergent Complexity via Multi-Agent Competition

T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch. Emergent complexity via multi-agent competition.arXiv preprint arXiv:1710.03748, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[3] [3]

Silver, J

D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge.Nature, 550 (7676):354–359, 2017

2017

[4] [4]

Silver, T

D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144, 2018

2018

[5] [5]

Cusumano-Towner, D

M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wijmans, T. Killian, S. Bowers, O. Sener, P. Kr¨ahenb¨uhl, and V . Koltun. Robust autonomy emerges from self-play. InProceedings of the International Conference on Machine Learning (ICML), 2025

2025

[6] [6]

Kazemkhani, A

S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky. GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS. InInt. Conf. on Learning Representations, 2025

2025

[7] [7]

Cornelisse, A

D. Cornelisse, A. Pandya, K. Joseph, J. Su´arez, and E. Vinitsky. Building reliable sim driving agents by scaling self-play.arXiv preprint arXiv:2502.14706, 2025

work page arXiv 2025

[8] [8]

Cornelisse and E

D. Cornelisse and E. Vinitsky. Human-compatible driving agents through data-regularized self-play reinforcement learning.Reinforcement Learning Journal, 5:2320–2344, 2024

2024

[9] [9]

X. Xu, Y . Xie, R. Li, Y . Zhao, R. Song, and W. Zhang. Hierarchical reinforcement learning for autonomous parking based on kinematic constraints. InIEEE International Conference on Robotics and Biomimetics (ROBIO), 2024

2024

[10] [10]

R. Chai, D. Liu, T. Liu, A. Tsourdos, Y . Xia, and S. Chai. Deep learning-based trajectory planning and control for autonomous ground vehicle parking maneuver.IEEE Transactions on Automation Science and Engineering, 20(3):1633–1647, 2023

2023

[11] [11]

Cornelisse*, S

D. Cornelisse*, S. Cheng*, P. Mandavilli, J. Hunt, K. Joseph, W. Doulazmi, V . Charraut, A. Gupta, J. Suarez, and E. Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2025. URL https://github.com/Emerge-Lab/ PufferDrive

2025

[12] [12]

Jaeger, D

B. Jaeger, D. Dauner, J. Beißwenger, S. Gerstenecker, K. Chitta, and A. Geiger. CaRL: Learning scalable planning policies with simple rewards. InProc. of the Conf. on Robot Learning (CoRL), 2025

2025

[13] [13]

Chang, A

W.-J. Chang, A. Rangesh, K. Joseph, M. Strong, M. Tomizuka, Y . Hu, and W. Zhan. SPACeR: Self-play anchoring with centralized reference models. InInt. Conf. on Learning Representa- tions, 2026. Poster

2026

[14] [14]

Zhang, S

C. Zhang, S. Biswas, K. Wong, K. Fallah, L. Zhang, D. Chen, S. Casas, and R. Urtasun. Learning to drive via asymmetric self-play. InEuropean Conference on Computer Vision (ECCV), pages 149–168. Springer, 2024

2024

[15] [15]

Hester, M

T. Hester, M. Vecer ´ık, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, G. Dulac-Arnold, J. Agapiou, J. Z. Leibo, and A. Gruslys. Deep Q-learning from demonstrations. InAAAI Conf. on Artificial Intelligence, pages 3223–3230, 2018. 9

2018

[16] [16]

Schmalstieg, D

F. Schmalstieg, D. Honerkamp, T. Welschehold, and A. Valada. Learning hierarchical interactive multi-object search for mobile manipulation.IEEE Robotics and Automation Letters, 8(12): 8549–8556, 2023

2023

[17] [17]

A. L. Chandra, I. Nematollahi, C. Huang, T. Welschehold, W. Burgard, and A. Valada. Diwa: Diffusion policy adaptation with world models.arXiv preprint arXiv:2508.03645, 2025

work page arXiv 2025

[18] [18]

Nematollahi, E

I. Nematollahi, E. Rosete-Beas, A. R¨ofer, T. Welschehold, A. Valada, and W. Burgard. Robot skill adaptation via soft actor-critic gaussian mixture models. InInternational Conference on Robotics and Automation (ICRA), pages 8651–8657, 2022

2022

[19] [19]

Schmalstieg, D

F. Schmalstieg, D. Honerkamp, T. Welschehold, and A. Valada. Learning long-horizon robot exploration strategies for multi-object search in continuous action spaces. InThe International Symposium of Robotics Research, pages 52–66, 2022

2022

[20] [20]

Johannink, S

T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Aparicio Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. InIEEE Int. Conf. on Robotics and Automation, pages 6023–6029, 2019

2019

[21] [21]

Residual Policy Learning

T. Silver, K. R. Allen, J. B. Tenenbaum, and L. P. Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[22] [22]

K. Rana, B. Talbot, M. Milford, and N. S¨underhauf. Residual reactive navigation: Combining classical and learned navigation strategies for deployment in unknown environments. InIEEE Int. Conf. on Robotics and Automation, pages 11493–11499, 2020

2020

[23] [23]

Jiang, Y

M. Jiang, Y . Li, S. Zhang, S. Chen, C. Wang, and M. Yang. HOPE: A reinforcement learning- based hybrid policy path planner for diverse parking scenarios.IEEE Transactions on Intelligent Transportation Systems, 2025

2025

[24] [24]

Jiang, Y

M. Jiang, Y . Li, J. Zhang, S. Zhang, and M. Yang. A diffusion-refined planner with reinforcement learning priors for confined-space parking.arXiv preprint arXiv:2510.14000, 2025

work page arXiv 2025

[25] [25]

J. Xie, Z. He, and Y . Zhu. A DRL based cooperative approach for parking space allocation in an automated valet parking system.Applied Intelligence, 53(5):5368–5387, 2023

2023

[26] [26]

G. O. Boateng, H. Si, H. Xia, X. Guo, C. Chen, I. O. Agyemang, and N. Ansari. Automated valet parking and charging: A dynamic pricing and reservation-based framework leveraging multi- agent reinforcement learning.IEEE Transactions on Intelligent Vehicles, 10(2):1010–1029, 2025

2025

[27] [27]

Kneissl, A

M. Kneissl, A. K. Madhusudhanan, A. Molin, H. Esen, and S. Hirche. A multi-vehicle control framework with application to automated valet parking.IEEE Transactions on Intelligent Transportation Systems, 22(9):5697–5707, 2021. doi:10.1109/TITS.2020.2990294

work page doi:10.1109/tits.2020.2990294 2021

[28] [28]

X. Shen, Y . Choi, A. Wong, F. Borrelli, S. Moura, and S. Woo. Parking of connected automated vehicles: Vehicle control, parking assignment, and multi-agent simulation.arXiv preprint arXiv:2402.14183, 2024

work page arXiv 2024

[29] [29]

O. Tanner. Multi-agent car parking using reinforcement learning, 2022

2022

[30] [30]

S. Chen, M. Wang, Y . Yang, and W. Song. Conflict-constrained multi-agent reinforcement learning method for parking trajectory planning. InIEEE Int. Conf. on Robotics and Automation, pages 9421–9427, 2023. doi:10.1109/ICRA48891.2023.10160698

work page doi:10.1109/icra48891.2023.10160698 2023

[31] [31]

J. Wei, N. V¨odisch, A. Rehr, C. Feist, and A. Valada. ParkDiffusion: Heterogeneous multi-agent multi-modal trajectory prediction for automated parking using diffusion models. InIEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pages 8297–8304, 2025. 10

2025

[32] [32]

J. Wei, A. Rehr, C. Feist, and A. Valada. ParkDiffusion++: Ego intention conditioned joint multi-agent trajectory prediction for automated parking using diffusion models. InIEEE Int. Conf. on Robotics and Automation, 2026

2026

[33] [33]

E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. InAAAI Conf. on Artificial Intelligence, volume 4, pages 709–715, 2004

2004

[34] [34]

D. A. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel. Path planning for autonomous vehicles in unknown semi-structured environments.The International Journal of Robotics Research, 29 (5):485–501, 2010

2010

[35] [35]

J. A. Reeds and L. A. Shepp. Optimal paths for a car that goes both forwards and backwards. Pacific Journal of Mathematics, 145(2):367–393, 1990

1990

[36] [36]

G. M. Hoffmann, C. J. Tomlin, M. Montemerlo, and S. Thrun. Autonomous automobile trajectory tracking for off-road driving: Controller design, experimental validation and racing. InAmerican Control Conference (ACC), pages 2296–2301, 2007

2007

[37] [37]

Proximal Policy Optimization Algorithms

J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017

[38] [38]

Espeholt, H

L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V . Mnih, T. Ward, Y . Doron, V . Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. InProceedings of the International Conference on Machine Learning (ICML), 2018

2018

[39] [39]

J. Suarez. PufferLib: Making reinforcement learning libraries and environments play nice. arXiv preprint arXiv:2406.12905, 2024

work page arXiv 2024

[40] [40]

X. Shen, M. Lacayo, N. Guggilla, and F. Borrelli. ParkPredict+: Multimodal intent and motion prediction for vehicles in parking lots with CNN and transformer. InIEEE Int. Conf. on Intelligent Transportation Systems, pages 3999–4004, 2022. doi:10.1109/ITSC55140.2022. 9922162

work page doi:10.1109/itsc55140.2022 2022

[41] [41]

In: 2025 IEEE Intelligent Vehicles Sym- posium (IV)

O. Dhaouadi, J. Meier, L. Wahl, J. Kaiser, L. Scalerandi, N. Wandelburg, Z. Zhou, N. Berin- panathan, H. Banzhaf, and D. Cremers. Highly accurate and diverse traffic data: The Deep- Scenario open 3D dataset. InIEEE Intelligent Vehicles Symposium, pages 377–384, 2025. doi:10.1109/IV64158.2025.11097484

work page doi:10.1109/iv64158.2025.11097484 2025

[42] [42]

Treiber, A

M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations.Physical Review E, 62(2):1805–1824, 2000

2000

[43] [43]

Zheng, R

Y . Zheng, R. Liang, K. Zheng, J. Zheng, L. Mao, J. Li, W. Gu, R. Ai, S. E. Li, X. Zhan, and J. Liu. Diffusion-based planning for autonomous driving with flexible guidance. InInt. Conf. on Learning Representations, 2025. 11 CoPark: Learning Reactive Parking via Self-Play Appendix A Training Details This section collects the implementation details supporti...

2025