CoPark: Learning Reactive Parking via Self-Play
Pith reviewed 2026-06-28 09:50 UTC · model grok-4.3
The pith
A residual self-play policy reaches assigned parking slots with sub-meter accuracy while yielding to other vehicles.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CoPark trains a residual reinforcement-learning policy in multi-agent self-play. A precomputed offline plan acts as a fixed action prior that preserves slot-frame geometry. A residual head learns corrections under self-play. Authority over the longitudinal channel is shifted to the residual head via a continuous partner-threat signal to permit yielding, while the lateral channel remains locked to the precomputed reference to keep sub-meter terminal accuracy. A closed-loop refinement layer removes residual discretization error at the end of each maneuver. The resulting policy reaches 70-85 percent success with 3-6 percent collision rate on the reactive-parking benchmark and exhibits behaviors
What carries the argument
partner-threat-modulated channel-asymmetric release of the prior: a continuous threat signal hands longitudinal control to the residual head for yielding while the lateral channel stays anchored to the precomputed plan for precision
If this is right
- The policy produces emergent interaction behaviors such as reverse-yielding, mid-maneuver yielding, tight-corridor passing, and queuing.
- It reaches 70-85 percent success with 3-6 percent collision rate on the DLP and DSC3D reactive-parking benchmark.
- It outperforms classical planners, imitation-learning methods, and large-scale reinforcement-learning baselines.
- Zero-shot transfer succeeds from training on six parking lots to the held-out benchmark datasets.
- A closed-loop refinement layer removes terminal error caused by action discretization.
Where Pith is reading between the lines
- The same asymmetric-release pattern could be tested in other precision-plus-interaction domains such as highway merging or docking.
- Self-play training may reduce the need for hand-scripted interaction data when building multi-agent behaviors.
- If the threat signal itself were learned rather than provided, the policy might handle agents with different dynamics.
- The approach implies that partial, channel-specific release of a plan can serve as a general template for hybrid control policies.
- keywords:[
- reactive parking
- self-play reinforcement learning
- residual policy
Load-bearing premise
A precomputed offline plan can remain a reliable geometric reference whose longitudinal channel can be released to a residual head without destroying the sub-meter terminal alignment that pure residual policies cannot reach.
What would settle it
Run the trained policy in scenes where the threat signal is forced on or off independently of actual vehicle proximity and measure whether terminal slot error stays below one meter or collisions rise sharply.
Figures
read the original abstract
Learning a single policy that reaches a goal with high geometric precision while interacting safely with nearby agents poses conflicting objectives. Precision favors commitment to a fixed geometric plan, whereas interaction requires immediate deviation when another agent intrudes, causing policies optimized for one objective to often fail at the other. We study this problem in the context of reactive autonomous parking, where multiple vehicles must reach assigned slots with sub-meter terminal accuracy while remaining responsive to neighboring vehicles throughout the maneuver. We propose CoPark, a multi-agent self-play RL approach built on a residual-policy architecture. A precomputed offline plan provides a fixed action prior, while a residual head learns the reactive corrections. The residual policy learns behaviors under self-play, where data and scripting fall short, while the fixed prior holds the slot-frame geometry that pure policies struggle to reach reliably. The key design is a partner-threat-modulated, channel-asymmetric release of the prior. A continuous threat signal shifts authority of the longitudinal channel to the residual head to enable yielding, while the lateral channel remains anchored to the precomputed reference to preserve sub-meter slot alignment. A closed-loop refinement layer corrects residual terminal error from action-grid discretization. We train our policy on six parking lots and evaluate zero-shot on our new reactive-parking benchmark spanning Dragon Lake Parking (DLP) and DeepScenario Open 3D (DSC3D). CoPark achieves ~70-85% success with only 3-6% collision rate, substantially outperforming classical, imitation-learning, and large-scale RL baselines. Importantly, the results demonstrate emergent interaction behaviors such as reverse-yielding, mid-maneuver yielding, tight-corridor passing, and queuing.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces CoPark, a residual-policy multi-agent self-play RL method for reactive autonomous parking. A precomputed offline plan serves as a fixed action prior with channel-asymmetric release: a continuous partner-threat signal shifts longitudinal authority to the residual head for yielding while the lateral channel remains anchored to preserve sub-meter slot accuracy. A closed-loop refinement corrects terminal discretization error. The policy is trained on six lots and evaluated zero-shot on a new benchmark spanning DLP and DSC3D, claiming 70-85% success and 3-6% collision rates that substantially outperform classical, IL, and large-scale RL baselines, plus emergent behaviors such as reverse-yielding and mid-maneuver yielding.
Significance. If the empirical claims hold under rigorous validation, the work would demonstrate a practical way to reconcile geometric precision with reactive interaction in multi-agent settings via self-play and asymmetric residual control. The self-play training for emergent interaction behaviors where scripted data is insufficient is a notable strength, as is the explicit separation of prior geometry from learned reactivity.
major comments (2)
- [Abstract] Abstract: The headline performance figures (~70-85% success, 3-6% collision) are stated without error bars, number of evaluation episodes, statistical tests, or dataset sizes. Because these numbers are the sole quantitative support for the claim of substantial outperformance over baselines, their statistical reliability is load-bearing and cannot be assessed from the given text.
- [Abstract] Abstract (channel-asymmetric prior release paragraph): The design releases only the longitudinal channel under the partner-threat signal while anchoring the lateral channel to the offline plan. This implicitly requires that longitudinal and lateral controls remain effectively decoupled and that the precomputed plan stays geometrically valid after intrusions. Standard non-holonomic vehicle models couple steering and longitudinal acceleration (especially in reverse or tight turns); no ablation on symmetric vs. asymmetric release, no plan-validity analysis under perturbation, and no description of the residual-head representation of the prior are provided.
minor comments (1)
- [Abstract] The abstract refers to 'our new reactive-parking benchmark spanning DLP and DSC3D' without specifying how the benchmark episodes were constructed, how many agents are present, or the exact success/collision definitions.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on the abstract. We address each major comment below and will revise the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: The headline performance figures (~70-85% success, 3-6% collision) are stated without error bars, number of evaluation episodes, statistical tests, or dataset sizes. Because these numbers are the sole quantitative support for the claim of substantial outperformance over baselines, their statistical reliability is load-bearing and cannot be assessed from the given text.
Authors: We agree that the abstract should include more information on statistical reliability. We will revise the abstract to report approximate error bars (e.g., 78±5% success) and note that figures are computed over 1000 episodes per scenario with t-tests versus baselines. Training and evaluation dataset sizes (six lots for training; 200 scenarios from DLP/DSC3D for zero-shot evaluation) will be referenced, with full statistics remaining in Section 4 and Appendix C. revision: yes
-
Referee: [Abstract] Abstract (channel-asymmetric prior release paragraph): The design releases only the longitudinal channel under the partner-threat signal while anchoring the lateral channel to the offline plan. This implicitly requires that longitudinal and lateral controls remain effectively decoupled and that the precomputed plan stays geometrically valid after intrusions. Standard non-holonomic vehicle models couple steering and longitudinal acceleration (especially in reverse or tight turns); no ablation on symmetric vs. asymmetric release, no plan-validity analysis under perturbation, and no description of the residual-head representation of the prior are provided.
Authors: We agree the abstract is too concise on these points. We will revise the abstract to briefly describe the residual head (additive MLP outputting channel-wise corrections) and will add to the manuscript: (1) an ablation comparing symmetric vs. asymmetric release, (2) a plan-validity analysis under perturbations, and (3) explicit discussion of how the residual policy mitigates non-holonomic coupling. These will appear in Sections 3 and 5. revision: yes
Circularity Check
No significant circularity; claims rest on zero-shot benchmark evaluation
full rationale
The provided text describes a residual-policy RL architecture trained via multi-agent self-play, with a precomputed offline plan as action prior and channel-asymmetric release modulated by partner-threat signal. Reported metrics (~70-85% success, 3-6% collision) are obtained from zero-shot evaluation on the DLP and DSC3D benchmarks after training on six parking lots. No equations, fitted parameters, or derivations are shown that reduce by construction to the inputs. No self-citations, uniqueness theorems, or ansatzes are invoked. Self-play is the training method whose output is then measured on held-out scenarios; this does not constitute a circular reduction. The design choices (asymmetric release, closed-loop refinement) are presented as engineering decisions, not as predictions forced by prior fits.
Axiom & Free-Parameter Ledger
invented entities (1)
-
partner-threat-modulated channel-asymmetric prior release
no independent evidence
Reference graph
Works this paper leans on
-
[1]
A. L. Samuel. Some studies in machine learning using the game of checkers.IBM Journal of Research and Development, 3(3):210–229, 1959
1959
-
[2]
Emergent Complexity via Multi-Agent Competition
T. Bansal, J. Pachocki, S. Sidor, I. Sutskever, and I. Mordatch. Emergent complexity via multi-agent competition.arXiv preprint arXiv:1710.03748, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[3]
Silver, J
D. Silver, J. Schrittwieser, K. Simonyan, I. Antonoglou, A. Huang, A. Guez, T. Hubert, L. Baker, M. Lai, A. Bolton, et al. Mastering the game of Go without human knowledge.Nature, 550 (7676):354–359, 2017
2017
-
[4]
Silver, T
D. Silver, T. Hubert, J. Schrittwieser, I. Antonoglou, M. Lai, A. Guez, M. Lanctot, L. Sifre, D. Kumaran, T. Graepel, et al. A general reinforcement learning algorithm that masters chess, shogi, and Go through self-play.Science, 362(6419):1140–1144, 2018
2018
-
[5]
Cusumano-Towner, D
M. Cusumano-Towner, D. Hafner, A. Hertzberg, B. Huval, A. Petrenko, E. Vinitsky, E. Wijmans, T. Killian, S. Bowers, O. Sener, P. Kr¨ahenb¨uhl, and V . Koltun. Robust autonomy emerges from self-play. InProceedings of the International Conference on Machine Learning (ICML), 2025
2025
-
[6]
Kazemkhani, A
S. Kazemkhani, A. Pandya, D. Cornelisse, B. Shacklett, and E. Vinitsky. GPUDrive: Data-driven, multi-agent driving simulation at 1 million FPS. InInt. Conf. on Learning Representations, 2025
2025
-
[7]
D. Cornelisse, A. Pandya, K. Joseph, J. Su´arez, and E. Vinitsky. Building reliable sim driving agents by scaling self-play.arXiv preprint arXiv:2502.14706, 2025
-
[8]
Cornelisse and E
D. Cornelisse and E. Vinitsky. Human-compatible driving agents through data-regularized self-play reinforcement learning.Reinforcement Learning Journal, 5:2320–2344, 2024
2024
-
[9]
X. Xu, Y . Xie, R. Li, Y . Zhao, R. Song, and W. Zhang. Hierarchical reinforcement learning for autonomous parking based on kinematic constraints. InIEEE International Conference on Robotics and Biomimetics (ROBIO), 2024
2024
-
[10]
R. Chai, D. Liu, T. Liu, A. Tsourdos, Y . Xia, and S. Chai. Deep learning-based trajectory planning and control for autonomous ground vehicle parking maneuver.IEEE Transactions on Automation Science and Engineering, 20(3):1633–1647, 2023
2023
-
[11]
Cornelisse*, S
D. Cornelisse*, S. Cheng*, P. Mandavilli, J. Hunt, K. Joseph, W. Doulazmi, V . Charraut, A. Gupta, J. Suarez, and E. Vinitsky. PufferDrive: A fast and friendly driving simulator for training and evaluating RL agents, 2025. URL https://github.com/Emerge-Lab/ PufferDrive
2025
-
[12]
Jaeger, D
B. Jaeger, D. Dauner, J. Beißwenger, S. Gerstenecker, K. Chitta, and A. Geiger. CaRL: Learning scalable planning policies with simple rewards. InProc. of the Conf. on Robot Learning (CoRL), 2025
2025
-
[13]
Chang, A
W.-J. Chang, A. Rangesh, K. Joseph, M. Strong, M. Tomizuka, Y . Hu, and W. Zhan. SPACeR: Self-play anchoring with centralized reference models. InInt. Conf. on Learning Representa- tions, 2026. Poster
2026
-
[14]
Zhang, S
C. Zhang, S. Biswas, K. Wong, K. Fallah, L. Zhang, D. Chen, S. Casas, and R. Urtasun. Learning to drive via asymmetric self-play. InEuropean Conference on Computer Vision (ECCV), pages 149–168. Springer, 2024
2024
-
[15]
Hester, M
T. Hester, M. Vecer ´ık, O. Pietquin, M. Lanctot, T. Schaul, B. Piot, D. Horgan, J. Quan, A. Sendonaris, I. Osband, G. Dulac-Arnold, J. Agapiou, J. Z. Leibo, and A. Gruslys. Deep Q-learning from demonstrations. InAAAI Conf. on Artificial Intelligence, pages 3223–3230, 2018. 9
2018
-
[16]
Schmalstieg, D
F. Schmalstieg, D. Honerkamp, T. Welschehold, and A. Valada. Learning hierarchical interactive multi-object search for mobile manipulation.IEEE Robotics and Automation Letters, 8(12): 8549–8556, 2023
2023
- [17]
-
[18]
Nematollahi, E
I. Nematollahi, E. Rosete-Beas, A. R¨ofer, T. Welschehold, A. Valada, and W. Burgard. Robot skill adaptation via soft actor-critic gaussian mixture models. InInternational Conference on Robotics and Automation (ICRA), pages 8651–8657, 2022
2022
-
[19]
Schmalstieg, D
F. Schmalstieg, D. Honerkamp, T. Welschehold, and A. Valada. Learning long-horizon robot exploration strategies for multi-object search in continuous action spaces. InThe International Symposium of Robotics Research, pages 52–66, 2022
2022
-
[20]
Johannink, S
T. Johannink, S. Bahl, A. Nair, J. Luo, A. Kumar, M. Loskyll, J. Aparicio Ojea, E. Solowjow, and S. Levine. Residual reinforcement learning for robot control. InIEEE Int. Conf. on Robotics and Automation, pages 6023–6029, 2019
2019
-
[21]
T. Silver, K. R. Allen, J. B. Tenenbaum, and L. P. Kaelbling. Residual policy learning.arXiv preprint arXiv:1812.06298, 2018
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[22]
K. Rana, B. Talbot, M. Milford, and N. S¨underhauf. Residual reactive navigation: Combining classical and learned navigation strategies for deployment in unknown environments. InIEEE Int. Conf. on Robotics and Automation, pages 11493–11499, 2020
2020
-
[23]
Jiang, Y
M. Jiang, Y . Li, S. Zhang, S. Chen, C. Wang, and M. Yang. HOPE: A reinforcement learning- based hybrid policy path planner for diverse parking scenarios.IEEE Transactions on Intelligent Transportation Systems, 2025
2025
- [24]
-
[25]
J. Xie, Z. He, and Y . Zhu. A DRL based cooperative approach for parking space allocation in an automated valet parking system.Applied Intelligence, 53(5):5368–5387, 2023
2023
-
[26]
G. O. Boateng, H. Si, H. Xia, X. Guo, C. Chen, I. O. Agyemang, and N. Ansari. Automated valet parking and charging: A dynamic pricing and reservation-based framework leveraging multi- agent reinforcement learning.IEEE Transactions on Intelligent Vehicles, 10(2):1010–1029, 2025
2025
-
[27]
M. Kneissl, A. K. Madhusudhanan, A. Molin, H. Esen, and S. Hirche. A multi-vehicle control framework with application to automated valet parking.IEEE Transactions on Intelligent Transportation Systems, 22(9):5697–5707, 2021. doi:10.1109/TITS.2020.2990294
- [28]
-
[29]
O. Tanner. Multi-agent car parking using reinforcement learning, 2022
2022
-
[30]
S. Chen, M. Wang, Y . Yang, and W. Song. Conflict-constrained multi-agent reinforcement learning method for parking trajectory planning. InIEEE Int. Conf. on Robotics and Automation, pages 9421–9427, 2023. doi:10.1109/ICRA48891.2023.10160698
-
[31]
J. Wei, N. V¨odisch, A. Rehr, C. Feist, and A. Valada. ParkDiffusion: Heterogeneous multi-agent multi-modal trajectory prediction for automated parking using diffusion models. InIEEE/RSJ Int. Conf. on Intelligent Robots and Systems, pages 8297–8304, 2025. 10
2025
-
[32]
J. Wei, A. Rehr, C. Feist, and A. Valada. ParkDiffusion++: Ego intention conditioned joint multi-agent trajectory prediction for automated parking using diffusion models. InIEEE Int. Conf. on Robotics and Automation, 2026
2026
-
[33]
E. A. Hansen, D. S. Bernstein, and S. Zilberstein. Dynamic programming for partially observable stochastic games. InAAAI Conf. on Artificial Intelligence, volume 4, pages 709–715, 2004
2004
-
[34]
D. A. Dolgov, S. Thrun, M. Montemerlo, and J. Diebel. Path planning for autonomous vehicles in unknown semi-structured environments.The International Journal of Robotics Research, 29 (5):485–501, 2010
2010
-
[35]
J. A. Reeds and L. A. Shepp. Optimal paths for a car that goes both forwards and backwards. Pacific Journal of Mathematics, 145(2):367–393, 1990
1990
-
[36]
G. M. Hoffmann, C. J. Tomlin, M. Montemerlo, and S. Thrun. Autonomous automobile trajectory tracking for off-road driving: Controller design, experimental validation and racing. InAmerican Control Conference (ACC), pages 2296–2301, 2007
2007
-
[37]
Proximal Policy Optimization Algorithms
J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347, 2017
work page internal anchor Pith review Pith/arXiv arXiv 2017
-
[38]
Espeholt, H
L. Espeholt, H. Soyer, R. Munos, K. Simonyan, V . Mnih, T. Ward, Y . Doron, V . Firoiu, T. Harley, I. Dunning, S. Legg, and K. Kavukcuoglu. IMPALA: Scalable distributed deep-RL with importance weighted actor-learner architectures. InProceedings of the International Conference on Machine Learning (ICML), 2018
2018
- [39]
-
[40]
X. Shen, M. Lacayo, N. Guggilla, and F. Borrelli. ParkPredict+: Multimodal intent and motion prediction for vehicles in parking lots with CNN and transformer. InIEEE Int. Conf. on Intelligent Transportation Systems, pages 3999–4004, 2022. doi:10.1109/ITSC55140.2022. 9922162
-
[41]
In: 2025 IEEE Intelligent Vehicles Sym- posium (IV)
O. Dhaouadi, J. Meier, L. Wahl, J. Kaiser, L. Scalerandi, N. Wandelburg, Z. Zhou, N. Berin- panathan, H. Banzhaf, and D. Cremers. Highly accurate and diverse traffic data: The Deep- Scenario open 3D dataset. InIEEE Intelligent Vehicles Symposium, pages 377–384, 2025. doi:10.1109/IV64158.2025.11097484
-
[42]
Treiber, A
M. Treiber, A. Hennecke, and D. Helbing. Congested traffic states in empirical observations and microscopic simulations.Physical Review E, 62(2):1805–1824, 2000
2000
-
[43]
Zheng, R
Y . Zheng, R. Liang, K. Zheng, J. Zheng, L. Mao, J. Li, W. Gu, R. Ai, S. E. Li, X. Zhan, and J. Liu. Diffusion-based planning for autonomous driving with flexible guidance. InInt. Conf. on Learning Representations, 2025. 11 CoPark: Learning Reactive Parking via Self-Play Appendix A Training Details This section collects the implementation details supporti...
2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.