pith. machine review for the scientific record.

arxiv: 2605.04470 · v1 · submitted 2026-05-06 · 💻 cs.LG · cs.RO

Recognition: unknown

CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 17:16 UTC · model grok-4.3

classification 💻 cs.LG cs.RO
keywords counterfactual reinforcement learning · closed-loop fine-tuning · autonomous driving · policy optimization · residual correction · imitation learning · Bench2Drive

The pith

CRAFT decomposes closed-loop policy gradients into dense counterfactual proxies and sparse grounded residuals for driving fine-tuning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes CRAFT as an on-policy method that treats counterfactual density and closed-loop grounding as complementary rather than competing signals. It uses group-normalized counterfactual advantages as a dense proxy for the true advantages that would arise from executing the policy in the real world, then adds a residual correction term drawn only from interaction-critical events to remove bias. An asymmetric KL self-distillation term keeps the online policy close to an exponential-moving-average teacher. If this decomposition works, the resulting updates stay low-variance yet unbiased under the same visited-state distribution that the deployed policy actually sees, producing stronger closed-loop behavior than either pure counterfactual or pure interactive baselines.
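As a concrete reading of that recipe, the sketch below combines the three ingredients in a single loss over a group of candidate trajectories at one state. It is an editorial reconstruction from the abstract, not the paper's implementation: the function names, the direction of the asymmetric KL term, and the coefficient beta_kl are all assumptions.

    # Sketch of a CRAFT-style loss (editorial reconstruction; names and the KL
    # direction are assumptions, not the paper's code).
    import torch

    def craft_loss(logp, counterfactual_returns, grounded_advantage, is_critical,
                   logp_teacher, beta_kl=0.1):
        # Dense proxy: group-normalized counterfactual advantages over the
        # candidate futures scored at the same state.
        adv_proxy = (counterfactual_returns - counterfactual_returns.mean()) / (
            counterfactual_returns.std() + 1e-6)
        proxy_term = -(adv_proxy.detach() * logp).mean()

        # Sparse grounded residual: nonzero only on interaction-critical events
        # actually visited by the current policy.
        residual = is_critical.float() * (grounded_advantage - adv_proxy)
        residual_term = -(residual.detach() * logp).mean()

        # Asymmetric KL self-distillation toward an EMA teacher (one-sided pull of
        # the online policy toward the teacher; the direction is assumed here).
        kl_term = (logp_teacher.exp() * (logp_teacher - logp)).sum()

        return proxy_term + residual_term + beta_kl * kl_term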

Core claim

CRAFT formulates closed-loop post-training as proxy-residual optimization: group-normalized counterfactual advantages computed from imperfect future estimates serve as a dense proxy for real closed-loop advantages, and this proxy is aligned to the closed-loop world through grounded residual correction collected from interaction-critical events. The framework further regularizes the online policy toward an EMA teacher via asymmetric KL self-distillation. Theoretically the real closed-loop policy gradient decomposes into proxy and residual terms under the same visited-state distribution, so that an aligned proxy reduces residual variance while the grounded residual removes proxy bias.
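One way to write the claimed decomposition, reconstructed editorially from the abstract (symbols are ours, not necessarily the paper's notation): Â_cf is the group-normalized counterfactual advantage, A_real the true closed-loop advantage, and both expectations run over the same visited-state distribution d^{π_θ}.

    \nabla_\theta J(\theta)
      = \mathbb{E}_{s \sim d^{\pi_\theta},\, \tau \sim \pi_\theta(\cdot \mid s)}
          \big[ A_{\mathrm{real}}(s,\tau)\, \nabla_\theta \log \pi_\theta(\tau \mid s) \big]
      = \underbrace{\mathbb{E}\big[ \hat{A}_{\mathrm{cf}}(s,\tau)\, \nabla_\theta \log \pi_\theta(\tau \mid s) \big]}_{\text{dense counterfactual proxy}}
      + \underbrace{\mathbb{E}\big[ \big( A_{\mathrm{real}}(s,\tau) - \hat{A}_{\mathrm{cf}}(s,\tau) \big)\, \nabla_\theta \log \pi_\theta(\tau \mid s) \big]}_{\text{grounded residual}}

In this form the residual behaves like a control-variate correction: the better the proxy tracks the real advantage, the smaller and lower-variance the residual term, while its presence keeps the overall gradient unbiased.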

What carries the argument

Proxy-residual optimization that decomposes the closed-loop policy gradient into a dense counterfactual proxy term and a sparse grounded residual term under the same state distribution.

If this is right

  • Strongest closed-loop gains on Bench2Drive hold across hierarchical planning, vision-language-action, and vocabulary-scoring policy architectures.
  • The same visited-state distribution for proxy and residual terms keeps the overall gradient unbiased while lowering variance (a toy illustration follows this list).
  • Dense proxy plus sparse residual correction yields better scaling behavior and stability than either component alone.
  • Transfer results improve when the proxy is aligned to the closed-loop world rather than left unadjusted.
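A toy illustration of the second and third bullets, entirely editorial and not drawn from the paper's experiments: a biased but aligned proxy estimated from many cheap samples, plus a grounded residual estimated from a few real samples, matches the mean of the real-only estimator with far less variance. All distributions and scales below are invented.

    # Editorial toy: "score" stands in for grad log pi; every constant is made up.
    import numpy as np

    rng = np.random.default_rng(0)

    def estimate(n_real=8, n_cf=512):
        # Scarce grounded samples: true advantage = 2 + s + closed-loop noise.
        s = rng.normal(size=n_real)
        a_real = 2.0 + s + rng.normal(size=n_real)
        a_proxy_at_s = 2.0 + 0.7 * s + 0.3 * rng.normal(size=n_real)   # biased but aligned proxy
        # Dense cheap counterfactual samples scored by the same biased proxy.
        s_cf = rng.normal(size=n_cf)
        a_proxy_cf = 2.0 + 0.7 * s_cf + 0.3 * rng.normal(size=n_cf)

        naive = (a_real * s).mean()                    # grounded only: unbiased, noisy
        proxy_only = (a_proxy_cf * s_cf).mean()        # dense only: precise, biased
        combined = proxy_only + ((a_real - a_proxy_at_s) * s).mean()   # proxy + residual
        return naive, proxy_only, combined

    runs = np.array([estimate() for _ in range(2000)])
    print(runs.mean(axis=0))   # roughly [1.0, 0.7, 1.0]: the residual removes the proxy bias
    print(runs.var(axis=0))    # the combined estimator varies far less than grounded-only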

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same proxy-residual split could be tested in other domains where cheap rollouts exist but real interactions are expensive, such as robotic manipulation.
  • Collecting residuals only at critical events implies that data budgets should be concentrated on rare but high-impact states rather than uniform sampling (see the sketch after this list).
  • If the EMA teacher regularization is essential, similar distillation tricks may stabilize other on-policy methods that mix simulated and real signals.
  • The approach suggests a general template for turning any dense but biased signal into a usable learning target by adding a lightweight grounded correction.
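A minimal sketch of what event-concentrated residual collection could look like; the trigger criteria (time-to-collision, hard braking, collision) and thresholds are illustrative assumptions, not the paper's definition of interaction-critical events.

    # Hypothetical filter that spends the grounded-correction budget only on
    # interaction-critical transitions; trigger definitions are illustrative.
    from dataclasses import dataclass

    @dataclass
    class Transition:
        time_to_collision: float   # seconds, reported by the simulator
        decel: float               # m/s^2, magnitude of ego deceleration
        collided: bool

    def is_interaction_critical(t: Transition,
                                ttc_thresh: float = 2.0,
                                decel_thresh: float = 4.0) -> bool:
        return t.collided or t.time_to_collision < ttc_thresh or t.decel > decel_thresh

    def select_residual_batch(rollout: list[Transition], budget: int) -> list[Transition]:
        """Keep only critical transitions for grounded residual correction."""
        critical = [t for t in rollout if is_interaction_critical(t)]
        return critical[:budget]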

Load-bearing premise

Group-normalized counterfactual advantages from imperfect future estimates act as a sufficiently unbiased dense proxy for true closed-loop advantages, and residual corrections collected only from interaction-critical events remove the remaining bias without introducing new variance or selection effects.
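For concreteness, group normalization of counterfactual advantages typically looks like the snippet below (a generic GRPO-style normalization; the paper's exact estimator and reward are not specified in the material above). The premise only holds if the underlying future estimates rank candidate trajectories roughly as the closed-loop world would.

    # Generic group normalization over candidate futures scored at one state.
    import numpy as np

    def group_normalized_advantages(candidate_returns: np.ndarray, eps: float = 1e-6) -> np.ndarray:
        """candidate_returns: estimated returns of the candidate trajectories at a single state."""
        return (candidate_returns - candidate_returns.mean()) / (candidate_returns.std() + eps)

    adv = group_normalized_advantages(np.array([0.8, 0.2, 0.5, 0.9]))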

What would settle it

A controlled ablation in which residual corrections are withheld while keeping the counterfactual proxy and self-distillation fixed, showing whether the closed-loop gains on Bench2Drive disappear or reverse.
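The decisive run can be phrased as one row of a small ablation grid, sketched below with invented variant names; only the residual flag changes while the proxy and self-distillation stay on.

    # Hypothetical ablation grid for the proposed test (variant names are ours).
    ABLATIONS = {
        "full_craft":  dict(counterfactual_proxy=True,  grounded_residual=True,  kl_self_distill=True),
        "no_residual": dict(counterfactual_proxy=True,  grounded_residual=False, kl_self_distill=True),  # the decisive run
        "no_proxy":    dict(counterfactual_proxy=False, grounded_residual=True,  kl_self_distill=True),
        "no_distill":  dict(counterfactual_proxy=True,  grounded_residual=True,  kl_self_distill=False),
    }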

Figures

Figures reproduced from arXiv: 2605.04470 by Danqi Zhao, Hao Cheng, Keyu Chen, Nanfei Ye, Sifa Zheng, Wenchao Sun, Yida Wang.

Figure 1. Two existing post-training paradigms exhibit complementary trade-offs. Closed-loop RL …
Figure 2. CRAFT combines dense counterfactual proxy supervision, grounded residual correction …
Figure 3. Fine-grained ability profiles. Radar plots compare fine-tuning methods across driving policies. Each axis represents a scenario-specific capability, and higher values indicate better performance.
Figure 4. Scaling behavior and reward dynamics. (a) Closed-loop performance as the Bench2Drive scale increases. (b) Reward evolution during RL training on a challenging scenario, with final evaluation performance.
Figure 5. Training stability. Panels (a)–(d) report the coefficients of variation for KL terms across fine-tuning methods, and panel (e) reports the coefficient of variation for the expected advantage across driving policies.
Figure 6. Policy distribution shift in a pedestrian-crossing scenario. Top: a closed-loop scene snapshot with SparseDriveV2 as the ego vehicle. Bottom: the policy distributions over the trajectory vocabulary for the pre-trained and fine-tuned policies. CRAFT shifts probability mass toward braking-compatible modes.
Figure 7. Operational paradigms of driving policies. HiP-AD [46]: the ego decision is represented as joint path-speed candidates with 48 ego modes; CRAFT and the non-PPO baselines update only the final classification branch in the last plan-refinement layer, and a lightweight critic is trained in addition when PPO is used. MindDrive [24]: the high-level speed decision is chosen by a VLM decision expert …
Original abstract

Open-loop imitation learning has advanced modern autonomous driving policy architectures, but closed-loop deployment remains vulnerable to policy-induced distribution shift. Existing post-training paradigms exhibit fundamental trade-offs: closed-loop RL fine-tuning provides grounded feedback from executed actions but is constrained by the sparsity of informative events, whereas counterfactual fine-tuning provides dense supervision over candidate futures but inherits bias from imperfect future estimates. We introduce Counterfactual-to-Interactive Reinforcement Fine-Tuning (CRAFT), an on-policy framework that formulates closed-loop post-training as proxy-residual optimization. CRAFT uses group-normalized counterfactual advantages as a dense proxy for real closed-loop advantages and aligns this proxy with the closed-loop world through grounded residual correction from interaction-critical events. To stabilize adaptation, CRAFT regularizes the online policy toward an EMA teacher via asymmetric KL self-distillation. Theoretically, CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution, reducing residual variance with an aligned proxy while mitigating proxy bias through grounded residual approximation. Empirically, CRAFT achieves the strongest closed-loop gains on Bench2Drive across hierarchical planning, vision-language-action, and vocabulary-scoring architectures. Ablations, scaling behavior, stability analyses, and transfer results further validate the complementary roles of dense counterfactual proxy and grounded residual correction. Project page: https://currychen77.github.io/CRAFT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces CRAFT, an on-policy fine-tuning framework for autonomous driving policies that decomposes the closed-loop policy gradient into a dense group-normalized counterfactual proxy term and a grounded residual correction term from interaction-critical events. It claims this decomposition occurs under the same visited-state distribution, reducing bias from imperfect estimates while providing dense supervision, and demonstrates empirical superiority on the Bench2Drive benchmark across multiple policy architectures, supported by ablations and scaling studies.

Significance. If the theoretical decomposition holds with the claimed properties, CRAFT offers a principled method to combine the density of counterfactual supervision with the grounding of closed-loop interaction, potentially mitigating distribution shift in deployed driving policies. The validation across hierarchical planning, vision-language-action, and vocabulary-scoring architectures, along with ablations on the complementary roles of proxy and residual, strengthens the contribution. Reproducible results on a public benchmark and stability analyses are positive aspects.

major comments (2)
  1. [Abstract] The abstract asserts that 'CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution'. However, the residual correction is defined to be nonzero only on interaction-critical events, a selective subset of states. This induces a mismatched state distribution for the residual term, undermining the shared-distribution assumption central to the bias-reduction guarantee. No quantitative bound on the resulting approximation error is provided.
  2. [Theoretical Analysis] Despite the strong theoretical claims, the abstract and summary provide no equations, proof sketch, or derivation details for the proxy-residual decomposition. This absence makes it difficult to assess whether the group-normalized counterfactual advantages are sufficiently unbiased or if the residual term independently corrects bias without circular dependence on the policy being optimized.
minor comments (1)
  1. [Abstract] The abstract asserts 'strongest closed-loop gains' without including specific numerical results or baseline comparisons; these should be summarized with effect sizes in the abstract for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications on the theoretical formulation and indicate the revisions we will incorporate.

Point-by-point responses
  1. Referee: [Abstract] The abstract asserts that 'CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution'. However, the residual correction is defined to be nonzero only on interaction-critical events, a selective subset of states. This induces a mismatched state distribution for the residual term, undermining the shared-distribution assumption central to the bias-reduction guarantee. No quantitative bound on the resulting approximation error is provided.

    Authors: We appreciate the referee's careful reading of this point. In the CRAFT formulation, both the proxy and residual terms are defined as expectations under the identical on-policy visited-state distribution induced by the current policy. The residual correction is nonzero only on interaction-critical events but is explicitly zero on all other states within the same distribution; it does not sample from or induce a separate state measure. This structure preserves the shared-distribution property for the overall decomposition. We will add an explicit clarification of this point to the abstract and derive a quantitative bound on the residual approximation error for inclusion in the revised theoretical analysis (a formal reading of this point is sketched after these responses). revision: yes

  2. Referee: [Theoretical Analysis] Despite the strong theoretical claims, the abstract and summary provide no equations, proof sketch, or derivation details for the proxy-residual decomposition. This absence makes it difficult to assess whether the group-normalized counterfactual advantages are sufficiently unbiased or if the residual term independently corrects bias without circular dependence on the policy being optimized.

    Authors: We agree that the abstract and summary focus on high-level intuition rather than equations. The full derivation of the proxy-residual decomposition—including the group normalization of counterfactual advantages, the conditions ensuring the proxy remains sufficiently unbiased, and the demonstration that the residual term corrects bias using grounded interaction data without circular dependence on the policy—is presented in the Theoretical Analysis section of the manuscript. To address the concern, we will incorporate a concise proof sketch with key equations into the abstract and introduction of the revised version. revision: partial
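One way to formalize the rebuttal's first point, in our notation rather than the paper's: write the residual with an explicit indicator over the interaction-critical set C, so that it remains an expectation under the same visited-state distribution while being zero elsewhere.

    \Delta(s,\tau) = \mathbb{1}[\, s \in \mathcal{C} \,]\,
        \big( A_{\mathrm{real}}(s,\tau) - \hat{A}_{\mathrm{cf}}(s,\tau) \big),
    \qquad
    \text{residual term} = \mathbb{E}_{s \sim d^{\pi_\theta},\, \tau \sim \pi_\theta(\cdot \mid s)}
        \big[ \Delta(s,\tau)\, \nabla_\theta \log \pi_\theta(\tau \mid s) \big].

Under this reading the referee's concern becomes a question about the masked-out part: whatever proxy error remains on states outside C is left uncorrected, and the quantitative bound the authors promise would need to control exactly that term.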

Circularity Check

0 steps flagged

No significant circularity detected in CRAFT's proxy-residual decomposition

Full rationale

The paper's central theoretical claim decomposes the closed-loop policy gradient into a group-normalized counterfactual proxy plus a grounded residual correction, both asserted to operate under the same on-policy visited-state distribution. No equations or definitions in the provided text reduce this decomposition to a self-referential fit, a renamed input, or a load-bearing self-citation chain. The counterfactual advantages are computed from future estimates and the residual is collected from interaction-critical events; these are presented as complementary but independent mechanisms rather than one being defined in terms of the other. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no visibility into specific free parameters, background axioms, or new entities; the method implicitly relies on standard RL assumptions such as Markovian dynamics and the existence of informative interaction events, but none are enumerated or justified here.

pith-pipeline@v0.9.0 · 5562 in / 1277 out tokens · 164713 ms · 2026-05-08T17:16:07.813425+00:00 · methodology


Reference graph

Works this paper leans on

59 extracted references · 29 canonical work pages · 8 internal anchors

  1. [1]

    Y . Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y . Qiao, and H. Li. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023

  2. [2]

    B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. Vad: Vectorized scene representation for efficient autonomous driving. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023

  3. [3]

    W. Sun, X. Lin, Y . Shi, C. Zhang, H. Wu, and S. Zheng. Sparsedrive: End-to-end autonomous driving via sparse scene representation. In2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025

  4. [4]

    Z. Peng, W. Ding, Y . You, Y . Chen, W. Luo, T. Tian, Y . Cao, A. Sharma, D. Xu, B. Ivanovic, et al. Counterfactual vla: Self-reflective vision-language-action model with adaptive reasoning.arXiv preprint arXiv:2512.24426, 2025

  5. [5]

    C. Dang, S. Ang, Y . Li, H. Tian, J. Wang, G. Li, H. Ye, J. Ma, L. Chen, and Y . Wang. Drivefine: Refining-augmented masked diffusion vla for precise and robust driving.arXiv preprint arXiv:2602.14577, 2026

  6. [6]

    T. Xia, Y . Li, L. Zhou, J. Yao, K. Xiong, H. Sun, B. Wang, K. Ma, G. Chen, H. Ye, et al. Drivelaw: Unifying planning and video generation in a latent driving world.arXiv preprint arXiv:2512.23421, 2025

  7. [7]

    Y . Li, S. Shang, W. Liu, B. Zhan, H. Wang, Y . Wang, Y . Chen, X. Wang, Y . An, C. Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025

  8. [8]

    Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y . Li, Y . Wu, et al. Deepseekmath: Pushing the limits of mathematical reasoning in open language models.arXiv preprint arXiv:2402.03300, 2024

  9. [9]

    D. Ye, Z. Liu, M. Sun, B. Shi, P. Zhao, H. Wu, H. Yu, S. Yang, X. Wu, Q. Guo, et al. Mastering complex control in moba games with deep reinforcement learning. InProceedings of the AAAI conference on artificial intelligence, volume 34, pages 6672–6679, 2020

  10. [10]

    Y . Gao, B. Shi, X. Du, L. Wang, G. Chen, Z. Lian, F. Qiu, G. Han, W. Wang, D. Ye, et al. Learning diverse policies in moba games via macro-goals.Advances in Neural Information Processing Systems, 34: 16171–16182, 2021

  11. [11]

    B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y . Zhang, Q. Zhang, et al. Diffusiondrive: Truncated diffusion model for end-to-end autonomous driving. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025

  12. [12]

    L. Liu, C. Jia, G. Yu, Z. Song, J. Li, F. Jia, P. Wu, X. Hao, and Y . Luo. Guideflow: Constraint-guided flow matching for planning in end-to-end autonomous driving.arXiv preprint arXiv:2511.18729, 2025

  13. [13]

    Z. Zheng, S. Chen, H. Yin, X. Zhang, J. Zou, X. Wang, Q. Zhang, and L. Zhang. Resad: Normalized residual trajectory modeling for end-to-end autonomous driving.arXiv preprint arXiv:2510.08562, 2025

  14. [14]

    Y . Zheng, T. Tan, B. Huang, E. Liu, R. Liang, J. Zhang, J. Cui, G. Chen, K. Ma, H. Ye, et al. Unleashing the potential of diffusion models for end-to-end autonomous driving.arXiv preprint arXiv:2602.22801, 2026

  15. [15]

    H. Fu, D. Zhang, Z. Zhao, J. Cui, D. Liang, C. Zhang, D. Zhang, H. Xie, B. Wang, and X. Bai. Orion: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025

  16. [16]

    K. Renz, L. Chen, E. Arani, and O. Sinavski. Simlingo: Vision-only closed-loop autonomous driving with language-action alignment. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025

  17. [17]

    Z. Zhou, T. Cai, S. Z. Zhao, Y . Zhang, Z. Huang, B. Zhou, and J. Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757, 2025

  18. [18]

    X. Wang, Q. Liu, W. Ding, Z. Yang, W. Li, C. Liu, B. Li, K. Zhan, X. Lang, and W. Chen. Unifying language-action understanding and generation for autonomous driving. arXiv preprint arXiv:2603.01441, 2026

  19. [19]

    Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y . Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra-mdp: End-to-end multimodal planning with multi-target hydra-distillation.arXiv preprint arXiv:2406.06978, 2024

  20. [20]

    W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. Drivesuprim: Towards precise trajectory selection for end-to-end planning. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11910–11918, 2026

  21. [21]

    Z. Li, W. Yao, Z. Wang, X. Sun, J. Chen, N. Chang, M. Shen, Z. Wu, S. Lan, and J. M. Alvarez. Generalized trajectory scoring for end-to-end multimodal planning.arXiv preprint arXiv:2506.06664, 2025

  22. [22]

    W. Sun, X. Lin, K. Chen, Z. Pei, X. Li, Y . Shi, and S. Zheng. Sparsedrivev2: Scoring is all you need for end-to-end autonomous driving.arXiv preprint arXiv:2603.29163, 2026

  23. [23]

    J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017

  24. [24]

    H. Fu, D. Zhang, Z. Zhao, J. Cui, H. Xie, B. Wang, G. Chen, D. Liang, and X. Bai. Minddrive: A vision-language-action model for autonomous driving via online reinforcement learning.arXiv preprint arXiv:2512.13636, 2025

  25. [25]

    B. Jaeger, D. Dauner, J. Beißwenger, S. Gerstenecker, K. Chitta, and A. Geiger. Carl: Learning scalable planning policies with simple rewards. In9th Annual Conference on Robot Learning, 2025

  26. [26]

    H. Gao, S. Chen, B. Jiang, B. Liao, Y . Shi, X. Guo, Y . Pu, X. Li, W. Liu, Q. Zhang, et al. Rad: Training an end-to-end driving policy via large-scale 3dgs-based reinforcement learning. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  27. [27]

    H. Gao, S. Chen, Y . Zhu, Y . Song, W. Liu, Q. Zhang, and X. Wang. Rad-2: Scaling reinforcement learning in a generator-discriminator framework.arXiv preprint arXiv:2604.15308, 2026

  28. [28]

    C. Ni, G. Zhao, X. Wang, Z. Zhu, W. Qin, X. Chen, G. Jia, G. Huang, and W. Mei. Recondreamer-rl: En- hancing reinforcement learning via diffusion-based scene reconstruction.arXiv preprint arXiv:2508.08170, 2025

  29. [29]

    Z. Huang, Z. Sheng, Z. Wan, Y . Qu, J. You, S. Jiang, and S. Chen. Drivevlm-rl: Neuroscience-inspired reinforcement learning with vision-language models for safe and deployable autonomous driving.arXiv preprint arXiv:2603.18315, 2026

  30. [30]

    Q. Li, X. Jia, S. Wang, and J. Yan. Think2drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in carla-v2). InEuropean conference on computer vision, pages 142–158. Springer, 2024

  31. [31]

    Z. Yang, X. Jia, Q. Li, X. Yang, M. Yao, and J. Yan. Raw2drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in carla v2). InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025

  32. [32]

    H. Liu, T. Li, H. Yang, L. Chen, C. Wang, K. Guo, H. Tian, H. Li, H. Li, and C. Lv. Reinforced refinement with self-aware expansion for end-to-end autonomous driving.IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026

  33. [33]

    R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model.Advances in Neural Information Processing Systems, 36: 53728–53741, 2023

  34. [34]

    Y . Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, et al. Recogdrive: A reinforced cognitive framework for end-to-end autonomous driving.arXiv preprint arXiv:2506.08052, 2025

  35. [35]

    J. Zou, S. Chen, B. Liao, Z. Zheng, Y . Song, L. Zhang, Q. Zhang, W. Liu, and X. Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving.arXiv preprint arXiv:2512.07745, 2025

  36. [36]

    R. Yasarla, D. Hegde, S. Han, H.-P. Cheng, Y . Shi, M. Sadeghigooghari, S. Mahajan, A. Bhattacharyya, L. Liu, R. Garrepalli, et al. Generative scenario rollouts for end-to-end autonomous driving.arXiv preprint arXiv:2601.11475, 2026

  37. [37]

    S. Shang, B. Zhan, Y. Yan, Y. Wang, Y. Li, Y. An, X. Wang, J. Liu, L. Hou, L. Fan, et al. Dynvla: Learning world dynamics for action reasoning in autonomous driving. arXiv preprint arXiv:2603.11041, 2026

  38. [38]

    R. Liang, Y . Zheng, K. Zheng, T. Tan, J. Li, L. Mao, Z. Wang, G. Chen, H. Ye, J. Liu, et al. Dichotomous diffusion policy optimization.arXiv preprint arXiv:2601.00898, 2025

  39. [39]

    C. Chen, Y . Yang, Z. Tan, Y . Wang, R. Zhan, H. Liu, X. Mao, J. Bao, X. Tang, L. Yang, et al. Devil is in narrow policy: Unleashing exploration in driving vla models.arXiv preprint arXiv:2603.06049, 2026

  40. [40]

    D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al. Navsim: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024

  41. [41]

    H. Lin, Y . Zhang, W. Ding, J. Wu, and D. Zhao. Model-based policy adaptation for closed-loop end-to-end autonomous driving. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems,

  42. [42]

    URL: https://openreview.net/forum?id=4OLbpaTKJe

  43. [43]

    T. Yan, T. Tang, X. Gui, Y . Li, J. Zhesng, W. Huang, L. Kong, W. Han, X. Zhou, X. Zhang, et al. Ad-r1: Closed-loop reinforcement learning for end-to-end autonomous driving with impartial world models.arXiv preprint arXiv:2511.20325, 2025

  44. [44]

    X. Jia, Z. Yang, Q. Li, Z. Zhang, and J. Yan. Bench2drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving.arXiv preprint arXiv:2406.03877, 2024

  45. [45]

    A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V . Koltun. Carla: An open urban driving simulator. In Conference on robot learning, pages 1–16. PMLR, 2017

  46. [46]

    L. Nguyen, M. Fauth, B. Jaeger, D. Dauner, M. Igl, A. Geiger, and K. Chitta. Lead: Minimizing learner- expert asymmetry in end-to-end driving. InConference on Computer Vision and Pattern Recognition (CVPR), 2026

  47. [47]

    Y . Tang, Z. Xu, Z. Meng, and E. Cheng. Hip-ad: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 25605–25615, 2025

  48. [48]

    J. Hu, J. K. Liu, H. Xu, and W. Shen. Reinforce++: Stabilizing critic-free policy optimization with global advantage normalization.arXiv preprint arXiv:2501.03262, 2025

  49. [49]

    K. Chen, W. Sun, H. Cheng, and S. Zheng. Rift: Group-relative rl fine-tuning for realistic and controllable traffic simulation.arXiv preprint arXiv:2505.03344, 2025

  50. [50]

    C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

  51. [51]

    S. Gerstenecker, A. Geiger, and K. Renz. Plant 2.0: Exposing biases and structural flaws in closed-loop driving. arXiv preprint arXiv:2511.07292, 2025

  52. [52]

    MindDrive Vision Language Action

  53. [53]

    HiP-AD Hierarchical Planning

  54. [54]

    SparseDriveV2 Vocabulary Scoring Perception Task

  55. [57]

    Hierarchical and Multi-granularity Planning Perception Task

  56. [58]

    Sparse Online Mapping

  57. [59]

    Sparse Tracking Planning Task

  58. [60]

    Coarse Factorized Scoring

  59. [61]

    Fine-grained Trajectory Scoring