CRAFT: Counterfactual-to-Interactive Reinforcement Fine-Tuning for Driving Policies
Pith reviewed 2026-05-08 17:16 UTC · model grok-4.3
The pith
CRAFT decomposes closed-loop policy gradients into dense counterfactual proxies and sparse grounded residuals for driving fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
CRAFT formulates closed-loop post-training as proxy-residual optimization: group-normalized counterfactual advantages computed from imperfect future estimates serve as a dense proxy for real closed-loop advantages, and this proxy is aligned with the closed-loop world through grounded residual correction collected from interaction-critical events. The framework further regularizes the online policy toward an EMA teacher via asymmetric KL self-distillation. Theoretically, the real closed-loop policy gradient decomposes into proxy and residual terms under the same visited-state distribution, so that an aligned proxy reduces residual variance while the grounded residual removes proxy bias.
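The page does not reproduce CRAFT's exact estimators, but the group normalization and EMA self-distillation components admit a standard GRPO-style sketch. Everything below — the function names, the normalization over a candidate group, the EMA decay, and the KL direction — is an illustrative assumption, not the paper's implementation:

```python
import numpy as np

def group_normalized_advantages(returns, eps=1e-8):
    """Group-normalize the estimated returns of G counterfactual candidate
    futures for the same state (GRPO-style, critic-free normalization)."""
    r = np.asarray(returns, dtype=np.float64)
    return (r - r.mean()) / (r.std() + eps)

def ema_update(teacher, student, decay=0.999):
    """Move the EMA teacher's parameters a small step toward the online student."""
    return {k: decay * teacher[k] + (1.0 - decay) * student[k] for k in teacher}

def asymmetric_kl(p_student, p_teacher, eps=1e-8):
    """One-directional KL(student || teacher) over a discrete action
    distribution: the frozen teacher is treated as a fixed target."""
    p = np.asarray(p_student, dtype=np.float64) + eps
    q = np.asarray(p_teacher, dtype=np.float64) + eps
    return float(np.sum(p * np.log(p / q)))
```

The asymmetry matters: only the student is pulled toward the teacher, while the teacher drifts slowly via the EMA, which is the usual way such self-distillation stabilizes on-policy updates.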
What carries the argument
Proxy-residual optimization that decomposes the closed-loop policy gradient into a dense counterfactual proxy term and a sparse grounded residual term under the same state distribution.
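In assumed notation (illustrative symbols; the paper's exact definitions are not reproduced on this page), with $\hat{A}_{\mathrm{cf}}$ the group-normalized counterfactual proxy advantage, $A_{\mathrm{real}}$ the true closed-loop advantage, and both expectations taken under the same visited-state distribution $d^{\pi_\theta}$, the decomposition reads:

```latex
\nabla_\theta J(\theta)
  = \underbrace{\mathbb{E}_{s \sim d^{\pi_\theta},\, \tau \sim \pi_\theta(\cdot \mid s)}
      \!\big[\nabla_\theta \log \pi_\theta(\tau \mid s)\, \hat{A}_{\mathrm{cf}}(s,\tau)\big]}_{\text{dense counterfactual proxy}}
  + \underbrace{\mathbb{E}_{s \sim d^{\pi_\theta},\, \tau \sim \pi_\theta(\cdot \mid s)}
      \!\big[\nabla_\theta \log \pi_\theta(\tau \mid s)\,
      \big(A_{\mathrm{real}}(s,\tau) - \hat{A}_{\mathrm{cf}}(s,\tau)\big)\big]}_{\text{grounded residual}}
```

The first term is dense and cheap but biased by imperfect future estimates; the second term is exactly that bias, estimated sparsely from grounded interaction — which is why aligning the proxy shrinks the residual's variance while the residual removes the proxy's bias.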
If this is right
- Strongest closed-loop gains on Bench2Drive hold across hierarchical planning, vision-language-action, and vocabulary-scoring policy architectures.
- The same visited-state distribution for proxy and residual terms keeps the overall gradient unbiased while lowering variance.
- Dense proxy plus sparse residual correction yields better scaling behavior and stability than either component alone.
- Transfer results improve when the proxy is aligned to the closed-loop world rather than left unadjusted.
Where Pith is reading between the lines
- The same proxy-residual split could be tested in other domains where cheap rollouts exist but real interactions are expensive, such as robotic manipulation.
- Collecting residuals only at critical events implies that data budgets should be concentrated on rare but high-impact states rather than uniform sampling.
- If the EMA teacher regularization is essential, similar distillation tricks may stabilize other on-policy methods that mix simulated and real signals.
- The approach suggests a general template for turning any dense but biased signal into a usable learning target by adding a lightweight grounded correction.
Load-bearing premise
Group-normalized counterfactual advantages from imperfect future estimates act as a sufficiently unbiased dense proxy for true closed-loop advantages, and residual corrections collected only from interaction-critical events remove the remaining bias without introducing new variance or selection effects.
What would settle it
A controlled ablation in which residual corrections are withheld while keeping the counterfactual proxy and self-distillation fixed, showing whether the closed-loop gains on Bench2Drive disappear or reverse.
Figures
original abstract
Open-loop imitation learning has advanced modern autonomous driving policy architectures, but closed-loop deployment remains vulnerable to policy-induced distribution shift. Existing post-training paradigms exhibit fundamental trade-offs: closed-loop RL fine-tuning provides grounded feedback from executed actions but is constrained by the sparsity of informative events, whereas counterfactual fine-tuning provides dense supervision over candidate futures but inherits bias from imperfect future estimates. We introduce Counterfactual-to-Interactive Reinforcement Fine-Tuning (CRAFT), an on-policy framework that formulates closed-loop post-training as proxy-residual optimization. CRAFT uses group-normalized counterfactual advantages as a dense proxy for real closed-loop advantages and aligns this proxy with the closed-loop world through grounded residual correction from interaction-critical events. To stabilize adaptation, CRAFT regularizes the online policy toward an EMA teacher via asymmetric KL self-distillation. Theoretically, CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution, reducing residual variance with an aligned proxy while mitigating proxy bias through grounded residual approximation. Empirically, CRAFT achieves the strongest closed-loop gains on Bench2Drive across hierarchical planning, vision-language-action, and vocabulary-scoring architectures. Ablations, scaling behavior, stability analyses, and transfer results further validate the complementary roles of dense counterfactual proxy and grounded residual correction. Project page: https://currychen77.github.io/CRAFT.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces CRAFT, an on-policy fine-tuning framework for autonomous driving policies that decomposes the closed-loop policy gradient into a dense group-normalized counterfactual proxy term and a grounded residual correction term from interaction-critical events. It claims this decomposition occurs under the same visited-state distribution, reducing bias from imperfect estimates while providing dense supervision, and demonstrates empirical superiority on the Bench2Drive benchmark across multiple policy architectures, supported by ablations and scaling studies.
Significance. If the theoretical decomposition holds with the claimed properties, CRAFT offers a principled method to combine the density of counterfactual supervision with the grounding of closed-loop interaction, potentially mitigating distribution shift in deployed driving policies. The validation across hierarchical planning, vision-language-action, and vocabulary-scoring architectures, along with ablations on the complementary roles of proxy and residual, strengthens the contribution. Reproducible results on a public benchmark and stability analyses are positive aspects.
major comments (2)
- [Abstract] The abstract asserts that 'CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution'. However, the residual correction is defined to be nonzero only on interaction-critical events, a selective subset of states. This induces a mismatched state distribution for the residual term, undermining the shared-distribution assumption central to the bias-reduction guarantee, and no quantitative bound on the resulting approximation error is provided.
- [Theoretical Analysis] Despite the strong theoretical claims, the abstract and summary provide no equations, proof sketch, or derivation details for the proxy-residual decomposition. This absence makes it difficult to assess whether the group-normalized counterfactual advantages are sufficiently unbiased, or whether the residual term corrects bias without circular dependence on the policy being optimized.
minor comments (1)
- [Abstract] The abstract asserts 'strongest closed-loop gains' without including specific numerical results or baseline comparisons; these should be summarized with effect sizes in the abstract for clarity.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment below with clarifications on the theoretical formulation and indicate the revisions we will incorporate.
point-by-point responses
- Referee: [Abstract] The abstract asserts that 'CRAFT decomposes the real closed-loop policy gradient into proxy and residual terms under the same visited-state distribution'. However, the residual correction is defined to be nonzero only on interaction-critical events, a selective subset of states. This induces a mismatched state distribution for the residual term, undermining the shared-distribution assumption central to the bias-reduction guarantee, and no quantitative bound on the resulting approximation error is provided.
Authors: We appreciate the referee's careful reading of this point. In the CRAFT formulation, both the proxy and residual terms are defined as expectations under the identical on-policy visited-state distribution induced by the current policy. The residual correction is nonzero only on interaction-critical events but is explicitly zero on all other states within the same distribution; it does not sample from or induce a separate state measure. This structure preserves the shared-distribution property for the overall decomposition. We will add an explicit clarification of this point to the abstract and derive a quantitative bound on the residual approximation error for inclusion in the revised theoretical analysis. revision: yes
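The authors' argument — a residual that is identically zero off critical events, evaluated under the one shared on-policy state distribution rather than a separate measure — can be illustrated with a toy numeric check. All quantities below (the state distribution, the criticality threshold, the bias of +0.5) are invented for illustration only:

```python
import numpy as np

rng = np.random.default_rng(0)

# One on-policy visited-state distribution; no separate measure for the residual.
n = 100_000
states = rng.uniform(0.0, 1.0, size=n)
critical = states > 0.9  # rare interaction-critical events (~10% of states)

# Toy signals: the counterfactual proxy equals the real advantage except on
# critical states, where it carries a constant bias of +0.5.
a_real = np.sin(6.0 * states)
a_proxy = np.where(critical, a_real + 0.5, a_real)
# Grounded residual: nonzero only on critical events, explicitly zero elsewhere.
residual = np.where(critical, a_real - a_proxy, 0.0)

proxy_only = a_proxy.mean()              # biased estimate
corrected = (a_proxy + residual).mean()  # proxy + indicator-masked residual
target = a_real.mean()                   # real closed-loop signal
```

Because the residual is an indicator-masked term evaluated on the same samples, `corrected` recovers `target`, while `proxy_only` is off by roughly 0.5 times the critical-event rate; no second state distribution is introduced.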
- Referee: [Theoretical Analysis] Despite the strong theoretical claims, the abstract and summary provide no equations, proof sketch, or derivation details for the proxy-residual decomposition. This absence makes it difficult to assess whether the group-normalized counterfactual advantages are sufficiently unbiased, or whether the residual term corrects bias without circular dependence on the policy being optimized.
Authors: We agree that the abstract and summary focus on high-level intuition rather than equations. The full derivation of the proxy-residual decomposition—including the group normalization of counterfactual advantages, the conditions ensuring the proxy remains sufficiently unbiased, and the demonstration that the residual term corrects bias using grounded interaction data without circular dependence on the policy—is presented in the Theoretical Analysis section of the manuscript. To address the concern, we will incorporate a concise proof sketch with key equations into the abstract and introduction of the revised version. revision: partial
Circularity Check
No significant circularity detected in CRAFT's proxy-residual decomposition
full rationale
The paper's central theoretical claim decomposes the closed-loop policy gradient into a group-normalized counterfactual proxy plus a grounded residual correction, both asserted to operate under the same on-policy visited-state distribution. No equations or definitions in the provided text reduce this decomposition to a self-referential fit, a renamed input, or a load-bearing self-citation chain. The counterfactual advantages are computed from future estimates and the residual is collected from interaction-critical events; these are presented as complementary but independent mechanisms rather than one being defined in terms of the other. The derivation therefore remains self-contained against external benchmarks and does not collapse to its inputs by construction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
- [1] Y. Hu, J. Yang, L. Chen, K. Li, C. Sima, X. Zhu, S. Chai, S. Du, T. Lin, W. Wang, L. Lu, X. Jia, Q. Liu, J. Dai, Y. Qiao, and H. Li. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023.
- [2] B. Jiang, S. Chen, Q. Xu, B. Liao, J. Chen, H. Zhou, Q. Zhang, W. Liu, C. Huang, and X. Wang. VAD: Vectorized scene representation for efficient autonomous driving. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8340–8350, 2023.
- [3] W. Sun, X. Lin, Y. Shi, C. Zhang, H. Wu, and S. Zheng. SparseDrive: End-to-end autonomous driving via sparse scene representation. In 2025 IEEE International Conference on Robotics and Automation (ICRA), pages 8795–8801. IEEE, 2025.
- [4]
- [5]
- [6] T. Xia, Y. Li, L. Zhou, J. Yao, K. Xiong, H. Sun, B. Wang, K. Ma, G. Chen, H. Ye, et al. DriveLaw: Unifying planning and video generation in a latent driving world. arXiv preprint arXiv:2512.23421, 2025.
- [7]
- [8] Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, X. Bi, H. Zhang, M. Zhang, Y. Li, Y. Wu, et al. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. arXiv preprint arXiv:2402.03300, 2024.
- [9] D. Ye, Z. Liu, M. Sun, B. Shi, P. Zhao, H. Wu, H. Yu, S. Yang, X. Wu, Q. Guo, et al. Mastering complex control in MOBA games with deep reinforcement learning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 34, pages 6672–6679, 2020.
- [10] Y. Gao, B. Shi, X. Du, L. Wang, G. Chen, Z. Lian, F. Qiu, G. Han, W. Wang, D. Ye, et al. Learning diverse policies in MOBA games via macro-goals. Advances in Neural Information Processing Systems, 34:16171–16182, 2021.
- [11] B. Liao, S. Chen, H. Yin, B. Jiang, C. Wang, S. Yan, X. Zhang, X. Li, Y. Zhang, Q. Zhang, et al. DiffusionDrive: Truncated diffusion model for end-to-end autonomous driving. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 12037–12047, 2025.
- [12]
- [13] Z. Zheng, S. Chen, H. Yin, X. Zhang, J. Zou, X. Wang, Q. Zhang, and L. Zhang. ResAD: Normalized residual trajectory modeling for end-to-end autonomous driving. arXiv preprint arXiv:2510.08562, 2025.
- [14] Y. Zheng, T. Tan, B. Huang, E. Liu, R. Liang, J. Zhang, J. Cui, G. Chen, K. Ma, H. Ye, et al. Unleashing the potential of diffusion models for end-to-end autonomous driving. arXiv preprint arXiv:2602.22801, 2026.
- [15] H. Fu, D. Zhang, Z. Zhao, J. Cui, D. Liang, C. Zhang, D. Zhang, H. Xie, B. Wang, and X. Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 24823–24834, 2025.
- [16] K. Renz, L. Chen, E. Arani, and O. Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 11993–12003, 2025.
- [17] Z. Zhou, T. Cai, S. Z. Zhao, Y. Zhang, Z. Huang, B. Zhou, and J. Ma. AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint arXiv:2506.13757, 2025.
- [18]
- [19] Z. Li, K. Li, S. Wang, S. Lan, Z. Yu, Y. Ji, Z. Li, Z. Zhu, J. Kautz, Z. Wu, et al. Hydra-MDP: End-to-end multimodal planning with multi-target hydra-distillation. arXiv preprint arXiv:2406.06978, 2024.
- [20] W. Yao, Z. Li, S. Lan, Z. Wang, X. Sun, J. M. Alvarez, and Z. Wu. DriveSuprim: Towards precise trajectory selection for end-to-end planning. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 11910–11918, 2026.
- [21]
- [22]
- [23] J. Schulman, F. Wolski, P. Dhariwal, A. Radford, and O. Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017.
- [24]
- [25] B. Jaeger, D. Dauner, J. Beißwenger, S. Gerstenecker, K. Chitta, and A. Geiger. CaRL: Learning scalable planning policies with simple rewards. In 9th Annual Conference on Robot Learning, 2025.
- [26] H. Gao, S. Chen, B. Jiang, B. Liao, Y. Shi, X. Guo, Y. Pu, X. Li, W. Liu, Q. Zhang, et al. RAD: Training an end-to-end driving policy via large-scale 3DGS-based reinforcement learning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [27] H. Gao, S. Chen, Y. Zhu, Y. Song, W. Liu, Q. Zhang, and X. Wang. RAD-2: Scaling reinforcement learning in a generator-discriminator framework. arXiv preprint arXiv:2604.15308, 2026.
- [28]
- [29] Z. Huang, Z. Sheng, Z. Wan, Y. Qu, J. You, S. Jiang, and S. Chen. DriveVLM-RL: Neuroscience-inspired reinforcement learning with vision-language models for safe and deployable autonomous driving. arXiv preprint arXiv:2603.18315, 2026.
- [30] Q. Li, X. Jia, S. Wang, and J. Yan. Think2Drive: Efficient reinforcement learning by thinking with latent world model for autonomous driving (in CARLA-v2). In European Conference on Computer Vision, pages 142–158. Springer, 2024.
- [31] Z. Yang, X. Jia, Q. Li, X. Yang, M. Yao, and J. Yan. RAW2Drive: Reinforcement learning with aligned world models for end-to-end autonomous driving (in CARLA v2). In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2025.
- [32] H. Liu, T. Li, H. Yang, L. Chen, C. Wang, K. Guo, H. Tian, H. Li, H. Li, and C. Lv. Reinforced refinement with self-aware expansion for end-to-end autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2026.
- [33] R. Rafailov, A. Sharma, E. Mitchell, C. D. Manning, S. Ermon, and C. Finn. Direct preference optimization: Your language model is secretly a reward model. Advances in Neural Information Processing Systems, 36:53728–53741, 2023.
- [34] Y. Li, K. Xiong, X. Guo, F. Li, S. Yan, G. Xu, L. Zhou, L. Chen, H. Sun, B. Wang, et al. ReCogDrive: A reinforced cognitive framework for end-to-end autonomous driving. arXiv preprint arXiv:2506.08052, 2025.
- [35]
- [36] R. Yasarla, D. Hegde, S. Han, H.-P. Cheng, Y. Shi, M. Sadeghigooghari, S. Mahajan, A. Bhattacharyya, L. Liu, R. Garrepalli, et al. Generative scenario rollouts for end-to-end autonomous driving. arXiv preprint arXiv:2601.11475, 2026.
- [37] S. Shang, B. Zhan, Y. Yan, Y. Wang, Y. Li, Y. An, X. Wang, J. Liu, L. Hou, L. Fan, et al. DynVLA: Learning world dynamics for action reasoning in autonomous driving. arXiv preprint arXiv:2603.11041, 2026.
- [38]
- [39]
- [40] D. Dauner, M. Hallgarten, T. Li, X. Weng, Z. Huang, Z. Yang, H. Li, I. Gilitschenski, B. Ivanovic, M. Pavone, et al. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. Advances in Neural Information Processing Systems, 37:28706–28719, 2024.
- [41] H. Lin, Y. Zhang, W. Ding, J. Wu, and D. Zhao. Model-based policy adaptation for closed-loop end-to-end autonomous driving. In The Thirty-ninth Annual Conference on Neural Information Processing Systems.
- [42] URL: https://openreview.net/forum?id=4OLbpaTKJe
- [43]
- [44]
- [45] A. Dosovitskiy, G. Ros, F. Codevilla, A. Lopez, and V. Koltun. CARLA: An open urban driving simulator. In Conference on Robot Learning, pages 1–16. PMLR, 2017.
- [46] L. Nguyen, M. Fauth, B. Jaeger, D. Dauner, M. Igl, A. Geiger, and K. Chitta. LEAD: Minimizing learner-expert asymmetry in end-to-end driving. In Conference on Computer Vision and Pattern Recognition (CVPR), 2026.
- [47] Y. Tang, Z. Xu, Z. Meng, and E. Cheng. HiP-AD: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 25605–25615, 2025.
- [48] J. Hu, J. K. Liu, H. Xu, and W. Shen. REINFORCE++: Stabilizing critic-free policy optimization with global advantage normalization. arXiv preprint arXiv:2501.03262, 2025.
- [49]
- [50] C. Sima, K. Renz, K. Chitta, L. Chen, H. Zhang, C. Xie, J. Beißwenger, P. Luo, A. Geiger, and H. Li. DriveLM: Driving with graph visual question answering. In European Conference on Computer Vision, pages 256–274. Springer, 2024.
- [51] S. Gerstenecker, A. Geiger, and K. Renz. PlanT 2.0: Exposing biases and structural flaws in closed-loop driving. arXiv preprint arXiv:2511.07292, 2025.
Appendix A (Proofs), fragment: this appendix provides the proofs for the formal statements in Section 3; unless noted otherwise, expectations are taken over s ∼ d^{π_θ}_real and candidate trajectory τ ∼ π_θ(· | s). A.1 Exact D...
- [52] MindDrive (vision-language-action)
- [53] HiP-AD (hierarchical planning)
- [54] SparseDriveV2 (vocabulary scoring, perception task)
Figure 7: Operational paradigms of driving policies. Remaining label residue from the figure: [57] hierarchical and multi-granularity planning (perception task), [58] sparse online mapping, [59] sparse tracking (planning task), [60] coarse factorized scoring, [61] fine-grained trajectory scoring.
- HiP-AD [46]. The ego decision is represented as joint path-speed candidates with 48 ego modes. CRAFT and the non-PPO baselines update only the final classification branch in the last plan-refinement layer. When PPO is used, a lightweight critic is trained in addition to ...