pith. machine review for the scientific record.

arxiv: 2605.12625 · v2 · submitted 2026-05-12 · 💻 cs.RO · cs.CV

Recognition: 2 Lean theorem links

Driving Intents Amplify Planning-Oriented Reinforcement Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 04:55 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords driving policies · reinforcement learning · mode collapse · preference optimization · intent conditioning · flow matching · autonomous driving · classifier-free guidance

The pith

Intent-conditioned sampling and multi-intent preference optimization expand driving policy distributions to surpass human demonstrations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that continuous-action driving policies trained on single demonstrations per scene suffer mode collapse, so even best-of-N selection cannot recover missing maneuver alternatives and performance caps below human levels. DIAL counters this in two stages: first by conditioning a flow-matching action head on discrete intent labels via classifier-free guidance to spread samples across distinct modes, then by applying multi-intent GRPO that keeps all intent classes inside every preference group during fine-tuning. This combination lifts best-of-128 rater feedback scores from the prior ceiling of 8.5 to 9.14, exceeding the human-driven baseline of 8.13 for the first time, while also raising held-out performance from 7.681 to 8.211. The central insight is that the limiting factor in preference RL for such policies is not the update rule alone but the need to enlarge and then protect the support of the sampling distribution being optimized.
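As a toy illustration of that support argument (not anything from the paper), the sketch below contrasts best-of-N selection over a collapsed sampler with one whose support spans several maneuver modes; the mode names and scores are invented for illustration.

```python
import random

# Hypothetical maneuver modes and rater scores (invented numbers):
# the demonstrated mode scores 8.1, but a different maneuver would score 9.0 here.
SCORES = {"keep_lane": 8.1, "nudge_left": 9.0, "yield": 7.2}

def best_of_n(sampler, n=128, seed=0):
    """Oracle best-of-N: return the highest score among N sampled maneuvers."""
    rng = random.Random(seed)
    return max(SCORES[sampler(rng)] for _ in range(n))

# Collapsed policy: every sample lands in the demonstrated mode.
collapsed = lambda rng: "keep_lane"
# Expanded policy: samples spread across all modes.
expanded = lambda rng: rng.choice(list(SCORES))

print(best_of_n(collapsed))  # 8.1: oracle selection cannot exceed the demonstrated mode
print(best_of_n(expanded))   # 9.0: the better maneuver is now inside the support
```

No matter how large N becomes, the collapsed sampler never reaches 9.0, which is the sense in which best-of-N is capped by the support of the sampling distribution.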

Core claim

DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance to expand the sampling distribution along distinct maneuver modes and break single-demonstration mode collapse. In the second stage, multi-intent GRPO spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Evaluated on WOD-E2E with eight rule-derived intents, intent-CFG sampling raises best-of-128 RFS to 9.14, surpassing both the prior best of 8.5 and the human demonstration of 8.13, while multi-intent GRPO improves held-out RFS from 7.681 to 8.211.
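The paper provides no code, but the classifier-free-guidance recipe named in the claim can be sketched roughly as follows; the `velocity_net` interface, the guidance weight `w`, the waypoint shape, and the use of `None` for the unconditional branch are all assumptions for illustration, not the authors' implementation.

```python
import torch

def sample_actions_with_intent_cfg(velocity_net, obs, intent, *, w=2.0, steps=10):
    """Euler integration of a flow-matching action head with intent CFG (sketch).

    velocity_net(a, t, obs, intent) is a hypothetical interface returning a
    velocity field; intent=None selects the unconditional branch that CFG
    dropout exposes during imitation training.
    """
    a = torch.randn(obs.shape[0], 8, 2)          # assumed: 8 future waypoints, (x, y)
    dt = 1.0 / steps
    for k in range(steps):
        t = torch.full((obs.shape[0],), k * dt)
        v_cond = velocity_net(a, t, obs, intent)
        v_uncond = velocity_net(a, t, obs, None)
        # Guided velocity: push samples toward the intent-conditioned mode.
        a = a + dt * (v_uncond + w * (v_cond - v_uncond))
    return a
```

Sampling once per intent label with a guidance weight above 1 is what spreads the best-of-N candidates across maneuver modes rather than around the single demonstration.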

What carries the argument

The DIAL two-stage framework, which first uses intent-CFG sampling on a flow-matching head to enlarge coverage over discrete maneuver modes and then applies multi-intent GRPO to maintain that coverage during preference updates.
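A minimal sketch of how such a multi-intent preference group might be assembled, assuming S rollouts per intent and the usual standardized group-relative advantage of GRPO; `policy.sample`, `rater`, and the intent names are hypothetical, not the paper's.

```python
import statistics

INTENTS = ["keep_lane", "left_turn", "right_turn", "lane_change_left",
           "lane_change_right", "yield", "stop", "accelerate"]   # illustrative labels

def multi_intent_group(policy, rater, scene, rollouts_per_intent=2):
    """Build one preference group that spans every intent class (K = S * |C|).

    Returns (trajectory, intent, advantage) triples, where the advantage is the
    group-relative, standardized rater score as in standard GRPO.
    """
    group = [(policy.sample(scene, intent), intent)
             for intent in INTENTS
             for _ in range(rollouts_per_intent)]
    scores = [rater(scene, traj) for traj, _ in group]
    mu = statistics.mean(scores)
    sigma = statistics.pstdev(scores) or 1.0   # guard against a degenerate group
    return [(traj, intent, (score - mu) / sigma)
            for (traj, intent), score in zip(group, scores)]
```

Because every group contains all intent classes, rollouts from currently dispreferred maneuvers keep receiving gradient signal, which is the mechanism the paper credits for preventing re-collapse.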

If this is right

  • Competitive vision-to-action and vision-language-action SFT baselines remain below the human demonstration even at best-of-128.
  • Intent-CFG sampling alone lifts the performance ceiling to RFS 9.14 at best-of-128.
  • Multi-intent GRPO raises held-out RFS from 7.681 to 8.211 while every single-intent baseline peaks lower and degrades by the end of training.
  • The bottleneck in preference RL for continuous-action policies is expanding and preserving the sampling distribution rather than the update mechanism alone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same intent-amplification pattern could be tested in other continuous-control settings where only one demonstration trajectory exists per scene.
  • Preference optimization may benefit more from explicit distribution-expansion steps than from further refinement of the update rule.
  • If the fixed eight intents miss important modes, replacing them with learned discrete clusters might further increase the reachable performance ceiling.

Load-bearing premise

The eight rule-derived intents are assumed to span the semantically distinct maneuver modes that matter for preference alignment.
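The paper's exact rules are not reproduced on this page, so the following is only a schematic of the kind of deterministic, rule-derived labeling the premise assumes; the feature names, thresholds, and label set are placeholders rather than the authors' definition.

```python
def rule_derived_intent(traj, *, curv_thresh=0.05, lat_thresh=1.0, accel_thresh=0.5):
    """Assign a discrete maneuver intent from simple trajectory features (sketch).

    `traj` is assumed to expose mean curvature (1/m), net lateral offset (m),
    and mean longitudinal acceleration (m/s^2); all thresholds are illustrative.
    """
    if abs(traj.mean_curvature) > curv_thresh:
        return "left_turn" if traj.mean_curvature > 0 else "right_turn"
    if abs(traj.lateral_offset) > lat_thresh:
        return "lane_change_left" if traj.lateral_offset > 0 else "lane_change_right"
    if traj.mean_accel < -accel_thresh:
        return "decelerate_or_stop"
    if traj.mean_accel > accel_thresh:
        return "accelerate"
    return "keep_lane"
```

If the true maneuver space contains modes that no such rule separates, intent-CFG cannot amplify them, which is exactly the sensitivity this premise leaves open.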

What would settle it

If removing the intent-conditioning stage from DIAL causes best-of-128 RFS to fall back below the human demonstration of 8.13 on the same evaluation set, the claim that intent amplification is required to exceed the demonstrated ceiling would be falsified.

Figures

Figures reproduced from arXiv: 2605.12625 by Benjin Zhu, Chengmin Yang, Hengtong Lu, Jifeng Dai, Pengfei Jing, Victor Shea-Jay Huang, Yan Xie.

Figure 1. Driving intents amplify planning-oriented RL by exposing within-scene preference contrast. (a) Under SFT + ordinary sampling, K rollouts collapse into one maneuver basin and their RFS scores are nearly identical (∆RFS ≈ 0), so the group-relative advantage is uninformative. (b) Under intent-conditioned CFG sampling, K=8 rollouts (one per driving intent) spread across distinct basins and their RFS scores spr…

Figure 2. Overview of DIAL. (a) Stage 1: CFG Imitation Training. The diffusion action head is conditioned on a discrete intent c_i; CFG dropout (p_drop) teaches the model both conditional and unconditional action distributions. (b) Stage 2: Multi-Intent GRPO. Per scene, K=16 trajectories are sampled as S=2 rollouts × |C|=8 intents, scored by the RFS rater, and used to update the policy via GRPO against the SFT refe…

Figure 3. Pre-RL proposal ceiling. Best-of-K RFS vs. budget K. Gray dashed: four SFT baselines all saturate below GT (8.13, red dashed) at K=128. Blue: intent-conditioned SFT under four strategies (gt, top-rater, predicted, random), all cross GT at K≈8. Navy: 8-intent equal-budget pooling reaches 9.14 at K=128.

Figure 5. Held-out RFS throughout RL training for DIAL and the four single-intent baselines from Section 4.4, all sharing the same per-scene budget K = 16. Three patterns are visible. First, DIAL (multi-intent) rises to its peak held-out RFS of 8.211 and subsequently declines only modestly, maintaining a substantially higher level than all single-intent variants throughout. The relative stability reflects that…
Original abstract

Continuous-action policies trained on a single demonstrated trajectory per scene suffer from mode collapse: samples cluster around the demonstrated maneuver and the policy cannot represent semantically distinct alternatives. Under preference-based evaluation, this caps best-of-N performance -- even oracle selection cannot recover what the sampling distribution does not contain. We introduce DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework for preference-aligned continuous-action driving policies. In the first stage, DIAL conditions the flow-matching action head on a discrete intent label with classifier-free guidance (CFG), which expands the sampling distribution along distinct maneuver modes and breaks single-demonstration mode collapse. In the second stage, DIAL carries this expanded distribution into preference RL through multi-intent GRPO, which spans all intent classes within every preference group and prevents fine-tuning from re-collapsing around the currently preferred mode. Instantiated for end-to-end driving with eight rule-derived intents and evaluated on WOD-E2E: competitive Vision-to-Action (VA) and Vision-Language-Action (VLA) Supervised Finetuning (SFT) baselines plateau below the human-driven demonstration at best-of-128, with the strongest prior (RAP) capping at Rater Feedback Score (RFS) 8.5 even with best-of-64; intent-CFG sampling lifts this ceiling to RFS 9.14 at best-of-128, surpassing both the prior best (RAP 8.5) and the human-driven demonstration (8.13) for the first time; and multi-intent GRPO improves held-out RFS from 7.681 to 8.211, while every single-intent baseline peaks lower and degrades by training end. These results suggest that the bottleneck of preference RL on continuous-action policies trained from demonstrations is not only how to update the policy, but to expand and preserve the sampling distribution being optimized.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces DIAL, a two-stage Driving-Intent-Amplified reinforcement Learning framework. Stage 1 conditions a flow-matching action head on discrete intent labels via classifier-free guidance (CFG) to expand the sampling distribution and break single-demonstration mode collapse. Stage 2 applies multi-intent GRPO during preference RL to preserve coverage across intent classes. On WOD-E2E, intent-CFG sampling reaches RFS 9.14 at best-of-128 (surpassing RAP at 8.5 and human demonstrations at 8.13), while multi-intent GRPO raises held-out RFS from 7.681 to 8.211.

Significance. If the empirical gains prove robust, the work identifies distribution expansion as a key bottleneck in preference RL for continuous-action driving policies and supplies a concrete mechanism (intent-CFG + multi-intent GRPO) that demonstrably exceeds both prior methods and human performance on RFS. The approach is directly applicable to end-to-end vision-to-action and vision-language-action models.

major comments (2)
  1. [Abstract and experimental results] The central performance claims rest on best-of-128 and held-out RFS numbers (9.14 and 8.211) without reported error bars, multiple random seeds, or ablation tables on the number and coverage of the eight rule-derived intents; this makes it impossible to determine whether the reported lift is statistically reliable or sensitive to intent definition.
  2. [§3.1 (intent definition) and §4 (evaluation)] The weakest assumption—that the eight rule-derived intents span all semantically distinct maneuver modes relevant for preference alignment—is load-bearing for the claim that intent-CFG expands the distribution sufficiently; no quantitative validation (e.g., mode coverage metrics or failure-case analysis) is supplied to test this premise.
minor comments (2)
  1. [§3] Notation for the flow-matching head and GRPO objective should be introduced with explicit equations rather than prose descriptions to allow direct comparison with prior flow-matching and GRPO formulations.
  2. [§4] The paper should clarify whether the reported RFS values are computed on the same held-out scenes for all methods and whether best-of-N selection uses the same preference model across baselines.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on statistical robustness and the coverage assumptions underlying our intent definitions. We address both major comments below and will revise the manuscript to strengthen the presentation of results.

Point-by-point responses
  1. Referee: [Abstract and experimental results] The central performance claims rest on best-of-128 and held-out RFS numbers (9.14 and 8.211) without reported error bars, multiple random seeds, or ablation tables on the number and coverage of the eight rule-derived intents; this makes it impossible to determine whether the reported lift is statistically reliable or sensitive to intent definition.

    Authors: We agree that error bars, multiple seeds, and intent ablations are necessary to establish reliability. In the revised manuscript we will report all key metrics (including best-of-128 RFS and held-out RFS) as means over three independent random seeds with standard-deviation error bars. We will also add an appendix ablation table comparing performance for 4, 8, and 12 intents to quantify sensitivity to the number and coverage of intent classes. revision: yes

  2. Referee: [§3.1 (intent definition) and §4 (evaluation)] The weakest assumption—that the eight rule-derived intents span all semantically distinct maneuver modes relevant for preference alignment—is load-bearing for the claim that intent-CFG expands the distribution sufficiently; no quantitative validation (e.g., mode coverage metrics or failure-case analysis) is supplied to test this premise.

    Authors: The eight intents are obtained via deterministic rule-based classification of trajectory features (curvature sign, lateral offset, speed profile) that are standard in the autonomous-driving literature. While we cannot exhaustively enumerate every conceivable semantic mode, the paper already shows via qualitative rollouts that intent-CFG produces maneuvers absent from the single-demonstration baseline. We will add explicit quantitative mode-coverage statistics (intent-distribution entropy and fraction of unique intents realized in best-of-N samples) together with a dedicated failure-case analysis of uncovered modes in the revised §4. revision: partial
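The coverage statistics promised in this response (intent-distribution entropy and the fraction of unique intents realized in best-of-N samples) could be computed along the following lines; the function and its inputs are an illustrative sketch, not the authors' implementation.

```python
import math
from collections import Counter

def intent_coverage_stats(sampled_intents, num_intent_classes=8):
    """Entropy of the realized intent distribution and fraction of classes reached.

    `sampled_intents` lists the intent label realized by each of the N samples
    drawn for one scene (e.g. the intent assigned to each best-of-N candidate).
    """
    counts = Counter(sampled_intents)
    n = len(sampled_intents)
    probs = [c / n for c in counts.values()]
    entropy = sum(-p * math.log(p) for p in probs)   # in nats; maximum is log(num_intent_classes)
    unique_fraction = len(counts) / num_intent_classes
    return entropy, unique_fraction

# A fully collapsed sampler realizes one intent out of eight:
print(intent_coverage_stats(["keep_lane"] * 128))    # (0.0, 0.125)
```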

Circularity Check

0 steps flagged

No significant circularity; claims rest on external empirical comparisons

full rationale

The paper presents DIAL as a two-stage method using intent-CFG for distribution expansion and multi-intent GRPO for preference optimization, with performance measured via direct comparisons to independent external references (RAP baseline at RFS 8.5, human demonstration at 8.13, and held-out metrics). The eight rule-derived intents are introduced as explicit inputs without redefinition in terms of outputs. No equations or steps reduce by construction to fitted parameters, self-citations, or ansatzes; the reported lifts (best-of-128 RFS 9.14, GRPO gain to 8.211) are statistical outcomes of experimentation against non-internal benchmarks. The derivation chain remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework rests on the assumption that rule-derived discrete intents are sufficient to cover distinct driving modes and that preference labels remain stable across those modes; no new physical entities or free parameters beyond standard RL hyperparameters are introduced in the abstract.

axioms (1)
  • domain assumption: Discrete intent labels derived from traffic rules span the semantically relevant maneuver modes for preference alignment.
    Invoked when the first stage conditions the flow-matching head on these labels to break mode collapse.

pith-pipeline@v0.9.0 · 5662 in / 1261 out tokens · 33048 ms · 2026-05-15T04:55:03.436500+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · 12 internal anchors

  1. [1]

    FlowQ: Energy-Guided Flow Policies for Offline Reinforcement Learning

    Marvin Alles, Nutan Chen, Patrick van der Smagt, and Botond Cseke. Flowq: Energy-guided flow policies for offline reinforcement learning.arXiv preprint arXiv:2505.14139,

  2. [2]

    Training Diffusion Models with Reinforcement Learning

    Kevin Black, Michael Janner, Yilun Du, Ilya Kostrikov, and Sergey Levine. Training diffusion models with reinforcement learning.arXiv preprint arXiv:2305.13301,

  3. [3]

    $\pi_0$: A Vision-Language-Action Flow Model for General Robot Control

    Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. π0: A vision-language-action flow model for general robot control.arXiv preprint arXiv:2410.24164,

  4. [4]

    Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction.arXiv preprint arXiv:1910.05449,

    Yuning Chai, Benjamin Sapp, Mayank Bansal, and Dragomir Anguelov. Multipath: Multiple probabilistic anchor trajectory hypotheses for behavior prediction.arXiv preprint arXiv:1910.05449,

  5. [5]

    Devil is in Narrow Policy: Unleashing Exploration in Driving VLA Models

    Canyu Chen, Yuguang Yang, Zhewen Tan, Yizhi Wang, Ruiyi Zhan, Haiyan Liu, Xuanyao Mao, Jason Bao, Xinyue Tang, Linlin Yang, et al. Devil is in narrow policy: Unleashing exploration in driving vla models.arXiv preprint arXiv:2603.06049,

  6. [6]

    VADv2: End-to-End Vectorized Autonomous Driving via Probabilistic Planning

    Shaoyu Chen, Bo Jiang, Hao Gao, Bencheng Liao, Qing Xu, Qian Zhang, Chang Huang, Wenyu Liu, and Xinggang Wang. Vadv2: End-to-end vectorized autonomous driving via probabilistic planning. arXiv preprint arXiv:2402.13243,

  7. [7]

    RAP: 3D rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333, 2025

    Lan Feng, Yang Gao, Eloi Zablocki, Quanyi Li, Wuyang Li, Sichao Liu, Matthieu Cord, and Alexandre Alahi. Rap: 3d rasterization augmented end-to-end planning.arXiv preprint arXiv:2510.04333,

  8. [8]

    Stylevla: Driving style-aware vision language action model for autonomous driving

    Yuan Gao, Dengyuan Hua, Mattia Piccinini, Finn Rasmus Schäfer, Korbinian Moller, Lin Li, and Johannes Betz. Stylevla: Driving style-aware vision language action model for autonomous driving. arXiv preprint arXiv:2603.09482,

  9. [9]

    MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    Victor Shea-Jay Huang, Le Zhuo, Yi Xin, Zhaokai Wang, Fu-Yun Wang, Yuchi Wang, Renrui Zhang, Peng Gao, and Hongsheng Li. Tide: Temporal-aware sparse autoencoders for interpretable diffusion transformers in image generation. InProceedings of the AAAI Conference on Artificial Intelligence, volume 40, pages 435–443, 2026a. Yuzhou Huang, Benjin Zhu, Hengtong ...

  10. [10]

    OpenVLA: An Open-Source Vision-Language-Action Model

    Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, et al. Openvla: An open-source vision-language-action model.arXiv preprint arXiv:2406.09246,

  11. [11]

    DriveVLA-W0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025

    Yingyan Li, Shuyao Shang, Weisong Liu, Bing Zhan, Haochen Wang, Yuqi Wang, Yuntao Chen, Xiaoman Wang, Yasong An, Chufeng Tang, et al. Drivevla-w0: World models amplify data scaling law in autonomous driving.arXiv preprint arXiv:2510.12796, 2025a. Yongkang Li, Kaixin Xiong, Xiangyu Guo, Fang Li, Sixu Yan, Gangwei Xu, Lijun Zhou, Long Chen, Haiyang Sun, Bin...

  12. [12]

    Flow Matching for Generative Modeling

    Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling.arXiv preprint arXiv:2210.02747,

  13. [13]

    Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928,

    Yuechen Luo, Fang Li, Shaoqing Xu, Yang Ji, Zehan Zhang, Bing Wang, Yuannan Shen, Jianwei Cui, Long Chen, Guang Chen, et al. Last-vla: Thinking in latent spatio-temporal space for vision-language-action in autonomous driving.arXiv preprint arXiv:2603.01928,

  14. [14]

    Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415, 2023a

    Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. Gpt-driver: Learning to drive with gpt.arXiv preprint arXiv:2310.01415,

  15. [15]

    Flow matching policy gradients.arXiv preprint arXiv:2507.21053,

    David McAllister, Songwei Ge, Brent Yi, Chung Min Kim, Ethan Weber, Hongsuk Choi, Haiwen Feng, and Angjoo Kanazawa. Flow matching policy gradients.arXiv preprint arXiv:2507.21053,

  16. [16]

    Nord: A data-efficient vision-language-action model that drives without reasoning.arXiv preprint arXiv:2602.21172,

    Ishaan Rawal, Shubh Gupta, Yihan Hu, and Wei Zhan. Nord: A data-efficient vision-language-action model that drives without reasoning.arXiv preprint arXiv:2602.21172,

  17. [17]

    Diffusion policy policy optimization.arXiv preprint arXiv:2409.00588, 2024

    Allen Z Ren, Justin Lidard, Lars L Ankile, Anthony Simeonov, Pulkit Agrawal, Anirudha Majumdar, Benjamin Burchfiel, Hongkai Dai, and Max Simchowitz. Diffusion policy policy optimization. arXiv preprint arXiv:2409.00588,

  18. [18]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms.arXiv preprint arXiv:1707.06347,

  19. [19]

    DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. Lmdrive: Closed-loop end-to-end driving with large language models. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 15120–15130, 2024a. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang,...

  20. [20]

    Learning Vision-Language-Action World Models for Autonomous Driving

    Guoqing Wang, Pin Tang, Xiangxuan Ren, Guodongfang Zhao, Bailan Feng, and Chao Ma. Learning vision-language-action world models for autonomous driving.arXiv preprint arXiv:2604.09059,

  21. [21]

    Diffusion Policies as an Expressive Policy Class for Offline Reinforcement Learning

    Zhendong Wang, Jonathan J Hunt, and Mingyuan Zhou. Diffusion policies as an expressive policy class for offline reinforcement learning.arXiv preprint arXiv:2208.06193,

  22. [22]

    Dilu: A knowledge-driven approach to autonomous driving with large language models

    Licheng Wen, Daocheng Fu, Xin Li, Xinyu Cai, Tao Ma, Pinlong Cai, Min Dou, Botian Shi, Liang He, and Yu Qiao. Dilu: A knowledge-driven approach to autonomous driving with large language models.arXiv preprint arXiv:2309.16292,

  23. [23]

    Latentvla: Efficient vision-language models for autonomous driving via latent action prediction

    Chengen Xie, Bin Sun, Tianyu Li, Junjie Wu, Zhihui Hao, XianPeng Lang, and Hongyang Li. Latentvla: Efficient vision-language models for autonomous driving via latent action prediction. arXiv preprint arXiv:2601.05611,

  24. [24]

    WOD-E2E: Waymo Open Dataset for End-to-End Driving in Challenging Long-Tail Scenarios

    Runsheng Xu, Hubert Lin, Wonseok Jeon, Hao Feng, Yuliang Zou, Liting Sun, John Gorman, Ekaterina Tolstaya, Sarah Tang, Brandyn White, et al. Wod-e2e: Waymo open dataset for end-to- end driving in challenging long-tail scenarios.arXiv preprint arXiv:2510.26125, 2025a. Yifang Xu, Jiahao Cui, Feipeng Cai, Zhihao Zhu, Hanlin Shang, Shan Luan, Mingwang Xu, Nen...

  25. [25]

    Vla-r1: Enhancing reasoning in vision-language-action models.arXiv preprint arXiv:2510.01623, 2025

    Angen Ye, Zeyu Zhang, Boyuan Wang, Xiaofeng Wang, Dapeng Zhang, and Zheng Zhu. Vla-r1: Enhancing reasoning in vision-language-action models.arXiv preprint arXiv:2510.01623,

  26. [26]

    Samoe-vla: A scene adaptive mixture-of-experts vision-language-action model for autonomous driving

    Zihan You, Hongwei Liu, Chenxu Dang, Zhe Wang, Sining Ang, Aoqi Wang, and Yan Wang. Samoe-vla: A scene adaptive mixture-of-experts vision-language-action model for autonomous driving. arXiv preprint arXiv:2603.08113,

  27. [27]

    Reasoning-vla: A fast and general vision-language-action reasoning model for autonomous driving.arXiv preprint arXiv:2511.19912, 2025a

    Dapeng Zhang, Zhenlong Yuan, Zhangquan Chen, Chih-Ting Liao, Yinda Chen, Fei Shen, Qingguo Zhou, and Tat-Seng Chua. Reasoning-vla: A fast and general vision-language-action reasoning model for autonomous driving.arXiv preprint arXiv:2511.19912, 2025a. Songyan Zhang, Wenhui Huang, Zhan Chen, Chua Jiahao Collister, Qihang Huang, and Chen Lv. Openread: Reinf...

  28. [28]

    AutoVLA: A Vision-Language-Action Model for End-to-End Autonomous Driving with Adaptive Reasoning and Reinforcement Fine-Tuning

    Zewei Zhou, Tianhui Cai, Seth Z Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. Autovla: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning.arXiv preprint arXiv:2506.13757,

  29. [29]

    Fine-Tuning Language Models from Human Preferences

    Daniel M Ziegler, Nisan Stiennon, Jeffrey Wu, Tom B Brown, Alec Radford, Dario Amodei, Paul Christiano, and Geoffrey Irving. Fine-tuning language models from human preferences.arXiv preprint arXiv:1909.08593,

  30. [30]

    DiffusionDriveV2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving.arXiv preprint arXiv:2512.07745, 2025

    Jialv Zou, Shaoyu Chen, Bencheng Liao, Zhiyu Zheng, Yuehao Song, Lefei Zhang, Qian Zhang, Wenyu Liu, and Xinggang Wang. Diffusiondrivev2: Reinforcement learning-constrained truncated diffusion modeling in end-to-end autonomous driving.arXiv preprint arXiv:2512.07745,
