VLADriveBench: Evaluating CoT-Action Relationship in VLA for Autonomous Driving
Pith reviewed 2026-06-27 09:34 UTC · model grok-4.3
The pith
VLADriveBench shows observational alignment and causal influence can diverge sharply in VLA driving models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VLADriveBench pairs four observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol. When run on three models, the two views conflict: one model achieves the highest observational alignment yet its CoT proves epiphenomenal, while another scores lower observationally yet its CoT exerts strong causal control, with visual salience modulating how much the reasoning affects the final action.
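The observational half of this claim can be made concrete with a small aggregation sketch. The paper's metric definitions are not reproduced on this page, so every field name and formula below is an assumption chosen for illustration: one plausible reading of "mentioning", "hallucination", "contradiction", and "action alignment", not the authors' implementation.

```python
from dataclasses import dataclass


@dataclass
class Sample:
    """One annotated driving sample. All fields are hypothetical."""
    scene_objects: set[str]      # objects annotated as relevant in the scene
    cot_objects: set[str]        # objects the CoT text refers to
    cot_action: str              # action the CoT says the vehicle will take
    executed_action: str         # action label derived from the trajectory
    cot_self_contradicts: bool   # annotator flag for internal inconsistency


def observational_scores(samples: list[Sample]) -> dict[str, float]:
    """Aggregate four observational rates over a set of samples (assumed formulas)."""
    if not samples:
        raise ValueError("need at least one sample")
    relevant = sum(len(s.scene_objects) for s in samples)
    mentioned = sum(len(s.cot_objects) for s in samples)
    hits = sum(len(s.scene_objects & s.cot_objects) for s in samples)
    n = len(samples)
    return {
        # share of relevant scene objects that the CoT mentions
        "mentioning": hits / relevant if relevant else 0.0,
        # share of mentioned objects that are absent from the scene
        "hallucination": (mentioned - hits) / mentioned if mentioned else 0.0,
        # share of samples whose CoT was flagged as self-contradictory
        "contradiction": sum(s.cot_self_contradicts for s in samples) / n,
        # share of samples where stated and executed actions match
        "action_alignment": sum(s.cot_action == s.executed_action for s in samples) / n,
    }
```

None of these rates involves changing the CoT, which is why a model can score well on all four while its reasoning text has no effect on the trajectory.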
What carries the argument
The CoT intervention protocol, which modifies generated reasoning and records consequent shifts in driving actions, used together with observational checks for relevance and consistency.
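A minimal sketch of what such an intervention could look like, assuming the model can be wrapped as a function from a scene and a CoT string to a trajectory. The `Policy` type, the displacement measure, and the two toy policies are hypothetical stand-ins; the paper's token-editing mechanics are not described on this page.

```python
import math
from typing import Callable

Trajectory = list[tuple[float, float]]        # (x, y) waypoints in metres
Policy = Callable[[str, str], Trajectory]     # (scene_id, cot_text) -> trajectory


def mean_displacement(a: Trajectory, b: Trajectory) -> float:
    """Average Euclidean distance between matching waypoints."""
    if len(a) != len(b) or not a:
        raise ValueError("trajectories must be non-empty and equally long")
    return sum(math.dist(p, q) for p, q in zip(a, b)) / len(a)


def intervention_shift(policy: Policy, scene_id: str,
                       original_cot: str, edited_cot: str) -> float:
    """Decode the action once per CoT while holding the scene fixed."""
    baseline = policy(scene_id, original_cot)
    intervened = policy(scene_id, edited_cot)
    return mean_displacement(baseline, intervened)


# Toy stand-ins for real models (hypothetical, for illustration only).
def cot_ignoring_policy(scene_id: str, cot: str) -> Trajectory:
    """Drives straight at 1 m per step regardless of the CoT text."""
    return [(float(t), 0.0) for t in range(1, 5)]


def cot_following_policy(scene_id: str, cot: str) -> Trajectory:
    """Stays in place if the CoT mentions stopping, otherwise drives straight."""
    step = 0.0 if "stop" in cot else 1.0
    return [(step * t, 0.0) for t in range(1, 5)]
```

With these toys, editing "keep speed" to "stop for pedestrian" leaves the first policy's trajectory unchanged and moves the second by 2.5 m on average. That is the epiphenomenal versus causal distinction in miniature.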
If this is right
- Trajectory-quality benchmarks alone can certify models whose reasoning does not affect their outputs.
- Visual salience acts as a gate on whether generated reasoning influences actions.
- Complementary causal checks are required to determine whether a model's explanations are functionally operative.
- Architecture and training differences can produce opposite relationships between observational scores and causal impact.
Where Pith is reading between the lines
- Safety arguments for VLA deployment would need to verify causal CoT influence rather than rely on alignment metrics.
- Training objectives that strengthen the link between salient visual features and reasoning steps could increase causal consistency.
- The same intervention approach could be applied to other embodied VLA tasks to test whether reasoning is epiphenomenal.
Load-bearing premise
Editing the chain-of-thought text isolates the CoT's causal effect on actions, free of confounds from differences in model architecture, training data, or implementation details.
What would settle it
A controlled test in which editing the CoT of the high-observational model produces no measurable change in its driving trajectories, or editing the CoT of the lower-scoring model produces large trajectory changes only when visual salience is high.
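Analytically, a test of this kind reduces to stratifying intervention outcomes by visual salience. The sketch below assumes each trial has already been reduced to a salience label and a trajectory shift in metres. Both the labels and the record layout are hypothetical, and the sketch does not prescribe how salience is measured.

```python
from collections import defaultdict
from statistics import mean


def mean_shift_by_salience(records: list[tuple[str, float]]) -> dict[str, float]:
    """Group intervention shifts by salience label and average each group.

    Each record is (salience_label, trajectory_shift_in_metres).
    """
    groups: dict[str, list[float]] = defaultdict(list)
    for salience, shift in records:
        groups[salience].append(shift)
    return {label: mean(values) for label, values in groups.items()}
```

Running this separately per model and comparing the per-stratum means shows whether CoT edits move the trajectory at all, and whether the size of the effect depends on salience.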
Original abstract
Vision-language-action (VLA) models generate chain-of-thought (CoT) reasoning alongside driving trajectories, but existing benchmarks evaluate only trajectory quality and do not assess whether the CoT is relevant, consistent, or causally connected to the driving action. We introduce VLADriveBench, a framework that combines observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to provide complementary views of the CoT-action relationship. Applying VLADriveBench to three models across two architectures, we find that the two analyses can diverge sharply: ORION scores highest on observational alignment yet its CoT is epiphenomenal, while Alpamayo v1.5 scores lower yet its CoT is strongly causal, with visual salience gating the extent of CoT influence.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VLADriveBench, a benchmark combining observational metrics (mentioning, hallucination, contradiction, action alignment) with a CoT intervention protocol to evaluate whether chain-of-thought reasoning in vision-language-action (VLA) models is relevant and causally connected to driving actions. Applied to three models across two architectures, it reports that observational and causal analyses diverge sharply, with ORION scoring highest on observational alignment but having epiphenomenal CoT, while Alpamayo v1.5 shows lower observational scores but strongly causal CoT, modulated by visual salience.
Significance. If the intervention protocol can be shown to isolate causal effects without confounds, the result would usefully demonstrate that observational metrics alone are insufficient for assessing CoT utility in VLA driving models and could inform safer model design. The work addresses a clear gap in existing trajectory-only benchmarks.
major comments (2)
- [CoT intervention protocol] The CoT intervention protocol (methods section): the headline divergence result between observational and causal analyses for ORION vs. Alpamayo v1.5 requires that the protocol isolates the causal contribution of CoT without confounding from architecture, training data, or generation differences. No validation experiments, control conditions, token-editing mechanics, or architecture-specific controls are described, so alternative explanations for the reported divergence cannot be ruled out.
- [Abstract and results] Abstract and results: metric definitions, statistical tests, and implementation details for the observational metrics and intervention outcomes are absent, preventing assessment of whether the data support the specific claims about epiphenomenal vs. causal CoT.
minor comments (1)
- [Abstract] The abstract would benefit from a brief sentence on the number of scenarios or trajectories used in the benchmark evaluation.
Simulated Author's Rebuttal
We thank the referee for their thoughtful review and constructive comments on our paper. We address each major comment below and indicate the revisions we plan to make.
Point-by-point responses
- Referee: [CoT intervention protocol] The CoT intervention protocol (methods section): the headline divergence result between observational and causal analyses for ORION vs. Alpamayo v1.5 requires that the protocol isolates the causal contribution of CoT without confounding from architecture, training data, or generation differences. No validation experiments, control conditions, token-editing mechanics, or architecture-specific controls are described, so alternative explanations for the reported divergence cannot be ruled out.
  Authors: We acknowledge the referee's concern regarding the need for validation of the CoT intervention protocol to ensure it isolates causal effects. The manuscript describes the protocol but lacks explicit validation experiments and detailed controls. In the revised version, we will expand the methods section to include descriptions of token-editing mechanics, control conditions, and architecture-specific considerations. We will also add a discussion of potential confounds from architecture and training data differences. This addresses the point by providing more transparency, though full empirical validation may require additional experiments beyond the current scope.
  Revision: partial
- Referee: [Abstract and results] Abstract and results: metric definitions, statistical tests, and implementation details for the observational metrics and intervention outcomes are absent, preventing assessment of whether the data support the specific claims about epiphenomenal vs. causal CoT.
  Authors: We agree that the absence of detailed metric definitions and implementation details in the abstract and results sections hinders evaluation. The revised manuscript will incorporate precise definitions for the observational metrics (mentioning, hallucination, contradiction, action alignment), specify the statistical tests used, and provide implementation details for the intervention outcomes. These changes will be made in the methods and results sections, with a brief mention in the abstract.
  Revision: yes
Circularity Check
No circularity; empirical benchmark without derivation chain
Full rationale
The paper introduces VLADriveBench as an empirical evaluation framework combining observational metrics and a CoT intervention protocol. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided abstract or description. The central claims rest on applying the benchmark to existing models rather than any self-referential construction or reduction of results to inputs by definition. This matches the default expectation for non-derivational empirical work.