pith. machine review for the scientific record.

arxiv: 2604.05378 · v1 · submitted 2026-04-07 · 💻 cs.CL · cs.CV

Recognition: 1 theorem link

ICR-Drive: Instruction Counterfactual Robustness for End-to-End Language-Driven Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:38 UTC · model grok-4.3

classification 💻 cs.CL cs.CV
keywords instruction robustness · language-conditioned driving · counterfactual evaluation · CARLA simulator · vision-language-action models · autonomous driving · embodied AI · failure modes

The pith

Minor instruction changes cause large performance drops and new failure modes in language-conditioned driving models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents ICR-Drive as a diagnostic framework that systematically generates variants of navigation instructions across four families—paraphrase, ambiguity, noise, and misleading—and replays identical CARLA routes under fixed simulator conditions to isolate the effect of language alone. Experiments on LMDrive and BEVDriver demonstrate that even small rephrasings or conflicting commands produce substantial drops on standard CARLA Leaderboard metrics and trigger distinct failure behaviors. A sympathetic reader would care because these models are intended for safety-critical deployment where human instructions are rarely precise or consistent, so unmeasured sensitivity to wording creates an unquantified risk. The work therefore shifts evaluation from assuming well-formed inputs to measuring robustness against realistic instruction variation.

Core claim

Instruction counterfactual robustness is under-measured in end-to-end language-driven autonomous driving; controlled perturbations of navigation commands induce large, measurable degradations and distinct failure modes when the same routes and seeds are replayed in CARLA.

What carries the argument

ICR-Drive, the diagnostic framework that generates four families of instruction variants and quantifies performance degradation by replaying matched CARLA routes.
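
A minimal sketch of the replay protocol this framework implies, in Python. Everything here is illustrative: Episode, run_route, and the record layout are hypothetical names, not the paper's code; the only load-bearing idea is that route and seed are frozen while the instruction text varies.

```python
# Hypothetical sketch of the ICR-Drive replay loop: fix the route and
# simulator seed, vary only the instruction, and record the same metrics
# for the baseline and every counterfactual variant.
from dataclasses import dataclass

FAMILIES = ["paraphrase", "ambiguity", "noise", "misleading"]

@dataclass
class Episode:
    route_id: str
    seed: int          # fixed: the environment is held constant across replays
    instruction: str   # the only quantity that changes between replays

def replay_suite(routes, baselines, variants, run_route):
    """Replay each route under the baseline and all variant instructions.

    run_route(episode) -> dict is assumed to drive the agent in CARLA and
    return Leaderboard-style metrics; its implementation is not shown here.
    """
    records = []
    for route_id, seed in routes:
        base = run_route(Episode(route_id, seed, baselines[route_id]))
        records.append({"route": route_id, "family": "baseline", **base})
        for family in FAMILIES:
            for text in variants[route_id][family]:
                metrics = run_route(Episode(route_id, seed, text))
                records.append({"route": route_id, "family": family, **metrics})
    return records
```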

If this is right

  • Standard benchmarks that assume precise instructions will overestimate real-world reliability of language-conditioned driving agents.
  • Deployment of these models requires either additional robustness training or runtime checks that detect conflicting or ambiguous commands.
  • Safety evaluation protocols should incorporate counterfactual instruction tests as a required dimension alongside visual and route perturbations.
  • Model developers can use the same replay methodology to isolate and mitigate specific failure modes induced by misleading instructions.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The results imply that future embodied driving models may benefit from explicit instruction-verification modules that can override or query misleading commands before execution.
  • Extending the perturbation approach to multi-turn dialogue or context-dependent instructions could reveal additional robustness gaps not captured by single-command tests.
  • If the observed drops persist across different simulators, the findings would motivate architecture-level changes rather than purely data-augmentation fixes.

Load-bearing premise

The four perturbation families together with the CARLA simulator setup are representative of the distribution of real-world human instructions and driving conditions.

What would settle it

Running the same models on a large set of human-written instruction variants collected from actual drivers, in either an extended simulator or a physical vehicle, and observing no significant metric degradation or new failure modes would falsify the claimed reliability gap.

Figures

Figures reproduced from arXiv: 2604.05378 by Can Cui, Kaiser Hamid, Nade Liang.

Figure 1
Figure 1: ICR-Drive overview. Environment stochasticity is controlled by fixing the CARLA route and simulator seed, while only the navigation text is varied via counterfactual instruction families (Paraphrase, Ambiguity, Noise, Misleading). Differences in closed-loop behavior and Leaderboard metrics quantify instruction robustness on LangAuto. view at source ↗
Figure 2
Figure 2: Qualitative closed-loop behavior under instruction counterfactuals. Each panel contrasts a counterfactual instruction (top) with the baseline instruction (bottom) executed on the same CARLA route with the same simulator seed. Left: front-camera view with the active instruction overlay. Right: bird's-eye view showing the ego trajectory and deviations from the planned route. view at source ↗
Figure 3
Figure 3: RD vs. TO termination signatures on LangAuto. view at source ↗
Figure 4
Figure 4: Prompt used for LLM-based counterfactual instruction generation (GPT-4o). The same prompt is applied once per perturbation family. view at source ↗
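
Figure 4 shows the generation side of the pipeline. As a rough sketch of what that step could look like, the snippet below sends the figure's system prompt (only its opening fragment is visible in the source) to GPT-4o via the OpenAI Python client; the JSON input/output schema is an assumption, not the paper's specification.

```python
# Hedged sketch of LLM-based counterfactual instruction generation.
# SYSTEM_PROMPT reproduces only the fragment visible in the source; the
# full prompt is in Figure 4. The JSON schema below is an assumption.
import json
from openai import OpenAI

SYSTEM_PROMPT = (
    "You are an instruction rewriter for vision-language-action models in "
    "autonomous driving. Your input is a JSON file containing route-level "
    "navigation instructions..."  # remainder truncated in the source
)

def generate_variants(instructions: dict, family: str) -> dict:
    """Request one perturbed variant per instruction for a single family."""
    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": json.dumps(
                {"family": family, "instructions": instructions})},
        ],
    )
    # Assumes the model returns a JSON object mapping route ids to rewrites.
    return json.loads(response.choices[0].message.content)
```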
read the original abstract

Recent progress in vision-language-action (VLA) models has enabled language-conditioned driving agents to execute natural-language navigation commands in closed-loop simulation, yet standard evaluations largely assume instructions are precise and well-formed. In deployment, instructions vary in phrasing and specificity, may omit critical qualifiers, and can occasionally include misleading, authority-framed text, leaving instruction-level robustness under-measured. We introduce ICR-Drive, a diagnostic framework for instruction counterfactual robustness in end-to-end language-conditioned autonomous driving. ICR-Drive generates controlled instruction variants spanning four perturbation families: Paraphrase, Ambiguity, Noise, and Misleading, where Misleading variants conflict with the navigation goal and attempt to override intent. We replay identical CARLA routes under matched simulator configurations and seeds to isolate performance changes attributable to instruction language. Robustness is quantified using standard CARLA Leaderboard metrics and per-family performance degradation relative to the baseline instruction. Experiments on LMDrive and BEVDriver show that minor instruction changes can induce substantial performance drops and distinct failure modes, revealing a reliability gap for deploying embodied foundation models in safety-critical driving.
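
The abstract's "per-family performance degradation relative to the baseline instruction" suggests a simple aggregation. A hedged sketch, assuming the record layout from the replay sketch above and an illustrative metric name (driving_score); the paper's exact formula is not given here.

```python
# Per-family relative degradation versus the baseline instruction on the
# same route. Metric name and aggregation are illustrative assumptions.
from collections import defaultdict
from statistics import mean

def per_family_degradation(records, metric="driving_score"):
    """records: dicts with 'route', 'family', and metric keys (see sketch above)."""
    baseline = {r["route"]: r[metric] for r in records if r["family"] == "baseline"}
    drops = defaultdict(list)
    for r in records:
        if r["family"] == "baseline":
            continue
        base = baseline[r["route"]]
        drops[r["family"]].append((base - r[metric]) / max(base, 1e-9))
    return {family: mean(values) for family, values in drops.items()}
```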

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces the ICR-Drive diagnostic framework to evaluate instruction counterfactual robustness in end-to-end language-conditioned autonomous driving models. It defines four families of controlled perturbations (Paraphrase, Ambiguity, Noise, Misleading) applied to navigation instructions, replays fixed CARLA routes with matched seeds and configurations on LMDrive and BEVDriver, and reports performance degradation on standard CARLA Leaderboard metrics together with distinct failure modes, concluding that this reveals a reliability gap for safety-critical deployment of embodied foundation models.

Significance. If the empirical results hold under the stated conditions, the work provides a useful, reproducible diagnostic for an underexplored robustness dimension in vision-language-action driving agents. Credit is due for the controlled replay design that isolates instruction effects, the application to two existing models, and the use of standard CARLA metrics rather than new ad-hoc scores. The framework itself is a concrete contribution that future work can extend.

major comments (3)
  1. [§4] §4 (Experiments) and the perturbation-generation procedure: the central claim that observed performance drops demonstrate a genuine reliability gap for deployment rests on the assumption that the four synthetic families (especially Misleading variants that 'conflict with the navigation goal') are representative of real-world human instruction distributions. No quantitative validation, human-subject study, or distributional comparison is provided to support this proxy, which directly affects the strength of the deployment-gap conclusion.
  2. [§3.2] §3.2 (CARLA replay setup) and §4.1: the isolation of instruction effects via matched seeds and fixed routes is a strength, yet the idealized CARLA perception and dynamics (no real sensor noise, simplified traffic) are not shown to preserve the failure modes under equivalent real-world variability; this is load-bearing for the safety-critical claim.
  3. [Table 2] Table 2 / §4.3 (per-family degradation results): without reported statistical tests, confidence intervals, or raw per-run data, it is unclear whether the 'substantial performance drops' are robust to seed variation or metric definition choices, weakening the quantitative support for the reliability-gap assertion.
minor comments (2)
  1. [Abstract] The abstract and §3.1 refer to 'standard CARLA Leaderboard metrics' without an explicit list or citation to the precise Leaderboard version and metric definitions used; adding this would improve reproducibility.
  2. [§3] Notation for the four perturbation families is introduced in §3 but not consistently cross-referenced in the results tables; a small legend or column header clarification would help.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive review. The comments highlight important considerations for strengthening the claims around our diagnostic framework. We address each major comment below and indicate the revisions we will incorporate.

read point-by-point responses
  1. Referee: [§4] §4 (Experiments) and the perturbation-generation procedure: the central claim that observed performance drops demonstrate a genuine reliability gap for deployment rests on the assumption that the four synthetic families (especially Misleading variants that 'conflict with the navigation goal') are representative of real-world human instruction distributions. No quantitative validation, human-subject study, or distributional comparison is provided to support this proxy, which directly affects the strength of the deployment-gap conclusion.

    Authors: We acknowledge that the perturbations are synthetically constructed and that the manuscript does not include a human-subject study or quantitative distributional comparison against real-world instruction corpora. The four families were designed to cover linguistically motivated variations (paraphrasing, underspecification, surface noise, and goal-conflicting overrides) that could plausibly arise in human-vehicle interaction, with the Misleading family explicitly intended to probe override scenarios. The primary contribution is the controlled diagnostic itself rather than a claim of exact frequency matching. In the revision we will add an explicit Limitations subsection that (i) states the synthetic nature of the perturbations, (ii) notes the absence of ecological-validity validation, and (iii) qualifies the deployment-gap language to emphasize that the observed sensitivity warrants further real-world investigation. This change will be reflected in both the abstract and conclusion. revision: partial

  2. Referee: [§3.2] §3.2 (CARLA replay setup) and §4.1: the isolation of instruction effects via matched seeds and fixed routes is a strength, yet the idealized CARLA perception and dynamics (no real sensor noise, simplified traffic) are not shown to preserve the failure modes under equivalent real-world variability; this is load-bearing for the safety-critical claim.

    Authors: We agree that CARLA provides a simplified perceptual and dynamic environment. The replay protocol with matched seeds and routes was chosen precisely to isolate instruction effects from other sources of variability; this design choice is a deliberate strength for the diagnostic goal. We do not claim that the precise failure modes or degradation magnitudes will transfer unchanged to real-world sensor noise and complex traffic. In the revised manuscript we will expand the discussion in §3.2 and add a paragraph in the conclusion that (i) reiterates the controlled-simulation scope and (ii) explicitly states that validation on physical platforms or higher-fidelity simulators remains necessary before safety-critical conclusions can be drawn. revision: partial

  3. Referee: [Table 2] Table 2 / §4.3 (per-family degradation results): without reported statistical tests, confidence intervals, or raw per-run data, it is unclear whether the 'substantial performance drops' are robust to seed variation or metric definition choices, weakening the quantitative support for the reliability-gap assertion.

    Authors: We accept this criticism. Although the original experiments used fixed seeds for reproducibility across instruction variants, variance across multiple runs and formal statistical tests were not reported. In the revision we will (i) rerun the evaluation suite with at least five independent seeds per condition where computationally feasible, (ii) report means with standard deviations or 95% confidence intervals in Table 2, (iii) add paired statistical tests (e.g., Wilcoxon signed-rank) to quantify the significance of per-family degradations, and (iv) release the per-run metric tables as supplementary material. These additions will directly address concerns about robustness to seed variation and metric sensitivity. revision: yes
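
The analysis promised in this response is standard enough to sketch. Assuming per-route scores aligned across conditions, a paired Wilcoxon signed-rank test plus a normal-approximation confidence interval on the mean drop would look roughly like this (scipy and numpy; not the authors' code):

```python
# Paired comparison of baseline vs. perturbed scores, as the rebuttal
# proposes: Wilcoxon signed-rank test and a 95% CI on the mean degradation.
import numpy as np
from scipy import stats

def paired_family_test(baseline_scores, perturbed_scores, alpha=0.05):
    """Inputs are score arrays aligned by (route, seed) pair."""
    base = np.asarray(baseline_scores, dtype=float)
    pert = np.asarray(perturbed_scores, dtype=float)
    diffs = base - pert                       # positive values = degradation
    stat, p = stats.wilcoxon(base, pert)      # paired, non-parametric
    se = diffs.std(ddof=1) / np.sqrt(len(diffs))
    half_width = stats.norm.ppf(1 - alpha / 2) * se
    return {"mean_drop": float(diffs.mean()),
            "ci95": (float(diffs.mean() - half_width),
                     float(diffs.mean() + half_width)),
            "wilcoxon_p": float(p)}
```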

Circularity Check

0 steps flagged

No significant circularity in empirical evaluation framework

full rationale

The paper introduces a diagnostic framework with four perturbation families and evaluates existing models (LMDrive, BEVDriver) via direct replay of fixed CARLA routes using standard Leaderboard metrics. No equations, fitted parameters, or self-referential definitions appear in the provided text. No load-bearing self-citations, uniqueness theorems, or ansatzes are invoked to justify core claims. Results are straightforward measurements of performance degradation under instruction variants, independent of any construction that reduces predictions to inputs by definition. This is a standard empirical setup without circular derivation steps.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entity

The central claim depends on the representativeness of the synthetic perturbations and the fidelity of the CARLA simulator; no free parameters are mentioned, but domain assumptions about simulation-to-reality transfer are implicit.

axioms (2)
  • domain assumption CARLA simulator with matched seeds and routes isolates instruction effects from other sources of variance
    The evaluation protocol relies on this to attribute performance changes solely to instruction language.
  • domain assumption The four perturbation families capture the relevant range of real-world instruction variations
    The framework's diagnostic value rests on this coverage claim.
invented entities (1)
  • ICR-Drive framework · no independent evidence
    purpose: Diagnostic tool for measuring instruction counterfactual robustness
    New evaluation method introduced by the paper.

pith-pipeline@v0.9.0 · 5489 in / 1347 out tokens · 46948 ms · 2026-05-10T19:38:30.241618+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Vision-Language-Action Safety: Threats, Challenges, Evaluations, and Mechanisms

    cs.RO · 2026-04 · accept · novelty 4.0

    A literature survey that unifies fragmented work on attacks, defenses, evaluations, and deployment challenges for Vision-Language-Action models in robotics.

Reference graph

Works this paper leans on

25 extracted references · 4 canonical work pages · cited by 1 Pith paper · 2 internal anchors

  1. [1]

    RT-1: Robotics Transformer for Real-World Control at Scale

    Anthony Brohan, Noah Brown, Justice Carbajal, Yevgen Chebotar, Joseph Dabis, Chelsea Finn, Keerthana Gopalakrishnan, Karol Hausman, Alex Herzog, Jasmine Hsu, et al. RT-1: Robotics Transformer for real-world control at scale. arXiv preprint arXiv:2212.06817, 2022.

  2. [2]

    nuScenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving, 2020.

  3. [3]

    Personalized autonomous driving with large language models: Field experiments

    Can Cui, Zichong Yang, Yupeng Zhou, Yunsheng Ma, Juanwu Lu, and Ziran Wang. Personalized autonomous driving with large language models: Field experiments. 2023. URL https://api.semanticscholar.org/CorpusID:266335383.

  4. [4]

    CARLA: An open urban driving simulator

    Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Conference on Robot Learning, pages 1–16. PMLR, 2017.

  5. [5]

    Benchmarking neural network robustness to common corruptions and perturbations

    Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations, 2019.

  6. [6]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17853–17862, 2023.

  7. [7]

    Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving

    Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. Advances in Neural Information Processing Systems, 37:819–844, 2024.

  8. [8]

    Adaptive stress testing for autonomous vehicles

    Mark Koren, Saud Alsaif, Ritchie Lee, and Mykel J Kochenderfer. Adaptive stress testing for autonomous vehicles. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1–7. IEEE, 2018.

  9. [9]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. Advances in Neural Information Processing Systems, 36:34892–34916, 2023.

  10. [10]

    GPT-Driver: Learning to drive with GPT

    Jiageng Mao, Yuxi Qian, Junjie Ye, Hang Zhao, and Yue Wang. GPT-Driver: Learning to drive with GPT. arXiv preprint arXiv:2310.01415, 2023.

  11. [11]

    Open X-Embodiment: Robotic learning datasets and RT-X models

    Abby O'Neill, Abdul Rehman, Abhiram Maddukuri, Abhishek Gupta, Abhishek Padalkar, Abraham Lee, Acorn Pooley, Agrim Gupta, Ajay Mandlekar, Ajinkya Jain, et al. Open X-Embodiment: Robotic learning datasets and RT-X models. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 6892–6903. IEEE, 2024.

  12. [12]

    LeGo-Drive: Language-enhanced goal-oriented closed-loop end-to-end autonomous driving

    Pranjal Paul, Anant Garg, Tushar Choudhary, Arun Kumar Singh, and K Madhava Krishna. LeGo-Drive: Language-enhanced goal-oriented closed-loop end-to-end autonomous driving. In 2024 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), pages 10020–10026. IEEE, 2024.

  13. [13]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.

  14. [14]

    Beyond accuracy: Behavioral testing of NLP models with CheckList

    Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, and Sameer Singh. Beyond accuracy: Behavioral testing of NLP models with CheckList. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pages 4902–4912, 2020.

  15. [15]

    LMDrive: Closed-loop end-to-end driving with large language models

    Hao Shao, Yuxuan Hu, Letian Wang, Guanglu Song, Steven L Waslander, Yu Liu, and Hongsheng Li. LMDrive: Closed-loop end-to-end driving with large language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 15120–15130, 2024.

  16. [16]

    DriveLM: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In European Conference on Computer Vision, pages 256–274. Springer, 2024.

  17. [17]

    DriveVLM: The Convergence of Autonomous Driving and Large Vision-Language Models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. DriveVLM: The convergence of autonomous driving and large vision-language models. arXiv preprint arXiv:2402.12289, 2024.

  18. [18]

    Simulation-based adversarial test generation for autonomous vehicles with machine learning components

    Cumhur Erkan Tuncali, Georgios Fainekos, Hisahiro Ito, and James Kapinski. Simulation-based adversarial test generation for autonomous vehicles with machine learning components. In 2018 IEEE Intelligent Vehicles Symposium (IV), pages 1555–1562. IEEE, 2018.

  19. [19]

    OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning

    Shihao Wang, Zhiding Yu, Xiaohui Jiang, Shiyi Lan, Min Shi, Nadine Chang, Jan Kautz, Ying Li, and Jose M Alvarez. OmniDrive: A holistic vision-language dataset for autonomous driving with counterfactual reasoning. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 22442–22452, 2025.

  20. [20]

    BEVDriver: Leveraging BEV maps in LLMs for robust closed-loop driving

    Katharina Winter, Mark Azer, and Fabian B Flohr. BEVDriver: Leveraging BEV maps in LLMs for robust closed-loop driving. arXiv preprint arXiv:2503.03074, 2025.

  21. [21]

    Are VLMs ready for autonomous driving? An empirical study from the reliability, data and metric perspectives

    Shaoyuan Xie, Lingdong Kong, Yuhao Dong, Chonghao Sima, Wenwei Zhang, Qi Alfred Chen, Ziwei Liu, and Liang Pan. Are VLMs ready for autonomous driving? An empirical study from the reliability, data and metric perspectives. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 6585–6597, 2025.

  22. [22]

    SafeBench: A benchmarking platform for safety evaluation of autonomous vehicles

    Chejian Xu, Wenhao Ding, Weijie Lyu, Zuxin Liu, Shuai Wang, Yihan He, Hanjiang Hu, Ding Zhao, and Bo Li. SafeBench: A benchmarking platform for safety evaluation of autonomous vehicles. Advances in Neural Information Processing Systems, 35:25667–25682, 2022.

  23. [23]

    DriveGPT4: Interpretable end-to-end autonomous driving via large language model

    Zhenhua Xu, Yujia Zhang, Enze Xie, Zhen Zhao, Yong Guo, Kwan-Yee K Wong, Zhenguo Li, and Hengshuang Zhao. DriveGPT4: Interpretable end-to-end autonomous driving via large language model. IEEE Robotics and Automation Letters, 9(10):8186–8193, 2024.

  24. [24]

    Learning to prompt for vision-language models

    Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348, 2022.

  25. [25]

    PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts

    Kaijie Zhu, Jindong Wang, Jiaheng Zhou, Zichen Wang, Hao Chen, Yidong Wang, Linyi Yang, Wei Ye, Yue Zhang, Neil Gong, et al. PromptRobust: Towards evaluating the robustness of large language models on adversarial prompts. In Proceedings of the 1st ACM Workshop on Large AI Systems and Models with Privacy and Safety Analysis, pages 57–68.
