pith. machine review for the scientific record.

arxiv: 2604.08535 · v1 · submitted 2026-04-09 · 💻 cs.RO · cs.CV

Recognition: unknown

Fail2Drive: Benchmarking Closed-Loop Driving Generalization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 17:15 UTC · model grok-4.3

classification 💻 cs.RO cs.CV
keywords closed-loop driving · generalization · distribution shift · CARLA simulator · autonomous driving benchmark · failure modes · paired routes · robustness evaluation

The pith

A paired-route benchmark in CARLA shows that state-of-the-art driving models suffer an average success-rate drop of 22.8 percent under distribution shift.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing closed-loop driving benchmarks fail to test true generalization because they reuse training scenarios at test time, allowing models to succeed through memorization instead of robust behavior. It introduces Fail2Drive with 200 routes and 17 new scenario classes that pair each shifted condition with an in-distribution match, isolating the effect of changes in appearance, layout, behavior, and robustness. Testing multiple models reveals consistent degradation plus specific failures such as ignoring objects visible in LiDAR and failing to distinguish free from occupied space. A sympathetic reader would care because reliable autonomous driving requires models that handle unseen conditions rather than overfit to familiar ones.

Core claim

The central claim is that generalization under distribution shift is a central bottleneck for closed-loop autonomous driving, which the authors demonstrate by creating Fail2Drive, the first paired-route benchmark in CARLA. Each of the 200 routes has an in-distribution counterpart, so performance differences can be attributed directly to the 17 scenario classes spanning appearance, layout, behavioral, and robustness shifts. Evaluation of state-of-the-art models shows an average success-rate drop of 22.8 percent, with analysis revealing unexpected failure modes including ignoring clearly visible LiDAR objects and failing to learn fundamental concepts of free and occupied space. The benchmark's open-source toolbox supports creating new scenarios and validating their solvability with a privileged expert policy.

What carries the argument

Fail2Drive benchmark of 200 paired routes across 17 scenario classes that isolates distribution-shift effects by matching each shifted route to an in-distribution counterpart.

If this is right

  • Evaluation of new driving models must include paired shifted and in-distribution routes to avoid overestimating generalization.
  • Training procedures should explicitly target robustness across the four shift categories (appearance, layout, behavioral, and robustness) rather than relying on memorization.
  • Models that ignore visible LiDAR objects or fail to represent free and occupied space require architectural or data changes focused on spatial reasoning.
  • The open-source toolbox allows creation and validation of additional scenario pairs to expand the benchmark.
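The paired-route accounting behind these implications can be sketched in a few lines. Everything below is a hypothetical illustration of the design, not the paper's data or code: the `RoutePair` structure, scenario names, and success rates are invented for the example.

```python
# Sketch: each shifted route is scored against its in-distribution
# counterpart, so the aggregate drop can be attributed to the shift itself.
from dataclasses import dataclass

@dataclass
class RoutePair:
    scenario_class: str   # one of the 17 scenario classes
    category: str         # appearance | layout | behavioral | robustness
    sr_in_dist: float     # success rate on the in-distribution route
    sr_shifted: float     # success rate on the paired shifted route

def average_success_rate_drop(pairs: list[RoutePair]) -> float:
    """Mean per-pair drop in success rate, in percentage points."""
    drops = [p.sr_in_dist - p.sr_shifted for p in pairs]
    return 100.0 * sum(drops) / len(drops)

# Illustrative values only (not the paper's measurements):
pairs = [
    RoutePair("BadParking", "layout", sr_in_dist=0.80, sr_shifted=0.55),
    RoutePair("Animals", "behavioral", sr_in_dist=0.90, sr_shifted=0.60),
    RoutePair("ObscuredStop", "appearance", sr_in_dist=0.70, sr_shifted=0.50),
]
print(f"avg drop: {average_success_rate_drop(pairs):.1f} pp")
```

Because the two routes in a pair differ only in the shifted factor, the per-pair drop is the quantity the benchmark treats as the generalization gap; averaging those drops yields a single headline figure like the paper's 22.8 percent.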

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The failure modes suggest current end-to-end models may lack basic spatial understanding that rule-based planners take for granted.
  • Similar paired-route designs could be adapted for real-vehicle testing to diagnose generalization before deployment.
  • The consistent degradation across models implies that scaling data or model size alone may not resolve these issues without targeted shift training.

Load-bearing premise

That the chosen scenario shifts in the CARLA simulator produce distribution changes representative of those encountered in real-world driving, rather than simulator-specific artifacts that would not occur outside simulation.

What would settle it

Running the same models in a higher-fidelity simulator, or in real-world closed-loop tests with matched in-distribution and shifted routes, and finding neither the consistent 22.8 percent success-rate drop nor the reported failure modes.

Figures

Figures reproduced from arXiv: 2604.08535 by Andreas Geiger, Katrin Renz, Simon Gerstenecker.

Figure 1. Overview: Fail2Drive introduces the first paired-route benchmark for closed-loop generalization on truly unseen long-tail scenarios in CARLA. It turns qualitative failures into measurable generalization gaps. Evaluating seven recent driving models exposes strong shortcut learning and missing fallback behavior, revealing where current approaches break and where progress is most needed.

Figure 2. Route diversity. Fail2Drive routes (blue) are diversely spread across Town13, covering a wide range of environments, and have little overlap with the official CARLA validation routes (red).

Figure 3. Category-wise generalization performance. Harmonic mean between Driving Score and Success Rate on the four scenario categories of Fail2Drive. The transparent part displays the drop in performance, and the darker parts indicate an increase in score. For TCP and UniAD, the reimplementations provided by [23], trained on the Bench2Drive dataset, are used.

Figure 6. (full-page figure; caption not extracted)

Figure 7. Obstacle failures. SimLingo and TransFuser++ fail to detect and avoid clearly visible obstacles. TransFuser++'s LiDAR visualisation shows the apparent obstacle in the LiDAR data and no bounding box prediction being made.

Figure 8. Perception failures. TransFuser++ and HiP-AD fail to correctly perceive the parked vehicle's orientation, defaulting to the orientation seen in CARLA demonstrations, but are thereby able to solve the scenario.

Figure 10. PlanT 2.0 robustness. PlanT performs an unnecessary avoidance maneuver for a construction site (left) and a mailbox that both do not block its path, risking vehicle collisions.
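Figure 3 aggregates Driving Score and Success Rate with a harmonic mean, which drags the combined score toward whichever metric is weaker. A minimal sketch of that aggregation; the example values are hypothetical, not the paper's numbers:

```python
def harmonic_mean(driving_score: float, success_rate: float) -> float:
    """Harmonic mean of two non-negative scores; zero if either is zero."""
    if driving_score == 0.0 or success_rate == 0.0:
        return 0.0
    return 2.0 * driving_score * success_rate / (driving_score + success_rate)

# The harmonic mean sits below the arithmetic mean and punishes imbalance:
print(harmonic_mean(0.7, 0.5))  # ~0.583 (arithmetic mean would be 0.6)
print(harmonic_mean(0.9, 0.0))  # 0.0 -- failing one metric zeroes the score
```

This choice means a model cannot compensate for a near-zero success rate with a high driving score, which suits a benchmark whose point is exposing outright failures.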
read the original abstract

Generalization under distribution shift remains a central bottleneck for closed-loop autonomous driving. Although simulators like CARLA enable safe and scalable testing, existing benchmarks rarely measure true generalization: they typically reuse training scenarios at test time. Success can therefore reflect memorization rather than robust driving behavior. We introduce Fail2Drive, the first paired-route benchmark for closed-loop generalization in CARLA, with 200 routes and 17 new scenario classes spanning appearance, layout, behavioral, and robustness shifts. Each shifted route is matched with an in-distribution counterpart, isolating the effect of the shift and turning qualitative failures into quantitative diagnostics. Evaluating multiple state-of-the-art models reveals consistent degradation, with an average success-rate drop of 22.8%. Our analysis uncovers unexpected failure modes, such as ignoring objects clearly visible in the LiDAR and failing to learn the fundamental concepts of free and occupied space. To accelerate follow-up work, Fail2Drive includes an open-source toolbox for creating new scenarios and validating solvability via a privileged expert policy. Together, these components establish a reproducible foundation for benchmarking and improving closed-loop driving generalization. We open-source all code, data, and tools at https://github.com/autonomousvision/fail2drive.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Fail2Drive, a new benchmark for closed-loop driving generalization in CARLA consisting of 200 routes across 17 scenario classes that introduce appearance, layout, behavioral, and robustness distribution shifts. Each shifted route is explicitly paired with an in-distribution counterpart to isolate the effect of the shift. Evaluation of multiple state-of-the-art models shows an average success-rate drop of 22.8%, accompanied by qualitative analysis of failure modes such as ignoring LiDAR-visible objects and failing to learn free/occupied space concepts. The work also releases an open-source toolbox for scenario generation and solvability validation using a privileged expert policy.

Significance. If the paired-route design validly isolates distribution-shift effects, Fail2Drive would provide a reproducible, quantitative foundation for diagnosing generalization failures in closed-loop autonomous driving, moving beyond memorization of training scenarios. The consistent degradation across models and the identification of specific failure modes (e.g., LiDAR object ignoring) offer actionable diagnostics, while the open-sourced code, data, and tools lower the barrier for follow-up work.

major comments (2)
  1. [§3 Benchmark Construction, §4 Experiments] The central claim that the 22.8% success-rate drop measures generalization under the intended shifts rests on the assumption that each shifted route differs from its in-distribution pair only in the target factor. The manuscript describes the pairing and provides a solvability-validation toolbox, but does not report quantitative expert-policy success rates on both members of each pair. Without this, it remains possible that some pairs introduce solvability differences that the models simply expose rather than pure generalization gaps.
  2. [§4.2 Failure Mode Analysis] The claims that models 'ignore objects clearly visible in the LiDAR' and 'fail to learn the fundamental concepts of free and occupied space' are presented as unexpected diagnostics. These would be strengthened by quantitative supporting metrics (e.g., frequency of such events across multiple runs, comparison against expert trajectories, or occlusion/visibility statistics) rather than relying primarily on qualitative examples, especially given CARLA's idealized ray-casting LiDAR.
minor comments (2)
  1. [Abstract and §4] The abstract states '200 routes' but the main text should clarify the exact distribution across the 17 classes and the number of evaluation episodes per route to allow readers to assess statistical reliability of the 22.8% aggregate figure.
  2. [§3] Notation for success rate and the precise definition of 'paired-route' matching criteria (e.g., how layout or behavior shifts are controlled while keeping other variables fixed) could be made more explicit in §3 to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. We address each major comment point by point below, indicating where revisions will be made to strengthen the paper.

read point-by-point responses
  1. Referee: [§3 Benchmark Construction, §4 Experiments] The central claim that the 22.8% success-rate drop measures generalization under the intended shifts rests on the assumption that each shifted route differs from its in-distribution pair only in the target factor. The manuscript describes the pairing and provides a solvability-validation toolbox, but does not report quantitative expert-policy success rates on both members of each pair. Without this, it remains possible that some pairs introduce solvability differences that the models simply expose rather than pure generalization gaps.

    Authors: We agree that explicitly reporting the expert-policy success rates on both members of each pair would provide stronger evidence that the observed drops reflect generalization gaps rather than solvability differences. The solvability-validation toolbox (using the privileged expert) was applied during benchmark construction to filter routes, but the manuscript does not include the per-pair quantitative rates. In the revised version, we will add a table reporting the expert success rates for all 200 routes (in-distribution and shifted pairs), confirming that solvability is comparable across pairs and isolating the effect of the distribution shifts. revision: yes

  2. Referee: [§4.2 Failure Mode Analysis] The claims that models 'ignore objects clearly visible in the LiDAR' and 'fail to learn the fundamental concepts of free and occupied space' are presented as unexpected diagnostics. These would be strengthened by quantitative supporting metrics (e.g., frequency of such events across multiple runs, comparison against expert trajectories, or occlusion/visibility statistics) rather than relying primarily on qualitative examples, especially given CARLA's idealized ray-casting LiDAR.

    Authors: We acknowledge that the failure-mode claims would benefit from quantitative backing beyond the qualitative examples. The analysis draws from observed behaviors across multiple model evaluations and runs, but the manuscript presents them illustratively. In the revision, we will incorporate quantitative metrics, including the frequency of LiDAR-visible object collisions (computed via post-hoc trajectory analysis with visibility checks), trajectory comparisons to the expert policy on free/occupied space violations, and basic occlusion statistics where feasible. This will complement the examples while noting the idealized nature of CARLA's LiDAR. revision: yes
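The per-pair solvability report promised in the first response could take roughly this shape. The pair identifiers, rates, and the 5-point tolerance below are hypothetical, chosen only to illustrate the check, not taken from the paper:

```python
# Sketch: run the privileged expert on both members of each route pair
# and flag pairs whose expert success rates diverge, since such pairs
# would confound solvability differences with generalization gaps.
def flag_unequal_pairs(expert_sr: dict, max_gap: float = 0.05) -> list:
    """expert_sr maps pair id -> (sr_in_dist, sr_shifted) for the expert."""
    return [pid for pid, (sr_id, sr_sh) in expert_sr.items()
            if abs(sr_id - sr_sh) > max_gap]

# Illustrative values only:
expert_sr = {
    "pair_001": (1.00, 1.00),  # solvable in both conditions
    "pair_002": (1.00, 0.80),  # shifted route harder even for the expert
}
print(flag_unequal_pairs(expert_sr))  # ['pair_002']
```

Pairs flagged this way would need rebalancing or exclusion before the model-level drop can be read as a pure generalization gap.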

Circularity Check

0 steps flagged

Empirical benchmark construction with no derivations or self-referential predictions

full rationale

The paper introduces a paired-route benchmark in CARLA for measuring closed-loop generalization, evaluates existing models on 200 routes across 17 scenario classes, and reports direct success-rate drops plus qualitative failure modes. No equations, parameter fits, uniqueness theorems, or ansatzes appear; the central claims rest on simulator measurements and external model performance rather than any derivation chain that reduces to its own inputs by construction. Self-citations are absent from the provided text, and the open-sourced toolbox is a reproducibility aid, not a load-bearing premise. This is a standard empirical benchmark paper whose results are falsifiable against the simulator and independent models.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The benchmark rests on domain assumptions about simulator fidelity and shift isolation rather than new mathematical axioms or fitted parameters; no invented physical entities are introduced.

axioms (2)
  • domain assumption CARLA simulator dynamics and sensor models are sufficiently faithful to real-world driving to diagnose generalization failures that would occur outside simulation.
    The entire evaluation pipeline depends on this proxy validity; the abstract invokes it implicitly when claiming diagnostic value for real autonomous driving.
  • domain assumption The pairing procedure isolates the intended distribution shift without introducing uncontrolled differences in route difficulty or solvability.
    The 22.8% degradation figure and failure-mode analysis rely on this isolation; the toolbox for validating solvability with a privileged expert is presented as mitigation.

pith-pipeline@v0.9.0 · 5515 in / 1527 out tokens · 98565 ms · 2026-05-10T17:15:25.752161+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems

    cs.RO 2026-05 unverdicted novelty 7.0

    MDrive benchmark shows multi-agent cooperative driving systems generally outperform single-agent ones in closed-loop settings but perception sharing does not always improve planning and negotiation can harm performanc...

Reference graph

Works this paper leans on

58 extracted references · 5 canonical work pages · cited by 1 Pith paper · 1 internal anchor

  1. [1] Hassan Abu Alhaija et al. Cosmos-Transfer1: Conditional world generation with adaptive multimodal control. arXiv preprint, 2025.

  2. [2] Holger Caesar, Varun Bankiti, Alex H. Lang, Sourabh Vora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yuxin Pan, Giancarlo Baldan, and Oscar Beijbom. nuScenes: A multimodal dataset for autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020.

  3. [3] Holger Caesar, Juraj Kabzan, Kok Seang Tan, Whye Kit Fong, Eric M. Wolff, Alex H. Lang, Luke Fletcher, Oscar Beijbom, and Sammy Omari. nuPlan: A closed-loop ML-based planning benchmark for autonomous vehicles. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2021.

  4. [4] CARLA Contributors. CARLA autonomous driving leaderboard 2.0. https://leaderboard.carla.org/.

  5. [5] Dian Chen and Philipp Krähenbühl. Learning from all vehicles. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  6. [6] Li Chen, Penghao Wu, Kashyap Chitta, Bernhard Jaeger, Andreas Geiger, and Hongyang Li. End-to-end autonomous driving: Challenges and frontiers. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024.

  7. [7] Kashyap Chitta, Aditya Prakash, and Andreas Geiger. NEAT: Neural attention fields for end-to-end autonomous driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

  8. [8] Felipe Codevilla, Antonio M. Lopez, Vladlen Koltun, and Alexey Dosovitskiy. On offline evaluation of vision-based driving models. In European Conference on Computer Vision (ECCV), 2018.

  9. [9] Felipe Codevilla, Eder Santana, Antonio M. López, and Adrien Gaidon. Exploring the limitations of behavior cloning for autonomous driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2019.

  10. [10] Alexander Cui, Abbas Sadat, Sergio Casas, Renjie Liao, and Raquel Urtasun. LookOut: Diverse multi-future prediction and planning for self-driving. In IEEE/CVF International Conference on Computer Vision (ICCV), 2021.

  11. [11] Marco F. Cusumano-Towner, David Hafner, Alexander Hertzberg, Brody Huval, Aleksei Petrenko, Eugene Vinitsky, Erik Wijmans, Taylor Killian, Stuart Bowers, Ozan Sener, Philipp Krähenbühl, and Vladlen Koltun. Robust autonomy emerges from self-play. arXiv preprint, 2502.03349, 2025.

  12. [12] Daniel Dauner, Marcel Hallgarten, Tianyu Li, Xinshuo Weng, Zhiyu Huang, Zetong Yang, Hongyang Li, Igor Gilitschenski, Boris Ivanovic, Marco Pavone, Andreas Geiger, and Kashyap Chitta. NAVSIM: Data-driven non-reactive autonomous vehicle simulation and benchmarking. In Advances in Neural Information Processing Systems (NeurIPS), 2024.

  13. [13] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator, 2017.

  14. [14] Alexey Dosovitskiy, German Ros, Felipe Codevilla, Antonio Lopez, and Vladlen Koltun. CARLA: An open urban driving simulator. In Conference on Robot Learning (CoRL), 2017.

  15. [15] Haoyu Fu, Diankun Zhang, Zongchuang Zhao, Jianfeng Cui, Dingkang Liang, Chong Zhang, Dingyuan Zhang, Hongwei Xie, Bing Wang, and Xiang Bai. ORION: A holistic end-to-end autonomous driving framework by vision-language instructed action generation. arXiv preprint, 2503.19755, 2025.

  16. [16] Simon Gerstenecker, Andreas Geiger, and Katrin Renz. PlanT 2.0: Exposing biases and structural flaws in closed-loop driving, 2025.

  17. [17] Marcel Hallgarten, Julián Zapata, Martin Stoll, Katrin Renz, and Andreas Zell. Can vehicle motion planning generalize to realistic long-tail scenarios? In IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS), 2024.

  18. [18] Isaac Han, Dong-Hyeok Park, and Kyung-Joong Kim. A new open-source off-road environment for benchmark generalization of autonomous driving. IEEE Access, 2021.

  19. [19] Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, et al. Planning-oriented autonomous driving. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2023.

  20. [20] Bernhard Jaeger, Kashyap Chitta, and Andreas Geiger. Hidden biases of end-to-end driving models. In IEEE/CVF International Conference on Computer Vision (ICCV), 2023.

  21. [21] Bernhard Jaeger, Kashyap Chitta, Daniel Dauner, Katrin Renz, and Andreas Geiger. Common Mistakes in Benchmarking Autonomous Driving. https://github.com/autonomousvision/carla_garage/blob/leaderboard_2/docs/common_mistakes_in_benchmarking_ad.md, 2024.

  22. [22] Bernhard Jaeger, Daniel Dauner, Jens Beißwenger, Simon Gerstenecker, Kashyap Chitta, and Andreas Geiger. CaRL: Learning scalable planning policies with simple rewards. arXiv preprint, 2504.17838, 2025.

  23. [23] Xiaosong Jia, Zhenjie Yang, Qifeng Li, Zhiyuan Zhang, and Junchi Yan. Bench2Drive: Towards multi-ability benchmarking of closed-loop end-to-end autonomous driving. In NeurIPS 2024 Datasets and Benchmarks Track, 2024.

  24. [24] Kashyap Chitta, Aditya Prakash, Bernhard Jaeger, Zehao Yu, Katrin Renz, and Andreas Geiger. TransFuser: Imitation with transformer-based sensor fusion for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023.

  25. [25] William Ljungbergh, Adam Tonderski, Joakim Johnander, Holger Caesar, Kalle Åström, Michael Felsberg, and Christoffer Petersson. NeuroNCAP: Photorealistic closed-loop safety testing for autonomous driving. In European Conference on Computer Vision (ECCV), 2024.

  26. [26] Yichong Lu, Yichi Cai, Shangzhan Zhang, Hongyu Zhou, Haoji Hu, Huimin Yu, Andreas Geiger, and Yiyi Liao. UrbanCAD: Towards highly controllable and photorealistic 3D vehicles for urban scene simulation. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR).

  27. [27] Kira Maag, Robin Chan, Svenja Uhlemeyer, Kamil Kowol, and Hanno Gottschalk. Two video data sets for tracking and retrieval of out of distribution objects. In Asian Conference on Computer Vision (ACCV), 2023.

  28. [28] Federico Nesti, Giulio Rossolini, Saasha Nair, Alessandro Biondi, and Giorgio Buttazzo. Evaluating the robustness of semantic segmentation for autonomous driving against real-world adversarial patch attacks. In IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), 2022.

  29. [29] Błażej Osiński, Piotr Miłoś, Adam Jakubowski, Paweł Zięcina, Michał Martyniak, Christopher Galias, Antonia Breuer, Silviu Homoceanu, and Henryk Michalewski. CARLA real traffic scenarios – novel training ground and benchmark for autonomous driving, 2021.

  30. [30] Katrin Renz, Long Chen, Elahe Arani, and Oleg Sinavski. SimLingo: Vision-only closed-loop autonomous driving with language-action alignment. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.

  31. [31] Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. DriveLM: Driving with graph visual question answering. In European Conference on Computer Vision (ECCV), 2024.

  32. [32] Naufal Suryanto, Yongsu Kim, Hyoeun Kang, Harashta Tatimma Larasati, Youngyeo Yun, Thi-Thu-Huong Le, Hunmin Yang, Se-Yoon Oh, and Howon Kim. DTA: Physical camouflage attacks using differentiable transformation network. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

  33. [33] Yingqi Tang, Zhuoran Xu, Zhaotie Meng, and Erkang Cheng. HiP-AD: Hierarchical and multi-granularity planning with deformable attention for autonomous driving in a single decoder, 2025.

  34. [34] Charles Thorpe, Martial H. Hebert, Takeo Kanade, and Steven A. Shafer. Vision and navigation for the Carnegie-Mellon Navlab. IEEE Transactions on Pattern Analysis and Machine Intelligence, 1988.

  35. [35] Penghao Wu, Xiaosong Jia, Li Chen, Junchi Yan, Hongyang Li, and Yu Qiao. Trajectory-guided control prediction for end-to-end autonomous driving: A simple yet strong baseline. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  36. [36] Chejian Xu, Wenhao Ding, Weijie Lyu, Zuxin Liu, Shuai Wang, Yihan He, Hanjiang Hu, Ding Zhao, and Bo Li. SafeBench: A benchmarking platform for safety evaluation of autonomous vehicles. In Advances in Neural Information Processing Systems (NeurIPS), 2022.

  37. [37] Wenda Xu, Jia Pan, Junqing Wei, and John M. Dolan. Motion planning under uncertainty for on-road autonomous driving. In IEEE International Conference on Robotics and Automation (ICRA), 2014.

  38. [38] Jiawei Zhang, Chejian Xu, and Bo Li. ChatScene: Knowledge-enabled safety-critical scenario generation for autonomous vehicles. In IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024.

  39. [39] Hongyu Zhou, Longzhong Lin, Jiabao Wang, Yichong Lu, Dongfeng Bai, Bingbing Liu, Yue Wang, Andreas Geiger, and Yiyi Liao. HUGSIM: A real-time, photo-realistic and closed-loop simulator for autonomous driving. arXiv preprint, 2412.01718, 2024.

  40. [40] Zewei Zhou, Tianhui Cai, Seth Z. Zhao, Yun Zhang, Zhiyu Huang, Bolei Zhou, and Jiaqi Ma. AutoVLA: A vision-language-action model for end-to-end autonomous driving with adaptive reasoning and reinforcement fine-tuning. arXiv preprint, 2506.13757, 2025.

  41. [41] Julian Zimmerlin, Jens Beißwenger, Bernhard Jaeger, Andreas Geiger, and Kashyap Chitta. Hidden biases of end-to-end driving datasets. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), 2024.

  42. [42]

    Unlike the standard CARLA parked-vehicle scenario, which always places the vehicle in the same position, our variant can be defined with any orientation, location and asset

    BadParking A parked vehicle partially occludes the ego lane. Unlike the standard CARLA parked-vehicle scenario, which always places the vehicle in the same position, our variant can be defined with any orientation, location and asset. This is meant to challenge models’ spatial un- derstanding with known obstacles. The standardParkedObstaclesce- nario serv...

  43. [43]

    The in-distribution sample is defined by the defaultConstructionObstacle

    ConstructionPermutations A modified version of the standardConstructionObstacle, where con- struction assets can be replaced or removed, isolating dependencies on specific parts of construction sites. The in-distribution sample is defined by the defaultConstructionObstacle

  44. [44]

    The obstacles can be defined by any number of CARLA assets at arbitrary locations and orientations, enabling testing of generalization to unseen objects and structures

    CustomObstacle Fully customizable obstacles block the road. The obstacles can be defined by any number of CARLA assets at arbitrary locations and orientations, enabling testing of generalization to unseen objects and structures. Depending on the obstacle size, aParkedObstacleorCon- structionObstacleis used as an in-distribution sample. 1

ObscuredStop: Occlusions are placed on stop signs when entering an intersection, challenging visual traffic-sign detection. Five different occlusions are included with Fail2Drive, and any CARLA asset can be used. The in-distribution sample is defined by including the scenario with no occlusion.

HardBrakeNoLights: The leading vehicle suddenly brakes with disabled brake lights, testing whether models can judge distance and deceleration without relying on this cue. The classic HardBrake scenario with active brake lights is used as the in-distribution sample.

RightOfWay: A custom vehicle takes the ego vehicle's priority while crossing a junction. Since CARLA includes this scenario only with emergency vehicles, our variations test whether models only yield to emergency vehicles or generalize to other traffic participants. The emergency-vehicle scenarios serve as the in-distribution sample.

Animals: An animal crosses the road, forcing the ego vehicle to react, testing whether models generalize to actors with appearances and shapes different from pedestrians. Fail2Drive introduces 17 animal assets that can be used for all pedestrian scenarios. By default, CARLA includes only pedestrians, which are used for the in-distribution scenario.

PedestrianOtherBlocker: A pedestrian emerges from behind an unseen object to cross the road, evaluating whether models overfit to expect pedestrians only from certain objects. The in-distribution scenario uses the default CARLA assets.

RightConstruction: A construction obstacle is placed outside the road to the right side, requiring no reaction from the ego vehicle. The scenario tests whether models react to known cues even when they are placed outside the relevant regions. The in-distribution sample includes no scenario.

OppositeConstruction: A construction site is placed in the opposite lane, requiring no reaction from the ego vehicle, again testing overfitting to scenario structures. The in-distribution sample includes no scenario.

ImageOnObject: A deceptive image is placed on an advertisement or a bus stop; the ego vehicle should not react to it. Images include a walking child at two scales and a red light, testing whether models can differentiate between these printed images and real objects. The in-distribution scenario does not include an image.

PassableObstacles: Objects are placed on or near the road, allowing the vehicle to pass by maintaining its lane. This tests models' ability to disregard irrelevant objects that do not affect driving behavior. The in-distribution scenario includes no obstacles.

PedestrianCrowd: A large number of pedestrians stand on the sidewalk while the ego vehicle passes or performs a scenario. Since pedestrians in CARLA v2 are only present when relevant to a scenario, models may learn to react strongly to their presence. The in-distribution sample is defined by the same scenarios without any pedestrians.

ConstructionPedestrian: While passing a construction site, a pedestrian crosses the road. This scenario requires the model to generalize to stopping during the overtaking maneuver, which is not shown during training. The default ConstructionObstacle without a pedestrian serves as the in-distribution sample.

PedestriansOnRoad: Pedestrians walk on the road in front of the ego vehicle, requiring deceleration or an evasive maneuver. This tests whether pedestrians are correctly identified and responded to in out-of-distribution scenarios. The in-distribution sample tests solving the underlying route without a scenario.

FullyBlocked: An object blocks the entire road, forcing the ego vehicle to stop and wait 60 seconds until the obstacle is removed and the vehicle can pass. While only passable objects are shown during training, this scenario tests whether models generalize to stopping and waiting at obstacles. The in-distribution sample uses no scenario and evaluates a model's...
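The pass condition above can be sketched as a simple trace check. This is a minimal sketch under stated assumptions, not the paper's evaluation code: the speed trace is assumed to start once the ego has reached the obstacle, and only premature movement before the 60 s removal is penalized (resuming afterwards is not checked here).

```python
# Hypothetical outcome check for FullyBlocked: the agent must hold
# near-zero speed until the obstacle is removed at removal_time_s.
# Names, thresholds, and the 1 Hz sampling are illustrative assumptions.
def fully_blocked_outcome(speed_trace_mps, dt_s=1.0,
                          removal_time_s=60.0, stop_speed_mps=0.1):
    for step, v in enumerate(speed_trace_mps):
        t = step * dt_s
        if t < removal_time_s and v > stop_speed_mps:
            # Moved while the road was still fully blocked.
            return "failure"
    return "success"

patient = [0.0] * 60 + [5.0] * 10    # waits, then drives on after removal
impatient = [0.0] * 10 + [5.0] * 60  # tries to push through the obstacle
print(fully_blocked_outcome(patient))    # success
print(fully_blocked_outcome(impatient))  # failure
```

A model that only ever saw passable objects in training tends to produce the `impatient` trace, which is exactly the failure mode this scenario probes.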

Wall: A large-scale wall with a printed image is placed on the road, requiring the agent to wait for 60 seconds until the obstacle is removed. In addition to waiting at the object, this scenario introduces highly deceptive visuals. Fail2Drive includes one brick wall and three walls with images of roads. The in-distribution route is again defined without...