{"paper":{"title":"ORION: A Holistic End-to-End Autonomous Driving Framework by Vision-Language Instructed Action Generation","license":"http://creativecommons.org/licenses/by/4.0/","headline":"ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.","cross_cats":[],"primary_cat":"cs.CV","authors_text":"Bing Wang, Chong Zhang, Diankun Zhang, Dingkang Liang, Dingyuan Zhang, Haoyu Fu, Hongwei Xie, Jianfeng Cui, Xiang Bai, Zongchuang Zhao","submitted_at":"2025-03-25T15:18:43Z","abstract_excerpt":"End-to-end (E2E) autonomous driving methods still struggle to make correct decisions in interactive closed-loop evaluation due to limited causal reasoning capability. Current methods attempt to leverage the powerful understanding and reasoning abilities of Vision-Language Models (VLMs) to resolve this dilemma. However, the problem is still open that few VLMs for E2E methods perform well in the closed-loop evaluation due to the gap between the semantic reasoning space and the purely numerical trajectory output in the action space. To tackle this issue, we propose ORION, a holistic E2E autonomou"},"claims":{"count":3,"items":[{"kind":"strongest_claim","text":"Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"The assumption that aligning the reasoning space of the LLM with the numerical action space through unified E2E optimization will reliably improve closed-loop causal reasoning and trajectory quality without introducing new failure modes.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"}],"snapshot_sha256":"b3c5ebaddc4da911d6e8ff99255b4994d061f624e73615cd55b80fa9f31f4af6"},"source":{"id":"2503.19755","kind":"arxiv","version":1},"verdict":{"id":"fc19b1ec-7fcb-4ffd-8c9c-fe7ab1c55490","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T08:04:53.553705Z","strongest_claim":"Our method achieves an impressive closed-loop performance of 77.74 Driving Score (DS) and 54.62% Success Rate (SR) on the challenge Bench2Drive datasets, which outperforms state-of-the-art (SOTA) methods by a large margin of 14.28 DS and 19.61% SR.","one_line_summary":"ORION reports 77.74 Driving Score and 54.62% Success Rate on Bench2Drive, outperforming prior end-to-end methods by 14.28 DS and 19.61% SR through unified VQA and planning optimization.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"The assumption that aligning the reasoning space of the LLM with the numerical action space through unified E2E optimization will reliably improve closed-loop causal reasoning and trajectory quality without introducing new failure modes.","pith_extraction_headline":""},"references":{"count":103,"sample":[{"doi":"","year":null,"title":"GPT-4 Technical Report","work_id":"b928e041-6991-4c08-8c81-0359e4097c7b","ref_index":1,"cited_arxiv_id":"2303.08774","is_internal_anchor":true},{"doi":"","year":2022,"title":"Flamingo: a visual language model for few-shot learning","work_id":"cc3738fb-da60-4a20-8b03-c795998bbe7a","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":2023,"title":"Gemini: A Family of Highly Capable Multimodal Models","work_id":"83f7c85b-3f11-450f-ac0c-64d9745220b2","ref_index":3,"cited_arxiv_id":"2312.11805","is_internal_anchor":true},{"doi":"","year":2023,"title":"Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond","work_id":"cbc2bb21-b6bb-46c0-80bf-107e195ffe10","ref_index":4,"cited_arxiv_id":"2308.12966","is_internal_anchor":true},{"doi":"","year":2023,"title":"Improving image generation with better captions","work_id":"31ac6ff5-6cb8-46dd-bbd4-7e8c4804cc01","ref_index":5,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":103,"snapshot_sha256":"568a0696150bb0ca2a4b00373b29c66f3f65f5bd367278f565fb0ecbe9a5cb58","internal_anchors":13},"formal_canon":{"evidence_count":2,"snapshot_sha256":"f7b1533bd7b23ee76bf376a8176ba65f5bcde82785c89854bebb18d564e2224e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}