pith. machine review for the scientific record.

arxiv: 2604.02710 · v1 · submitted 2026-04-03 · 💻 cs.RO · cs.AI · cs.CV

Recognition: no theorem link

V2X-QA: A Comprehensive Reasoning Dataset and Benchmark for Multimodal Large Language Models in Autonomous Driving Across Ego, Infrastructure, and Cooperative Views

Bin Ran, Jiaxi Liu, Junwei You, Pei Li, Rui Gan, Sikai Chen, Weizhe Tang, Yan Zhao, Zhuoyu Jiang, Zilin Huang

Pith reviewed 2026-05-13 20:35 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords V2X-QA · autonomous driving · multimodal large language models · benchmark · cooperative driving · infrastructure view · viewpoint reasoning · MCQA dataset

The pith

A new benchmark shows viewpoint access changes how multimodal models reason about traffic scenes from vehicle, infrastructure, and cooperative perspectives.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents V2X-QA, a real-world dataset and benchmark for testing multimodal large language models on autonomous driving tasks under vehicle-only, infrastructure-only, and combined viewpoints. It uses a view-decoupled multiple-choice question answering setup organized into a twelve-task taxonomy that covers perception, prediction, and planning. Experiments with ten current models reveal that infrastructure views improve understanding of overall traffic flow while cooperative reasoning stays difficult because it requires aligning and integrating evidence across different camera angles rather than simply receiving more images. The authors also release V2X-MoE, a baseline model that routes inputs by view and uses separate experts for each viewpoint to handle these differences.

Core claim

V2X-QA demonstrates that viewpoint accessibility substantially affects MLLM performance in autonomous driving, infrastructure-side reasoning supports meaningful macroscopic traffic understanding, and cooperative reasoning remains challenging because it requires cross-view alignment and evidence integration rather than simply additional visual input. The introduced V2X-MoE baseline with explicit view routing and viewpoint-specific LoRA experts achieves stronger results, indicating that viewpoint specialization improves multi-view reasoning.
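
The paper's released implementation is not reproduced here, but the routing idea can be made concrete. Below is a minimal PyTorch sketch, under assumed names and hyperparameters (ViewRoutedLoRALinear, rank 8, alpha 16), of hard view routing over viewpoint-specific LoRA adapters on a frozen projection; it illustrates the mechanism attributed to V2X-MoE, not the authors' code.

```python
import torch
import torch.nn as nn

VIEWS = ("vehicle", "infrastructure", "cooperative")

class ViewRoutedLoRALinear(nn.Module):
    """Frozen base projection plus one low-rank (LoRA) expert per viewpoint.

    A hard router selects exactly one expert from the view label, mirroring
    the 'explicit view routing + viewpoint-specific LoRA experts' idea
    described for V2X-MoE. Illustrative sketch only.
    """

    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():        # backbone stays frozen
            p.requires_grad_(False)
        d_in, d_out = base.in_features, base.out_features
        self.scaling = alpha / rank
        # One (A, B) low-rank pair per viewpoint expert; only these train.
        self.lora_A = nn.ParameterDict(
            {v: nn.Parameter(torch.randn(rank, d_in) * 0.01) for v in VIEWS}
        )
        self.lora_B = nn.ParameterDict(
            {v: nn.Parameter(torch.zeros(d_out, rank)) for v in VIEWS}
        )

    def forward(self, x: torch.Tensor, view: str) -> torch.Tensor:
        assert view in VIEWS, f"unknown view label: {view}"
        delta = x @ self.lora_A[view].T @ self.lora_B[view].T * self.scaling
        return self.base(x) + delta


# Usage: route a batch of hidden states through the infrastructure expert.
layer = ViewRoutedLoRALinear(nn.Linear(1024, 1024), rank=8)
hidden = torch.randn(2, 37, 1024)
out = layer(hidden, view="infrastructure")
```

Hard routing keeps exactly one expert active per sample, so the per-view adapters never mix gradients; that is one plausible reading of why explicit viewpoint specialization would help on the cooperative split.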

What carries the argument

The view-decoupled evaluation protocol in V2X-QA that enables controlled comparisons of vehicle-only, infrastructure-only, and cooperative conditions inside a single MCQA framework.
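
To make that protocol concrete, the sketch below scores the same MCQA items under each of the three viewpoint conditions and reports per-condition, per-task accuracy. The item fields and the answer_fn callable are assumptions for illustration, not the released evaluation harness.

```python
from collections import defaultdict

CONDITIONS = ("vehicle_only", "infrastructure_only", "cooperative")

def evaluate_by_view(items, answer_fn):
    """Score one model on identical MCQA items under each viewpoint condition.

    `items` is an iterable of dicts with hypothetical fields:
      {"question", "options", "answer", "images": {condition: [paths]}, "task"}
    `answer_fn(question, options, images)` returns the chosen option index.
    Returns accuracy keyed by (condition, task). Illustrative sketch only.
    """
    correct, total = defaultdict(int), defaultdict(int)
    for item in items:
        for cond in CONDITIONS:
            pred = answer_fn(item["question"], item["options"], item["images"][cond])
            key = (cond, item["task"])
            total[key] += 1
            correct[key] += int(pred == item["answer"])
    return {k: correct[k] / total[k] for k in total}
```

Because the question and answer key are held fixed while only the visual evidence changes, any accuracy gap between conditions can be attributed to viewpoint access rather than to differing question sets.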

If this is right

  • Infrastructure views enable better macroscopic traffic understanding than vehicle views alone.
  • Cooperative reasoning demands explicit cross-view alignment and evidence integration beyond adding more visual inputs.
  • Models with explicit view routing and viewpoint-specific experts can achieve higher performance on multi-view tasks.
  • The twelve-task taxonomy supports fine-grained diagnosis of model strengths across perception, prediction, and planning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Connected autonomous driving systems may gain reliability by incorporating infrastructure data sources for broader scene context.
  • New model architectures focused on cross-view fusion could be developed to overcome current cooperative reasoning limits.
  • The benchmark can be extended to test physical intelligence in multi-agent scenarios that go beyond single-vehicle planning.

Load-bearing premise

Expert-verified MCQA annotations and the twelve-task taxonomy accurately and comprehensively capture real-world viewpoint-dependent capabilities without introducing annotation biases or coverage gaps.

What would settle it

Run the same ten models on a fresh collection of real-world driving scenes with independently collected ground-truth labels and check whether the reported performance gaps between vehicle, infrastructure, and cooperative conditions remain consistent.

Figures

Figures reproduced from arXiv: 2604.02710 by Bin Ran, Jiaxi Liu, Junwei You, Pei Li, Rui Gan, Sikai Chen, Weizhe Tang, Yan Zhao, Zhuoyu Jiang, Zilin Huang.

Figure 1
Figure 1: Overview of V2X-QA. Left: representative examples of the twelve viewpoint-aligned tasks under vehicle-side, infrastructure-side, and cooperative settings. Right: MCQA-based training and evaluation pipeline of V2X-MoE on the V2X-QA training and testing splits.
Figure 2
Figure 2: Construction pipeline of V2X-QA. The pipeline starts from the twelve viewpoint-aligned tasks under three settings, then proceeds to MCQA task bank construction, model-assisted answer selection with human verification, data split generation, and standardized benchmarking across proprietary and open-source MLLMs.
Figure 3
Figure 3: Statistics of V2X-QA. Left: overall task and functional distribution across the twelve viewpoint-aligned tasks. Right: distribution of perception, prediction, and reasoning & planning samples under vehicle-side, infrastructure-side, and cooperative views.
Figure 4
Figure 4: Answer-distribution diagnostics of V2X-QA. Left: distribution of correct option positions across the twelve tasks. Middle: majority-class ratio for each task. Right: answer entropy for each task.
Figure 5
Figure 5: Overall architecture and staged training pipeline of V2X-MoE. The model takes viewpoint-aligned visual evidence and MCQA prompts as input, uses a Qwen3-VL multimodal processor and a shared frozen backbone, activates one viewpoint-specific LoRA expert through an explicit router, and is trained by full MCQA training followed by CO-focused and IS-focused refinement.
Figure 6
Figure 6: View-specific LoRA injection with explicit hard routing in V2X-MoE. A viewpoint type signal selects one expert among the vehicle-side, infrastructure-side, and cooperative experts. The selected expert injects low-rank adaptations into the attention projections, while the base transformer parameters remain frozen.
Figure 7
Figure 7: Task-level and viewpoint-level model performance comparison on V2X-QA.
Figure 8
Figure 8: Reliability diagram and confidence distribution of V2X-MoE across vehicle-side, infrastructure-side, and cooperative viewpoint groups.
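
Figure 8 reports a reliability diagram and per-view confidence distributions. As a point of reference, the snippet below sketches the standard expected-calibration-error computation behind such diagrams; the equal-width binning and bin count are assumptions, since the paper's exact calibration metric is not specified in the material above.

```python
import numpy as np

def expected_calibration_error(confidences, correct, n_bins: int = 10) -> float:
    """Equal-width-bin ECE: weighted gap between mean confidence and accuracy.

    `confidences` holds the model's probabilities for its chosen options and
    `correct` is a 0/1 array marking whether each choice was right. This is
    the usual statistic summarized by a reliability diagram; the bin count
    here is an assumed default, not a value taken from the paper.
    """
    confidences = np.asarray(confidences, dtype=float)
    correct = np.asarray(correct, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    ece = 0.0
    for lo, hi in zip(edges[:-1], edges[1:]):
        mask = (confidences > lo) & (confidences <= hi)
        if mask.any():
            gap = abs(correct[mask].mean() - confidences[mask].mean())
            ece += mask.mean() * gap     # bin weight = fraction of samples
    return ece
```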
Original abstract

Multimodal large language models (MLLMs) have shown strong potential for autonomous driving, yet existing benchmarks remain largely ego-centric and therefore cannot systematically assess model performance in infrastructure-centric and cooperative driving conditions. In this work, we introduce V2X-QA, a real-world dataset and benchmark for evaluating MLLMs across vehicle-side, infrastructure-side, and cooperative viewpoints. V2X-QA is built around a view-decoupled evaluation protocol that enables controlled comparison under vehicle-only, infrastructure-only, and cooperative driving conditions within a unified multiple-choice question answering (MCQA) framework. The benchmark is organized into a twelve-task taxonomy spanning perception, prediction, and reasoning and planning, and is constructed through expert-verified MCQA annotation to enable fine-grained diagnosis of viewpoint-dependent capabilities. Benchmark results across ten representative state-of-the-art proprietary and open-source models show that viewpoint accessibility substantially affects performance, and infrastructure-side reasoning supports meaningful macroscopic traffic understanding. Results also indicate that cooperative reasoning remains challenging since it requires cross-view alignment and evidence integration rather than simply additional visual input. To address these challenges, we introduce V2X-MoE, a benchmark-aligned baseline with explicit view routing and viewpoint-specific LoRA experts. The strong performance of V2X-MoE further suggests that explicit viewpoint specialization is a promising direction for multi-view reasoning in autonomous driving. Overall, V2X-QA provides a foundation for studying multi-perspective reasoning, reliability, and cooperative physical intelligence in connected autonomous driving. The dataset and V2X-MoE resources are publicly available at: https://github.com/junwei0001/V2X-QA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces V2X-QA, a real-world dataset and benchmark for evaluating MLLMs in autonomous driving across ego-vehicle, infrastructure, and cooperative viewpoints. It uses a view-decoupled MCQA protocol organized into a 12-task taxonomy spanning perception, prediction, and reasoning/planning, with expert-verified annotations. Benchmarking ten state-of-the-art proprietary and open-source models shows that viewpoint accessibility substantially affects performance, infrastructure views enable macroscopic traffic understanding, and cooperative reasoning is challenging due to the need for cross-view alignment rather than added visual input. The authors also propose V2X-MoE, a baseline with explicit view routing and viewpoint-specific LoRA experts, which outperforms standard models on the benchmark.

Significance. If the expert annotations are shown to be reliable and unbiased, V2X-QA would represent a meaningful advance by providing the first systematic benchmark for multi-view and cooperative reasoning in V2X settings. The empirical results on viewpoint effects and the V2X-MoE architecture offer concrete guidance for developing models suited to connected autonomous driving, while the public dataset release supports further research on reliability and physical intelligence.

major comments (2)
  1. [Section 3] Section 3 (Dataset Construction and Annotation): No inter-annotator agreement statistics or detailed verification protocol (e.g., how expert disagreements on infrastructure vs. ego questions were resolved) are reported. This is load-bearing for the central claim that viewpoint accessibility substantially affects performance, because unquantified annotation biases or coverage gaps in the 12-task taxonomy could produce the observed gaps without reflecting true model capabilities.
  2. [Section 4] Section 4 (Benchmark Results and Analysis): The conclusion that cooperative reasoning requires cross-view alignment and evidence integration (rather than simply additional visual input) rests on the assumption that task phrasing and answer distributions are balanced across views. Without an explicit check for systematic differences in question difficulty or option distributions by viewpoint, the performance gaps may partly reflect annotation artifacts.
minor comments (2)
  1. [Abstract] Abstract: The description of the twelve-task taxonomy would benefit from a one-sentence summary of the category distribution (e.g., how many tasks fall under perception vs. planning) to give readers immediate context.
  2. [Section 3.1] Figure 1 or Section 3.1: The view-decoupled evaluation protocol diagram should explicitly label the three conditions (vehicle-only, infrastructure-only, cooperative) to make the controlled comparison clearer.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback on our manuscript. The concerns regarding annotation reliability and potential distributional artifacts are well-taken and directly relevant to the strength of our central claims. We address each point below and will incorporate the requested additions and clarifications in the revised version.

Point-by-point responses
  1. Referee: [Section 3] Section 3 (Dataset Construction and Annotation): No inter-annotator agreement statistics or detailed verification protocol (e.g., how expert disagreements on infrastructure vs. ego questions were resolved) are reported. This is load-bearing for the central claim that viewpoint accessibility substantially affects performance, because unquantified annotation biases or coverage gaps in the 12-task taxonomy could produce the observed gaps without reflecting true model capabilities.

    Authors: We acknowledge that quantitative inter-annotator agreement (IAA) statistics were omitted from the original submission. In the revised manuscript we will add a new paragraph in Section 3 reporting pairwise agreement rates and Fleiss' kappa computed over the expert annotations. The verification protocol consisted of independent review by two domain experts per item followed by a consensus discussion for any disagreements; viewpoint-specific questions (e.g., infrastructure-only vs. ego-only) were flagged and resolved by a third senior expert when necessary. We will also include a brief table summarizing disagreement rates broken down by task category and viewpoint to demonstrate that coverage gaps do not systematically favor any single view. revision: yes

  2. Referee: [Section 4] Section 4 (Benchmark Results and Analysis): The conclusion that cooperative reasoning requires cross-view alignment and evidence integration (rather than simply additional visual input) rests on the assumption that task phrasing and answer distributions are balanced across views. Without an explicit check for systematic differences in question difficulty or option distributions by viewpoint, the performance gaps may partly reflect annotation artifacts.

    Authors: We agree that an explicit balance audit is necessary to rule out annotation artifacts. In the revised Section 4 we will insert (i) summary statistics on question length, number of options, and lexical complexity stratified by viewpoint, and (ii) a control experiment reporting model performance under shuffled-answer and random-guessing baselines for each view. Our internal verification already shows no statistically significant differences in option entropy or question difficulty across the three views, but these numbers will now be reported transparently. This addition will directly support the claim that the observed cooperative-reasoning gap arises from the need for cross-view alignment rather than from unbalanced task phrasing. revision: yes
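
The first rebuttal point commits to reporting pairwise agreement and Fleiss' kappa. For readers who want to sanity-check such numbers, here is a minimal sketch of the standard Fleiss' kappa computation over a per-item rating-count matrix; the matrix format is an assumption, and no values from the paper are implied.

```python
import numpy as np

def fleiss_kappa(counts) -> float:
    """Fleiss' kappa for an (n_items x n_categories) matrix of rating counts.

    counts[i, j] = number of annotators who assigned item i to category j;
    every item is assumed to have the same number of raters. Sketch of the
    standard statistic the rebuttal proposes to report, nothing more.
    """
    counts = np.asarray(counts, dtype=float)
    n_items, _ = counts.shape
    n_raters = counts[0].sum()
    # Per-category prevalence and per-item observed agreement.
    p_cat = counts.sum(axis=0) / (n_items * n_raters)
    p_item = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    p_obs, p_exp = p_item.mean(), np.square(p_cat).sum()
    return (p_obs - p_exp) / (1.0 - p_exp)
```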
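The second rebuttal point, like the Figure 4 diagnostics, turns on per-task answer-distribution statistics. A minimal sketch of the majority-class ratio and answer-entropy audit, under an assumed item schema, could look like this:

```python
import math
from collections import Counter, defaultdict

def answer_balance_audit(items):
    """Per-task majority-class ratio and entropy over correct-option positions.

    `items` is an iterable of dicts with assumed fields {"task", "answer"},
    where "answer" is the index of the correct option. Mirrors the style of
    check promised in the rebuttal; field names are illustrative only.
    """
    by_task = defaultdict(Counter)
    for item in items:
        by_task[item["task"]][item["answer"]] += 1
    report = {}
    for task, counts in by_task.items():
        n = sum(counts.values())
        probs = [c / n for c in counts.values()]
        entropy = -sum(p * math.log2(p) for p in probs if p > 0)
        report[task] = {"majority_ratio": max(probs), "answer_entropy_bits": entropy}
    return report
```

Near-uniform entropy and a majority ratio close to 1/4 per task would support the authors' claim that the cooperative-reasoning gap is not an artifact of unbalanced answer keys.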

Circularity Check

0 steps flagged

No circularity: empirical dataset and benchmark evaluation

full rationale

The paper introduces V2X-QA as a new real-world dataset with expert-verified MCQA annotations organized into a twelve-task taxonomy spanning perception, prediction, and planning. It evaluates ten existing MLLMs under a view-decoupled protocol and proposes V2X-MoE as a baseline with view routing and LoRA experts. No equations, parameter fitting, or derivations are present. Claims about viewpoint effects and cooperative challenges rest directly on the empirical benchmark results rather than reducing to self-defined quantities, fitted inputs renamed as predictions, or load-bearing self-citations. The construction is self-contained against external model evaluations and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the creation of a new annotated dataset and the assumption that viewpoint-specific performance differences can be isolated through the described protocol; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption Expert verification produces unbiased and comprehensive MCQA annotations that reflect real driving capabilities
    The benchmark construction relies on experts to create and verify questions across the twelve tasks without detailing inter-annotator agreement or validation against ground-truth driving logs.

pith-pipeline@v0.9.0 · 5642 in / 1335 out tokens · 37254 ms · 2026-05-13T20:35:09.288759+00:00 · methodology

discussion (0)


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MDrive: Benchmarking Closed-Loop Cooperative Driving for End-to-End Multi-agent Systems

    cs.RO · 2026-05 · unverdicted · novelty 7.0

    MDrive benchmark shows multi-agent cooperative driving systems generally outperform single-agent ones in closed-loop settings but perception sharing does not always improve planning and negotiation can harm performanc...
