pith. sign in

arxiv: 2605.20942 · v1 · pith:FL35D7A4new · submitted 2026-05-20 · 💻 cs.CV

Bridging Structure and Language: Graph-Based Visual Reasoning for Autonomous Road Understanding

Pith reviewed 2026-05-21 05:21 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language modelsroad scene understandinggraph-based reasoningautonomous drivingstructured supervisioncompositional reasoninglane topology
0
0 comments X

The pith

Graph structure lets small vision-language models master compositional road reasoning with few examples.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Current vision-language models fall short on precise reasoning about lane geometry, topology, and traffic relationships even when large. The paper introduces the Combined Road Substrate, a single graph representation that encodes both geometric structure and open-vocabulary semantics so that complex question-answer pairs can be generated automatically through recursive queries. Training 2- or 4-billion-parameter models on only 20 to 80 such enriched scenes produces stable gains across tasks of different depths. The same models also change their error patterns: relational failures drop while attribute-recognition errors remain, indicating that structured supervision rather than scale is the main missing ingredient.

Core claim

State-of-the-art VLMs struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Verifiable reasoning traces show that CRS-trained models reduce failures at relational scene understanding and leave mainly attribute-recognition errors.

What carries the argument

Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable, supporting recursive graph queries for QA generation plus chain-of-thought traces.

If this is right

  • Small models achieve measurable gains in road geometry and relational reasoning once supplied with graph-derived supervision.
  • The dominant error type moves from relational misunderstanding to attribute recognition after CRS training.
  • Automatic generation of compositionally varied QA pairs scales supervision without manual annotation.
  • Structured graph input can be more decisive for precise road understanding than increasing model size.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph-plus-language substrate could be applied to other spatial domains that require both geometry and semantics, such as indoor robot navigation.
  • If the gains persist on real driving data, hybrid graph-VLM systems become a practical route to safer autonomous perception without ever-larger models.
  • The traceable reasoning traces supplied by CRS open a route to post-hoc verification or correction of model outputs in safety-critical settings.

Load-bearing premise

The QA pairs produced by recursive graph queries are logically valid, free of contradictions, and representative of real-world road complexity.

What would settle it

If human-verified QA pairs drawn from the same maps produce no gains or no shift away from relational errors, the value of the automatic CRS generation process would be called into question.

Figures

Figures reproduced from arXiv: 2605.20942 by Katie Z Luo, Lena Wild, Marco Pavone.

Figure 1
Figure 1. Figure 1: Language descriptions of road scenes are matched to graph primitives through canonical [PITH_FULL_IMAGE:figures/full_fig_p004_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Model performance by bucket with example questions. [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Example output of the qwen3-vl-4b model with CoT (additional examples in Section A.8) [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 5
Figure 5. Figure 5: Accuracy by reasoning depth with 95% CI. Additional results in Section A.7. derived data, we observe substantial improvements in overall accuracy and robustness to reasoning complexity. Notably, even small models exhibit strong gains, particularly on reasoning-heavy tasks (see [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Distribution of all wrong links in the reasoning chain for samples where the model’s final answer is correct. 0 25 50 75 100 Qwen-4B-sft GPT-5.4 Qwen-4B First failure in reasoning chain for incorrect samples (%) anchor misidentified deconstruction wrong property not detected wrong answer selected [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 8
Figure 8. Figure 8: Substituting all potential unique descriptors in relatively simple question templates leads to [PITH_FULL_IMAGE:figures/full_fig_p021_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: The graph creation tool semi-automatically creates a spatio-temporal scene graph, while [PITH_FULL_IMAGE:figures/full_fig_p022_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Per-type accuracy for free-form generation comparing models [PITH_FULL_IMAGE:figures/full_fig_p026_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Prediction frequency distributions for 3 descriptive question types. [PITH_FULL_IMAGE:figures/full_fig_p026_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative examples of the model output when tasked with road reasoning. [PITH_FULL_IMAGE:figures/full_fig_p027_12.png] view at source ↗
read the original abstract

Structured road understanding of lane geometry, topology, and traffic element relationships is foundational to safe autonomous driving. While vision-language models (VLMs) offer promising semantic flexibility, they lack the geometric and relational grounding required for precise road reasoning. Conversely, traditional modular systems, e.g., HD maps and topological road graphs, provide structural precision but remain semantically rigid. To bridge this gap, we introduce the Combined Road Substrate (CRS), a graph-grounded framework that makes geometric road structure and open-vocabulary semantics jointly executable in a single representation. CRS enables the automatic generation of compositionally complex and linguistically varied question-answer pairs via recursive graph queries, augmented with a "grounding for free" mechanism that ensures logical traceability to specific map elements, and procedurally extracted chain-of-thought supervision traces. We demonstrate that state-of-the-art VLMs - including large, closed-source models - struggle significantly with structured road reasoning, yet training a small 2- or 4-billion-parameter model with as few as 20 to 80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth. Analysis of model behavior via verifiable reasoning traces reveals a systematic shift in failure modes: whereas baseline models fail at relational scene understanding, CRS-trained models reduce failures to attribute recognition, suggesting that the primary bottleneck in road understanding is not model scale, but the absence of structured supervision.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Combined Road Substrate (CRS), a graph-grounded framework integrating geometric road structure with open-vocabulary semantics for autonomous driving. It claims that state-of-the-art VLMs struggle with structured road reasoning, while training small 2- or 4-billion-parameter models on only 20-80 CRS-enriched scenes yields stable gains in compositional reasoning tasks of varying depth, accompanied by a shift in failure modes from relational to attribute errors via verifiable reasoning traces.

Significance. If the results hold, the work indicates that structured supervision via graph representations can address key bottlenecks in VLMs for safety-critical road understanding more efficiently than scaling model size or data volume. The automatic generation of traceable QA pairs and chain-of-thought traces from recursive graph queries represents a practical contribution for creating falsifiable, compositionally rich training signals in autonomous driving applications.

major comments (2)
  1. Abstract: the central empirical claim of performance gains and failure-mode shift with 20-80 scenes is asserted without any quantitative metrics, baseline comparisons, statistical significance tests, or details on scene selection and test-set construction, preventing assessment of whether the gains are load-bearing or reproducible.
  2. Method section on recursive graph queries and QA generation: the claim that automatically generated pairs are logically valid, contradiction-free, and representative of real-world road complexity lacks any quantitative error audit, inter-annotator agreement on held-out samples, or comparison against human-authored queries; this directly affects the validity of the conclusion that structured supervision is the primary bottleneck.
minor comments (2)
  1. Figure captions and notation: the 'grounding for free' traceability mechanism would benefit from an explicit diagram showing how graph elements map to generated QA pairs and CoT traces.
  2. Related work: additional citations to recent graph-neural and topological reasoning approaches in autonomous driving would better situate the CRS contribution.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve clarity and rigor without altering the core claims of the work.

read point-by-point responses
  1. Referee: Abstract: the central empirical claim of performance gains and failure-mode shift with 20-80 scenes is asserted without any quantitative metrics, baseline comparisons, statistical significance tests, or details on scene selection and test-set construction, preventing assessment of whether the gains are load-bearing or reproducible.

    Authors: We agree that the abstract would be strengthened by including specific quantitative indicators to allow immediate evaluation of the reported improvements. The full manuscript already contains detailed performance tables, baseline comparisons against unmodified VLMs, and descriptions of the scene selection and test-set construction in the experiments section. We will revise the abstract to incorporate key metrics (e.g., accuracy gains across compositional depths) and a concise reference to the evaluation protocol, while preserving the high-level summary style. revision: yes

  2. Referee: Method section on recursive graph queries and QA generation: the claim that automatically generated pairs are logically valid, contradiction-free, and representative of real-world road complexity lacks any quantitative error audit, inter-annotator agreement on held-out samples, or comparison against human-authored queries; this directly affects the validity of the conclusion that structured supervision is the primary bottleneck.

    Authors: The referee is correct that the current manuscript does not report a quantitative error audit or inter-annotator agreement study for the generated QA pairs. The recursive graph query design ensures traceability and logical consistency by construction through grounding to map elements, but we acknowledge the absence of explicit empirical validation of error rates. We will add a dedicated subsection describing a manual audit performed on a held-out sample of generated queries, including error rates, inter-annotator agreement, and a limited comparison to human-authored queries. This addition will directly support the claim regarding the value of structured supervision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper introduces an independent Combined Road Substrate (CRS) as an external graph representation, uses recursive graph queries to generate QA pairs and chain-of-thought traces, and reports empirical gains from training small models on 20-80 CRS-enriched scenes against state-of-the-art VLMs. These gains and the observed shift in failure modes are measured on tasks produced by the same generation procedure but are not equivalent to the inputs by construction, nor do they rely on self-citation, fitted parameters renamed as predictions, or uniqueness theorems imported from prior author work. The framework supplies its own verifiable traceability mechanism and is evaluated via direct comparison to baseline models, making the central claims externally falsifiable rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

The central claim rests on the premise that graph-derived QA pairs provide faithful structured supervision whose benefits can be isolated from other training factors; this premise is introduced without independent verification in the abstract.

axioms (2)
  • domain assumption Road scenes can be faithfully represented as graphs that jointly encode geometry, topology, and open-vocabulary semantics.
    Invoked when defining the Combined Road Substrate as the substrate for query generation.
  • domain assumption Recursive graph queries produce logically consistent and diverse question-answer pairs suitable for training.
    Required for the claim that CRS-enriched scenes yield stable gains.
invented entities (1)
  • Combined Road Substrate (CRS) no independent evidence
    purpose: Unified graph representation that makes geometric structure and linguistic semantics jointly executable for VLM training and reasoning.
    Newly postulated framework whose properties are used to generate supervision and traceable answers.

pith-pipeline@v0.9.0 · 5775 in / 1484 out tokens · 38110 ms · 2026-05-21T05:21:53.173786+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

60 extracted references · 60 canonical work pages · 2 internal anchors

  1. [1]

    Hdmapnet: An online hd map construction and evaluation framework

    Qi Li, Yue Wang, Yilun Wang, and Hang Zhao. Hdmapnet: An online hd map construction and evaluation framework. In2022 International Conference on Robotics and Automation (ICRA), pages 4628–4634. IEEE, 2022

  2. [2]

    Vectormapnet: End- to-end vectorized hd map learning

    Yicheng Liu, Tianyuan Yuan, Yue Wang, Yilun Wang, and Hang Zhao. Vectormapnet: End- to-end vectorized hd map learning. InInternational conference on machine learning, pages 22352–22369. PMLR, 2023

  3. [3]

    Maptr: Structured modeling and learning for online vectorized hd map construction

    Bencheng Liao, Shaoyu Chen, Xinggang Wang, Tianheng Cheng, Qian Zhang, Wenyu Liu, and Chang Huang. Maptr: Structured modeling and learning for online vectorized hd map construction. InInternational Conference on Learning Representations, 2023

  4. [4]

    Openlane-v2: A topology reasoning benchmark for unified 3d hd mapping.Advances in Neural Information Processing Systems, 36:18873–18884, 2023

    Huijie Wang, Tianyu Li, Yang Li, Li Chen, Chonghao Sima, Zhenbo Liu, Bangjun Wang, Peijin Jia, Yuting Wang, Shengyin Jiang, et al. Openlane-v2: A topology reasoning benchmark for unified 3d hd mapping.Advances in Neural Information Processing Systems, 36:18873–18884, 2023

  5. [5]

    Graph-based topology reasoning for driving scenes.arXiv preprint arXiv:2304.05277, 2023

    Tianyu Li, Li Chen, Huijie Wang, Yang Li, Jiazhi Yang, Xiangwei Geng, Shengyin Jiang, Yuting Wang, Hang Xu, Chunjing Xu, et al. Graph-based topology reasoning for driving scenes.arXiv preprint arXiv:2304.05277, 2023

  6. [6]

    Hdnet: Exploiting hd maps for 3D object detection

    Bin Yang, Ming Liang, and Raquel Urtasun. Hdnet: Exploiting hd maps for 3D object detection. InCoRL, 2018

  7. [7]

    Mp3: A unified model to map, perceive, predict and plan

    Sergio Casas, Abbas Sadat, and Raquel Urtasun. Mp3: A unified model to map, perceive, predict and plan. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14403–14412, 2021

  8. [8]

    Planning-oriented autonomous driving

    Yihan Hu, Jiazhi Yang, Li Chen, Keyu Li, Chonghao Sima, Xizhou Zhu, Siqi Chai, Senyao Du, Tianwei Lin, Wenhai Wang, et al. Planning-oriented autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 17853–17862, 2023

  9. [9]

    Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in neural information processing systems, 36:34892–34916, 2023

  10. [10]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, et al. Qwen technical report.arXiv preprint arXiv:2309.16609, 2023

  11. [11]

    Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models

    Matt Deitke, Christopher Clark, Sangho Lee, Rohun Tripathi, Yue Yang, Jae Sung Park, Mohammadreza Salehi, Niklas Muennighoff, Kyle Lo, Luca Soldaini, et al. Molmo and pixmo: Open weights and open data for state-of-the-art vision-language models. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 91–104, 2025

  12. [12]

    Babyvision: Visual reasoning beyond language

    Liang Chen, Weichu Xie, Yiyan Liang, Hongfeng He, Hans Zhao, Zhibo Yang, Zhiqi Huang, Haoning Wu, Haoyu Lu, Yiping Bao, et al. Babyvision: Visual reasoning beyond language. arXiv preprint arXiv:2601.06521, 2026

  13. [13]

    Technical performance

    Stanford HAI AI Index Team. Technical performance. Technical report, Stanford University Institute for Human-Centered Artificial Intelligence, 2026. Accessed: April 30, 2026. 10

  14. [14]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021

  15. [15]

    Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning.Advances in neural information processing systems, 35:23716–23736, 2022

  16. [16]

    Lever- aging vision-language models for improving domain generalization in image classification

    Sravanti Addepalli, Ashish Ramayee Asokan, Lakshay Sharma, and R Venkatesh Babu. Lever- aging vision-language models for improving domain generalization in image classification. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23922–23932, 2024

  17. [17]

    Gemini 3.1 Pro: A smarter model for your most com- plex tasks

    The Gemini Team. Gemini 3.1 Pro: A smarter model for your most com- plex tasks. https://blog.google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-pro/, February 2026

  18. [18]

    Introducing GPT-5.4

    OpenAI. Introducing GPT-5.4. https://openai.com/index/introducing-gpt-5-4/ , March 2026. OpenAI Blog

  19. [19]

    Claude sonnet 4.5

    Anthropic. Claude sonnet 4.5. https://www.anthropic.com, September 2025. Large language model, released September 29, 2025. Model string: claude-sonnet-4-5-20250929

  20. [20]

    nuscenes: A multimodal dataset for autonomous driving

    Holger Caesar, Varun Bankiti, Alex H Lang, Sourabh V ora, Venice Erin Liong, Qiang Xu, Anush Krishnan, Yu Pan, Giancarlo Baldan, and Oscar Beijbom. nuscenes: A multimodal dataset for autonomous driving. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11621–11631, 2020

  21. [21]

    Argoverse 2: Next generation datasets for self-driving perception and forecasting

    Benjamin Wilson, William Qi, Tanmay Agarwal, John Lambert, Jagjeet Singh, Siddhesh Khandelwal, Bowen Pan, Ratnesh Kumar, Andrew Hartnett, Jhony Kaesemodel Pontes, Deva Ramanan, Peter Carr, and James Hays. Argoverse 2: Next generation datasets for self-driving perception and forecasting. InProceedings of the Neural Information Processing Systems Track on D...

  22. [22]

    Scalability in perception for autonomous driving: Waymo open dataset

    Pei Sun, Henrik Kretzschmar, Xerxes Dotiwalla, Aurelien Chouard, Vijaysai Patnaik, Paul Tsui, James Guo, Yin Zhou, Yuning Chai, Benjamin Caine, Vijay Vasudevan, Wei Han, Jiquan Ngiam, Hang Zhao, Aleksei Timofeev, Scott Ettinger, Maxim Krivokon, Amy Gao, Aditya Joshi, Yu Zhang, Jonathon Shlens, Zhifeng Chen, and Dragomir Anguelov. Scalability in perception...

  23. [23]

    Cross-view transformers for real-time map-view semantic segmentation

    Brady Zhou and Philipp Krähenbühl. Cross-view transformers for real-time map-view semantic segmentation. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 13760–13769, 2022

  24. [24]

    Seif and Xiaolong Hu

    Heiko G. Seif and Xiaolong Hu. Autonomous driving in the icity—hd maps as a key challenge of the automotive industry.Engineering, 2:159–162, 2016

  25. [25]

    Instagram: Instance-level graph modeling for vectorized hd map learning.IEEE Transactions on Intelligent Transportation Systems, 26(2):1889–1899, 2025

    Juyeb Shin, Hyeonjun Jeong, Francois Rameau, and Dongsuk Kum. Instagram: Instance-level graph modeling for vectorized hd map learning.IEEE Transactions on Intelligent Transportation Systems, 26(2):1889–1899, 2025

  26. [26]

    Maptrv2: An end-to-end framework for online vectorized hd map construction.International Journal of Computer Vision, 133(3):1352–1374, 2025

    Bencheng Liao, Shaoyu Chen, Yunchi Zhang, Bo Jiang, Qian Zhang, Wenyu Liu, Chang Huang, and Xinggang Wang. Maptrv2: An end-to-end framework for online vectorized hd map construction.International Journal of Computer Vision, 133(3):1352–1374, 2025

  27. [27]

    Lanesegnet: Map learning with lane segment perception for autonomous driving

    Tianyu Li, Peijin Jia, Bangjun Wang, Li Chen, Kun Jiang, Junchi Yan, and Hongyang Li. Lanesegnet: Map learning with lane segment perception for autonomous driving. InICLR, 2024

  28. [28]

    Topomlp: A simple yet strong pipeline for driving topology reasoning.ICLR, 2024

    Dongming Wu, Jiahao Chang, Fan Jia, Yingfei Liu, Tiancai Wang, and Jianbing Shen. Topomlp: A simple yet strong pipeline for driving topology reasoning.ICLR, 2024. 11

  29. [29]

    Augmenting lane perception and topology understanding with standard definition navigation maps

    Katie Z Luo, Xinshuo Weng, Yan Wang, Shuang Wu, Jie Li, Kilian Q Weinberger, Yue Wang, and Marco Pavone. Augmenting lane perception and topology understanding with standard definition navigation maps. In2024 IEEE International Conference on Robotics and Automation (ICRA), pages 4029–4035. IEEE, 2024

  30. [30]

    Graphad: Interaction scene graph for end-to-end autonomous driving

    Yunpeng Zhang, Deheng Qian, Ding Li, Yifeng Pan, Yong Chen, Zhenbao Liang, Zhiyao Zhang, Yingzong Liu, Jianhui Mei, Maolei Fu, Yun Ye, Zhujin Liang, Yi Shan, and Dalong Du. Graphad: Interaction scene graph for end-to-end autonomous driving. In James Kwok, editor,Proceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJ...

  31. [31]

    Graphpilot: Grounded scene graph conditioning for language-based autonomous driving.arXiv preprint arXiv:2511.11266, 2025

    Fabian Schmidt, Markus Enzweiler, and Abhinav Valada. Graphpilot: Grounded scene graph conditioning for language-based autonomous driving.arXiv preprint arXiv:2511.11266, 2025

  32. [32]

    Vqa: Visual question answering

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C Lawrence Zitnick, and Devi Parikh. Vqa: Visual question answering. InProceedings of the IEEE international conference on computer vision, pages 2425–2433, 2015

  33. [33]

    Making the v in vqa matter: Elevating the role of image understanding in visual question answering

    Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 6904–6913, 2017

  34. [34]

    Gqa: A new dataset for real-world visual reasoning and compositional question answering

    Drew A Hudson and Christopher D Manning. Gqa: A new dataset for real-world visual reasoning and compositional question answering. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6700–6709, 2019

  35. [35]

    Spatialvlm: Endowing vision-language models with spatial reasoning capabilities

    Boyuan Chen, Zhuo Xu, Sean Kirmani, Brian Ichter, Dorsa Sadigh, Leonidas Guibas, and Fei Xia. Spatialvlm: Endowing vision-language models with spatial reasoning capabilities. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14455–14465, 2024

  36. [36]

    Embodied scene understanding for vision language models via metavqa

    Weizhen Wang, Chenda Duan, Zhenghao Peng, Yuxin Liu, and Bolei Zhou. Embodied scene understanding for vision language models via metavqa. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22453–22464, 2025

  37. [37]

    Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario

    Tianwen Qian, Jingjing Chen, Linhai Zhuo, Yang Jiao, and Yu-Gang Jiang. Nuscenes-qa: A multi-modal visual question answering benchmark for autonomous driving scenario. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 4542–4550, 2024

  38. [38]

    Lingoqa: Visual question answering for autonomous driving

    Ana-Maria Marcu, Long Chen, Jan Hünermann, Alice Karnsund, Benoit Hanotte, Prajwal Chidananda, Saurabh Nair, Vijay Badrinarayanan, Alex Kendall, Jamie Shotton, et al. Lingoqa: Visual question answering for autonomous driving. InEuropean Conference on Computer Vision, pages 252–269. Springer, 2024

  39. [39]

    Drivelm: Driving with graph visual question answering

    Chonghao Sima, Katrin Renz, Kashyap Chitta, Li Chen, Hanxue Zhang, Chengen Xie, Jens Beißwenger, Ping Luo, Andreas Geiger, and Hongyang Li. Drivelm: Driving with graph visual question answering. InEuropean conference on computer vision, pages 256–274. Springer, 2024

  40. [40]

    Talk2car: Taking control of your self-driving car

    Thierry Deruyttere, Simon Vandenhende, Dusan Grujicic, Luc Van Gool, and Marie Francine Moens. Talk2car: Taking control of your self-driving car. InProceedings of the 2019 conference on empirical methods in natural language processing and the 9th international joint conference on natural language processing (EMNLP-IJCNLP), pages 2088–2098, 2019

  41. [41]

    Drama: Joint risk localization and captioning in driving

    Srikanth Malla, Chiho Choi, Isht Dwivedi, Joon Hee Choi, and Jiachen Li. Drama: Joint risk localization and captioning in driving. InProceedings of the IEEE/CVF winter conference on applications of computer vision, pages 1043–1052, 2023. 12

  42. [42]

    Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving

    Tushar Choudhary, Vikrant Dewangan, Shivam Chandhok, Shubham Priyadarshan, Anushka Jain, Arun K Singh, Siddharth Srivastava, Krishna Murthy Jatavallabhula, and K Madhava Krishna. Talk2bev: Language-enhanced bird’s-eye view maps for autonomous driving. In 2024 IEEE International Conference on Robotics and Automation (ICRA), pages 16345–16352. IEEE, 2024

  43. [43]

    Drivevlm: The convergence of autonomous driving and large vision-language models

    Xiaoyu Tian, Junru Gu, Bailin Li, Yicheng Liu, Yang Wang, Zhiyong Zhao, Kun Zhan, Peng Jia, Xianpeng Lang, and Hang Zhao. Drivevlm: The convergence of autonomous driving and large vision-language models. InConference on Robot Learning (CoRL), 2024

  44. [44]

    Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving

    Ming Nie, Renyuan Peng, Chunwei Wang, Xinyue Cai, Jianhua Han, Hang Xu, and Li Zhang. Reason2drive: Towards interpretable and chain-based reasoning for autonomous driving. In European Conference on Computer Vision, pages 292–308. Springer, 2024

  45. [45]

    DRIVINGVQA: A dataset for interleaved visual chain-of-thought in real-world driving scenar- ios

    Charles Corbière, Simon Roburin, Syrielle Montariol, Antoine Bosselut, and Alexandre Alahi. DRIVINGVQA: A dataset for interleaved visual chain-of-thought in real-world driving scenar- ios. In Vera Demberg, Kentaro Inui, and Lluís Marquez, editors,Findings of the Association for Computational Linguistics: EACL 2026, pages 3309–3333, Rabat, Morocco, March 2...

  46. [46]

    Show and tell: A neural image caption generator

    Oriol Vinyals, Alexander Toshev, Samy Bengio, and Dumitru Erhan. Show and tell: A neural image caption generator. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3156–3164, 2015

  47. [47]

    Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. InInternational conference on machine learning, pages 19730–19742. PMLR, 2023

  48. [48]

    Gsm- infinite: How do your llms behave over infinitely increas- ing context length and reasoning complexity?arXiv preprint arXiv:2502.05252,

    Yang Zhou, Hongyi Liu, Zhuoming Chen, Yuandong Tian, and Beidi Chen. Gsm-infinite: How do your llms behave over infinitely increasing context length and reasoning complexity?arXiv preprint arXiv:2502.05252, 2025

  49. [49]

    Phantomwiki: On-demand datasets for reasoning and retrieval evaluation

    Albert Gong, Kamil˙e Stankeviˇci¯ut˙e, Chao Wan, Anmol Kabra, Raphael Thesmar, Johann Lee, Julius Klenke, Carla P Gomes, and Kilian Q Weinberger. Phantomwiki: On-demand datasets for reasoning and retrieval evaluation. InForty-second International Conference on Machine Learning, 2025

  50. [50]

    Image retrieval using scene graphs

    Justin Johnson, Ranjay Krishna, Michael Stark, Li-Jia Li, David Shamma, Michael Bernstein, and Li Fei-Fei. Image retrieval using scene graphs. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 3668–3678, 2015

  51. [51]

    Auto-encoding scene graphs for image captioning

    Xu Yang, Kaihua Tang, Hanwang Zhang, and Jianfei Cai. Auto-encoding scene graphs for image captioning. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10685–10694, 2019

  52. [52]

    Explainable and explicit visual reasoning over scene graphs

    Jiaxin Shi, Hanwang Zhang, and Juanzi Li. Explainable and explicit visual reasoning over scene graphs. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8376–8384, 2019

  53. [53]

    Understanding the role of scene graphs in visual question answering.arXiv preprint arXiv:2101.05479, 2021

    Vinay Damodaran, Sharanya Chakravarthy, Akshay Kumar, Anjana Umapathy, Teruko Mita- mura, Yuta Nakashima, Noa Garcia, and Chenhui Chu. Understanding the role of scene graphs in visual question answering.arXiv preprint arXiv:2101.05479, 2021

  54. [54]

    G-retriever: Retrieval-augmented generation for textual graph understanding and question answering.Advances in Neural Information Processing Systems, 37, 2024

    Xiaoxin He, Yijun Tian, Yifei Sun, Nitesh V Chawla, Thomas Laurent, Yann LeCun, Xavier Bresson, and Bryan Hooi. G-retriever: Retrieval-augmented generation for textual graph understanding and question answering.Advances in Neural Information Processing Systems, 37, 2024

  55. [55]

    Graphvis: Boosting llms with visual knowledge graph integration

    Yihe Deng, Chenchen Ye, Zijie Huang, Mingyu Derek Ma, Yiwen Kou, and Wei Wang. Graphvis: Boosting llms with visual knowledge graph integration. In A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang, editors,Advances in Neural Information Processing Systems, volume 37, pages 67511–67534. Curran Associates, Inc., 2024. 13

  56. [56]

    Localization is all you evaluate: Data leakage in online mapping datasets and how to fix it

    Adam Lilja, Junsheng Fu, Erik Stenborg, and Lars Hammarstrand. Localization is all you evaluate: Data leakage in online mapping datasets and how to fix it. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 22150–22159, 2024

  57. [57]

    GPT-4o mini: Advancing Cost-Efficient Intelligence.https://openai.com/index/ gpt-4o-mini-advancing-cost-efficient-intelligence/, July 2024

    OpenAI. GPT-4o mini: Advancing Cost-Efficient Intelligence.https://openai.com/index/ gpt-4o-mini-advancing-cost-efficient-intelligence/, July 2024. OpenAI Blog

  58. [58]

    Kimi K2.6

    Moonshot AI. Kimi K2.6. https://www.kimi.com/blog/kimi-k2-6.html, April 2026. Moonshot AI Blog

  59. [59]

    Qwen3-VL Technical Report

    Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ...

  60. [60]

    What is the direction of d∗(l)?

    Google DeepMind. Gemma 4: Our Most Intelligent Open Models. https://deepmind. google/models/gemma/gemma-4/, April 2026. Google DeepMind. 14 A Supplementary Material A.1 Query instantiation Let Gt = (V t, Et) denote the CRS scene graph at the queried frame and let Gt−w:t denote the temporal window of size w= 4 used by temporal queries. τ(n) is the node typ...