pith. machine review for the scientific record.

arxiv: 2505.17685 · v3 · submitted 2025-05-23 · 💻 cs.CV

Recognition: 2 Lean theorem links

FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 19:15 UTC · model grok-4.3

classification 💻 cs.CV
keywords autonomous driving · vision-language-action · chain of thought · future frame prediction · trajectory planning · world model · spatio-temporal reasoning

The pith

Generating one future visual frame lets driving models plan trajectories by preserving spatial and temporal details that text chains of thought discard.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that vision-language-action models for autonomous driving lose critical information when they compress scenes into text-based chains of thought. It introduces a method that first generates a single future frame containing predicted background, lane dividers, and 3D object boxes. This frame functions as a visual chain of thought that the same model then uses to output driving trajectories. Training jointly optimizes future-frame prediction and scene understanding so the model learns both to imagine and to act. Experiments on standard driving benchmarks report higher trajectory accuracy and fewer collisions while maintaining competitive image quality.
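
As a concrete reading of that joint objective, a single autoregressive model can be trained with one next-token loss over a shared vocabulary of text and visual tokens. The sketch below is exactly that and nothing more: the HF-style `model(...).logits` interface, the batch layout, and the `lambda_img` weight are assumptions for illustration, not the paper's released training code.

```python
import torch.nn.functional as F

def joint_step(model, vqa_batch, frame_batch, lambda_img=1.0):
    """One joint training step: next-token cross-entropy over a vocabulary
    holding both text tokens and visual (VQ) codes. Interface, batch layout,
    and lambda_img are illustrative assumptions; prompt masking is omitted."""
    # Semantic understanding: predict VQA answer tokens.
    vqa_logits = model(vqa_batch["input_ids"]).logits            # (B, T, V)
    loss_vqa = F.cross_entropy(
        vqa_logits[:, :-1].flatten(0, 1),                        # (B*(T-1), V)
        vqa_batch["input_ids"][:, 1:].flatten(),                 # shift-by-one targets
    )
    # Future-frame prediction: same loss, but the targets are visual tokens.
    img_logits = model(frame_batch["input_ids"]).logits
    loss_img = F.cross_entropy(
        img_logits[:, :-1].flatten(0, 1),
        frame_batch["input_ids"][:, 1:].flatten(),
    )
    return loss_vqa + lambda_img * loss_img
```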

Core claim

FSDrive operates first as a world model to produce a unified future frame that merges a predicted background with explicit future lane dividers and 3D object boxes; this single imagined scene serves as the visual spatio-temporal CoT. The identical model then switches to an inverse-dynamics role and outputs trajectories conditioned on the current observation plus this visual CoT. A unified pre-training stage adds visual tokens to the vocabulary and jointly trains semantic VQA with future-frame generation, using a progressive curriculum that enforces structural priors before rendering the full scene.
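
Read operationally, the core claim is two passes through one set of weights. The sketch below renders that loop under assumed interfaces: `vla.encode_image`, the `task` switch, `frame_token_count`, and `decode_trajectory` are hypothetical stand-ins, not the API of the released FSDrive code.

```python
import torch

@torch.no_grad()
def plan_with_visual_cot(vla, current_obs):
    """Pass 1: imagine one future frame. Pass 2: plan conditioned on it.
    All method names are illustrative assumptions."""
    obs_tokens = vla.encode_image(current_obs)       # visual tokens for the current scene

    # Pass 1 (world model): autoregressively emit the unified future frame
    # (predicted background + future lane dividers + 3D object boxes).
    future_tokens = vla.generate(
        prompt=obs_tokens,
        task="predict_future_frame",                 # hypothetical task switch
        max_new_tokens=vla.frame_token_count,
    )

    # Pass 2 (inverse dynamics): the same weights now condition on the current
    # observation plus the imagined frame, the visual spatio-temporal CoT.
    traj_tokens = vla.generate(
        prompt=torch.cat([obs_tokens, future_tokens], dim=-1),
        task="plan_trajectory",
        max_new_tokens=32,
    )
    return vla.decode_trajectory(traj_tokens)        # e.g. (x, y) waypoints
```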

What carries the argument

The visual spatio-temporal CoT: one generated future frame that encodes both spatial layout and temporal change in a single image for subsequent planning.

If this is right

  • Trajectory prediction accuracy rises on nuScenes and NAVSIM benchmarks.
  • Collision rates drop when planning uses the generated visual frame instead of text-only reasoning.
  • The same lightweight autoregressive model reaches competitive FID scores for future-frame generation.
  • Scene-understanding performance improves on DriveLM question-answering tasks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same visual-CoT pattern could be tested in other embodied tasks such as robotic manipulation where text reasoning discards spatial layout.
  • If generation errors are the main failure mode, adding an explicit physics-loss term during pre-training might reduce inherited planning mistakes.
  • Replacing the single-frame CoT with a short sequence of frames might further improve long-horizon anticipation without returning to pure text.

Load-bearing premise

The predicted future frame must be physically accurate enough to supply the precise lane and object positions that the planning step actually needs.

What would settle it

Train the model with deliberately inaccurate future frames (shifted lanes or wrong box positions) and check whether trajectory quality and collision rates remain unchanged from the accurate-frame case.
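
A hedged sketch of that settling experiment, assuming the visual CoT arrives as a rasterized frame and that crude array shifts can stand in for shifted lanes and displaced boxes (a real run would perturb the lane and box priors before rendering); the `planner` and metric callables are placeholders:

```python
import numpy as np

def corrupt_frame(frame, lane_shift_px=20, box_jitter_px=15, seed=0):
    """Deliberately wrong future frame: a lateral roll stands in for shifted
    lane dividers, a vertical roll for displaced 3D object boxes."""
    rng = np.random.default_rng(seed)
    bad = np.roll(frame, lane_shift_px, axis=1)      # lateral (column) shift
    return np.roll(bad, int(rng.integers(-box_jitter_px, box_jitter_px + 1)), axis=0)

def settling_experiment(planner, scenes, traj_error, collision_rate):
    """If metrics are unchanged under corruption, the planner is ignoring the
    visual CoT and the load-bearing premise does not bear load."""
    clean = [planner(s.obs, s.future_frame) for s in scenes]
    broken = [planner(s.obs, corrupt_frame(s.future_frame)) for s in scenes]
    return {
        "clean":  (traj_error(clean), collision_rate(clean)),
        "broken": (traj_error(broken), collision_rate(broken)),
    }
```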

Original abstract

Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it: the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces FSDrive, a Vision-Language-Action (VLA) framework for autonomous driving that replaces textual Chains-of-Thought with a visual spatio-temporal CoT. The model first acts as a world model to generate a unified future frame containing a predicted background plus explicit priors (future lane dividers and 3D object boxes); this frame is then used to condition the same VLA acting as an inverse-dynamics planner. A unified pre-training regime expands the vocabulary with visual tokens and jointly optimizes VQA and future-frame prediction via a progressive curriculum. Experiments on nuScenes and NAVSIM report gains in trajectory accuracy and collision reduction together with competitive FID scores for the generated frames.

Significance. If the performance gains can be isolated to the visual CoT representation rather than the joint pre-training or autoregressive architecture, the approach would offer a concrete mechanism for preserving fine-grained spatio-temporal information that textual CoT discards, potentially improving safety and anticipation in end-to-end driving systems. The public code release and multi-benchmark evaluation are positive factors.

major comments (3)
  1. [Abstract] Abstract and Experiments: the reported improvements in trajectory accuracy and collision reduction are presented without error bars, statistical tests, or ablation studies that isolate the contribution of the visual spatio-temporal CoT from the joint pre-training regime and expanded visual-token vocabulary.
  2. [Method] Method: the claim that the generated future frame functions as an effective visual CoT for inverse-dynamics planning rests on the unverified assumption that the planner actually conditions on and utilizes the generated frame (with its lane and box priors) rather than ignoring it; no control experiments comparing planning with versus without the generated frame are described.
  3. [Experiments] Experiments: the potential confound that gains on nuScenes and NAVSIM arise from the autoregressive pre-training or curriculum rather than the specific visual CoT representation is not addressed, undermining the central claim that the visual CoT bridges the perception-planning gap.
minor comments (2)
  1. [Abstract] Abstract: the acronym FSDrive is introduced without expansion on first use.
  2. [Abstract] Abstract: FID scores are described as competitive but no numerical values or direct baseline comparisons are supplied.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights important aspects of empirical validation. We address each major comment below and will revise the manuscript to incorporate additional analyses and experiments as outlined.

Point-by-point responses
  1. Referee: [Abstract] Abstract and Experiments: the reported improvements in trajectory accuracy and collision reduction are presented without error bars, statistical tests, or ablation studies that isolate the contribution of the visual spatio-temporal CoT from the joint pre-training regime and expanded visual-token vocabulary.

    Authors: We agree that error bars, statistical tests, and isolating ablations would strengthen the presentation. In the revised version we will report mean and standard deviation over at least three random seeds for all nuScenes and NAVSIM metrics, include paired statistical significance tests (Wilcoxon signed-rank; a minimal sketch of this test appears after these responses), and add an ablation that holds the joint pre-training and visual vocabulary fixed while replacing the visual CoT with a textual CoT baseline. This directly isolates the contribution of the visual representation. revision: yes

  2. Referee: [Method] Method: the claim that the generated future frame functions as an effective visual CoT for inverse-dynamics planning rests on the unverified assumption that the planner actually conditions on and utilizes the generated frame (with its lane and box priors) rather than ignoring it; no control experiments comparing planning with versus without the generated frame are described.

    Authors: We accept this criticism. The revised manuscript will include explicit control experiments: the inverse-dynamics planner will be evaluated once with the full generated future frame (background + lane dividers + 3D boxes) and once with the future-frame input replaced by a zeroed or randomly noised tensor while keeping all other inputs identical. Performance degradation in the control condition will be reported to confirm that the planner conditions on the visual priors (see the control sketch after these responses). revision: yes

  3. Referee: [Experiments] Experiments: the potential confound that gains on nuScenes and NAVSIM arise from the autoregressive pre-training or curriculum rather than the specific visual CoT representation is not addressed, undermining the central claim that the visual CoT bridges the perception-planning gap.

    Authors: We agree that ruling out this confound is necessary. We will add two new baselines in the revision: (1) the identical autoregressive architecture and curriculum but with textual CoT instead of the visual future frame, and (2) the same joint pre-training without the progressive structural-prior curriculum. These comparisons will be presented alongside the original results to show that the reported gains are attributable to the visual spatio-temporal CoT. revision: yes
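
On point 1, the promised paired test is standard machinery. A minimal sketch with synthetic placeholder numbers (not paper results), pairing per-scene L2 errors and calling `scipy.stats.wilcoxon`:

```python
import numpy as np
from scipy.stats import wilcoxon

rng = np.random.default_rng(0)
# Synthetic per-scene L2 trajectory errors, paired over the same 200 scenes.
visual_cot = rng.normal(0.60, 0.10, size=200)
textual_cot = visual_cot + rng.normal(0.08, 0.05, size=200)   # assumed worse

# One-sided test: is the textual-CoT error stochastically larger?
stat, p = wilcoxon(textual_cot, visual_cot, alternative="greater")
print(f"visual {visual_cot.mean():.3f}, textual {textual_cot.mean():.3f}, p = {p:.2e}")
```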
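
On point 2, the committed control condition compresses to a single comparison; the `planner` signature below is an assumed stand-in for the actual model call:

```python
import torch

@torch.no_grad()
def visual_cot_control(planner, obs, future_frame, gt_traj, metric):
    """Plan with the generated frame vs. uninformative stand-ins; a large gap
    indicates the planner genuinely conditions on the visual CoT."""
    conditions = {
        "generated": future_frame,
        "zeroed":    torch.zeros_like(future_frame),
        "noised":    torch.rand_like(future_frame),   # random tensor in place of the frame
    }
    return {name: metric(planner(obs, frame), gt_traj)
            for name, frame in conditions.items()}
```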

Circularity Check

0 steps flagged

No circularity; visual CoT is an additive architectural proposal validated empirically

full rationale

The paper introduces FSDrive as a new VLA framework that generates a visual spatio-temporal CoT (future frame with lane and box priors) via world-model pre-training and then conditions inverse-dynamics planning on it. No equations, fitted parameters, or self-citations are shown that reduce the claimed trajectory gains or perception-planning bridge to the inputs by construction. The progressive curriculum and joint VQA/future-frame optimization are presented as enabling steps, but performance is reported via external benchmarks (nuScenes, NAVSIM, DriveLM) rather than tautological re-derivation. This is a standard empirical systems paper with no load-bearing self-referential loops.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the domain assumption that a single predicted frame can encode the necessary spatio-temporal relations without loss; no free parameters or invented physical entities are named in the abstract.

axioms (1)
  • domain assumption The predicted future frame with lane dividers and 3D boxes is sufficiently accurate and physically plausible to serve as effective reasoning input for trajectory planning.
    Invoked when the same VLA is said to function as an inverse-dynamics model conditioned on this visual CoT.
invented entities (1)
  • visual spatio-temporal CoT (no independent evidence)
    purpose: Unified future frame that replaces textual CoT to preserve spatial and temporal information.
    New representation introduced by the paper; no independent falsifiable handle outside model performance is provided in the abstract.

pith-pipeline@v0.9.0 · 5602 in / 1380 out tokens · 36163 ms · 2026-05-15T19:15:54.700395+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 7.0

    MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.

  2. ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via ...

  3. Learning Vision-Language-Action World Models for Autonomous Driving

    cs.CV 2026-04 unverdicted novelty 7.0

    VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.

  4. The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models

    cs.CV 2026-04 unverdicted novelty 7.0

    Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.

  5. MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving

    cs.RO 2026-05 unverdicted novelty 6.0

    MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.

  6. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...

  7. CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.

  8. ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution

    cs.RO 2026-04 unverdicted novelty 6.0

    ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.

  9. Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval

    cs.CV 2026-04 unverdicted novelty 6.0

    Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.

  10. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.

  11. Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation

    cs.CV 2026-04 unverdicted novelty 6.0

    OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.

  12. Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation

    cs.CV 2026-04 unverdicted novelty 6.0

    A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.

  13. Evaluation as Evolution: Transforming Adversarial Diffusion into Closed-Loop Curricula for Autonomous Vehicles

    cs.RO 2026-04 unverdicted novelty 6.0

    E² uses transport-regularized sparse control on learned reverse-time SDEs with topology-driven selection and Topological Anchoring to generate realistic adversarial scenarios, improving collision discovery by 9.01% on...

  14. EponaV2: Driving World Model with Comprehensive Future Reasoning

    cs.CV 2026-05 unverdicted novelty 5.0

    EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.

  15. Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation

    cs.AI 2026-05 unverdicted novelty 5.0

    Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.

  16. SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model

    cs.CV 2026-04 unverdicted novelty 5.0

    SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.

  17. ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing

    cs.SD 2026-04 unverdicted novelty 5.0

    ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.

  18. Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training

    cs.CR 2026-03 unverdicted novelty 5.0

    Sol2Vy transfers vulnerability detection from Solidity to Vyper in zero-shot fashion, outperforming prior methods on reentrancy, weak randomness, and unchecked transfers.

  19. EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation

    cs.CV 2026-03 unverdicted novelty 5.0

    EvoDriveVLA uses collaborative perception-planning distillation with self-anchor and future-aware teachers to fix perception degradation and long-term instability in driving VLA models, reaching SOTA on nuScenes and NAVSIM.

  20. FAST: A Synergistic Framework of Attention and State-space Models for Spatiotemporal Traffic Prediction

    cs.LG 2026-04 unverdicted novelty 4.0

    FAST uses a Temporal-Spatial-Temporal structure with attention and Mamba modules plus learnable embeddings to achieve better accuracy on traffic prediction tasks than previous models.

  21. FedNSAM: Consistency of Local and Global Flatness for Federated Learning

    cs.LG 2026-02 unverdicted novelty 4.0

    FedNSAM uses global Nesterov momentum to make local flatness consistent with global flatness in federated learning, yielding tighter convergence than FedSAM and better empirical performance.

  22. Reinforcement Learning for Scalable and Trustworthy Intelligent Systems

    cs.LG 2026-05 unverdicted novelty 3.0

    Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.
