Recognition: 2 theorem links
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
Pith reviewed 2026-05-15 19:15 UTC · model grok-4.3
The pith
Generating one future visual frame lets driving models plan trajectories by preserving spatial and temporal details that text chains of thought discard.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FSDrive operates first as a world model to produce a unified future frame that merges a predicted background with explicit future lane dividers and 3D object boxes; this single imagined scene serves as the visual spatio-temporal CoT. The identical model then switches to an inverse-dynamics role and outputs trajectories conditioned on the current observation plus this visual CoT. A unified pre-training stage adds visual tokens to the vocabulary and jointly trains semantic VQA with future-frame generation, using a progressive curriculum that first enforces structural priors before full scene rendering.
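To make the two-role pipeline concrete, here is a minimal PyTorch-style sketch of the inference loop the claim describes. The model object, its generate method, and the num_visual_tokens_per_frame attribute are hypothetical stand-ins for illustration, not FSDrive's released API.

```python
import torch

@torch.no_grad()
def plan_with_visual_cot(model, obs_tokens):
    # Role 1: world model. Autoregressively emit the visual tokens of one
    # unified future frame (predicted background + lane dividers + 3D boxes).
    cot_tokens = model.generate(
        input_ids=obs_tokens,
        max_new_tokens=model.num_visual_tokens_per_frame,
    )

    # Role 2: inverse dynamics. The *same* weights now condition on the
    # current observation plus the imagined frame and decode a trajectory.
    planner_input = torch.cat([obs_tokens, cot_tokens], dim=-1)
    traj_tokens = model.generate(input_ids=planner_input, max_new_tokens=32)
    return traj_tokens
```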
What carries the argument
The visual spatio-temporal CoT: one generated future frame that encodes both spatial layout and temporal change in a single image for subsequent planning.
If this is right
- Trajectory prediction accuracy rises on nuScenes and NAVSIM benchmarks.
- Collision rates drop when planning uses the generated visual frame instead of text-only reasoning.
- The same lightweight autoregressive model reaches competitive FID scores for future-frame generation.
- Scene-understanding performance improves on DriveLM question-answering tasks.
Where Pith is reading between the lines
- The same visual-CoT pattern could be tested in other embodied tasks such as robotic manipulation where text reasoning discards spatial layout.
- If generation errors are the main failure mode, adding an explicit physics-loss term during pre-training might reduce inherited planning mistakes.
- Replacing the single-frame CoT with a short sequence of frames might further improve long-horizon anticipation without returning to pure text.
Load-bearing premise
The predicted future frame must be physically accurate enough to supply the precise lane and object positions that the planning step actually needs.
What would settle it
Train the model with deliberately inaccurate future frames (shifted lanes or wrong box positions) and check whether trajectory quality and collision rates remain unchanged from the accurate-frame case.
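A sketch of that falsification test, assuming lane dividers are stored as point arrays and boxes as rows whose first two columns are x, y centers; the renderer, planner, and metric are injected placeholders rather than the paper's code.

```python
import numpy as np

def perturb_priors(lanes, boxes, lane_sigma_m=0.5, box_sigma_m=1.0, seed=0):
    """Shift lane-divider points and 3D-box centers by Gaussian noise."""
    rng = np.random.default_rng(seed)
    lanes_bad = lanes + rng.normal(0.0, lane_sigma_m, size=lanes.shape)
    boxes_bad = boxes.copy()
    boxes_bad[:, :2] += rng.normal(0.0, box_sigma_m, size=(len(boxes), 2))
    return lanes_bad, boxes_bad

def prior_sensitivity(scenes, render_cot_frame, plan, l2_error):
    """Mean change in trajectory error when priors are corrupted.

    render_cot_frame, plan, and l2_error are injected callables standing in
    for the model's frame renderer, planner, and metric.
    """
    deltas = []
    for obs, lanes, boxes, gt_traj in scenes:
        clean = l2_error(plan(obs, render_cot_frame(lanes, boxes)), gt_traj)
        lanes_bad, boxes_bad = perturb_priors(lanes, boxes)
        noisy = l2_error(plan(obs, render_cot_frame(lanes_bad, boxes_bad)), gt_traj)
        deltas.append(noisy - clean)
    # Near-zero deltas would mean the planner ignores the priors it is given,
    # undercutting the load-bearing premise above.
    return float(np.mean(deltas))
```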
Original abstract
Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FSDrive, a Vision-Language-Action (VLA) framework for autonomous driving that replaces textual Chains-of-Thought with a visual spatio-temporal CoT. The model first acts as a world model to generate a unified future frame containing a predicted background plus explicit priors (future lane dividers and 3D object boxes); this frame is then used to condition the same VLA acting as an inverse-dynamics planner. A unified pre-training regime expands the vocabulary with visual tokens and jointly optimizes VQA and future-frame prediction via a progressive curriculum. Experiments on nuScenes and NAVSIM report gains in trajectory accuracy and collision reduction together with competitive FID scores for the generated frames.
Significance. If the performance gains can be isolated to the visual CoT representation rather than the joint pre-training or autoregressive architecture, the approach would offer a concrete mechanism for preserving fine-grained spatio-temporal information that textual CoT discards, potentially improving safety and anticipation in end-to-end driving systems. The public code release and multi-benchmark evaluation are positive factors.
major comments (3)
- [Abstract] Abstract and Experiments: the reported improvements in trajectory accuracy and collision reduction are presented without error bars, statistical tests, or ablation studies that isolate the contribution of the visual spatio-temporal CoT from the joint pre-training regime and expanded visual-token vocabulary.
- [Method] Method: the claim that the generated future frame functions as an effective visual CoT for inverse-dynamics planning rests on the unverified assumption that the planner actually conditions on and utilizes the generated frame (with its lane and box priors) rather than ignoring it; no control experiments comparing planning with versus without the generated frame are described.
- [Experiments] Experiments: the potential confound that gains on nuScenes and NAVSIM arise from the autoregressive pre-training or curriculum rather than the specific visual CoT representation is not addressed, undermining the central claim that the visual CoT bridges the perception-planning gap.
minor comments (2)
- [Abstract] Abstract: the acronym FSDrive is introduced without expansion on first use.
- [Abstract] Abstract: FID scores are described as competitive but no numerical values or direct baseline comparisons are supplied.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of empirical validation. We address each major comment below and will revise the manuscript to incorporate additional analyses and experiments as outlined.
Point-by-point responses
- Referee: [Abstract] Abstract and Experiments: the reported improvements in trajectory accuracy and collision reduction are presented without error bars, statistical tests, or ablation studies that isolate the contribution of the visual spatio-temporal CoT from the joint pre-training regime and expanded visual-token vocabulary.
  Authors: We agree that error bars, statistical tests, and isolating ablations would strengthen the presentation. In the revised version we will report mean and standard deviation over at least three random seeds for all nuScenes and NAVSIM metrics, include paired statistical significance tests (Wilcoxon signed-rank), and add an ablation that holds the joint pre-training and visual vocabulary fixed while replacing the visual CoT with a textual CoT baseline. This directly isolates the contribution of the visual representation.
  revision: yes
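For concreteness, a minimal sketch of the promised statistical protocol using scipy.stats.wilcoxon on matched per-scene errors; every number below is a hypothetical placeholder, not a reported result.

```python
import numpy as np
from scipy.stats import wilcoxon

def mean_std(values):
    arr = np.asarray(values, dtype=float)
    return f"{arr.mean():.3f} ± {arr.std(ddof=1):.3f} (n={len(arr)})"

# Paired per-scene trajectory L2 errors (metres) for the two CoT variants,
# evaluated on the same scenes with everything else held fixed.
visual_cot  = np.array([0.58, 0.62, 0.55, 0.60, 0.66, 0.57, 0.63, 0.59])
textual_cot = np.array([0.67, 0.70, 0.64, 0.69, 0.72, 0.66, 0.71, 0.68])

print("visual CoT :", mean_std(visual_cot))
print("textual CoT:", mean_std(textual_cot))

# Paired Wilcoxon signed-rank test on the per-scene differences.
stat, p = wilcoxon(visual_cot, textual_cot)
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.4f}")
```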
- Referee: [Method] Method: the claim that the generated future frame functions as an effective visual CoT for inverse-dynamics planning rests on the unverified assumption that the planner actually conditions on and utilizes the generated frame (with its lane and box priors) rather than ignoring it; no control experiments comparing planning with versus without the generated frame are described.
  Authors: We accept this criticism. The revised manuscript will include explicit control experiments: the inverse-dynamics planner will be evaluated once with the full generated future frame (background + lane dividers + 3D boxes) and once with the future-frame input replaced by a zeroed or randomly noised tensor while keeping all other inputs identical. Performance degradation in the control condition will be reported to confirm that the planner conditions on the visual priors.
  revision: yes
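A sketch of that control condition, assuming the planner exposes a single entry point taking the CoT frame as a tensor; model.plan is a hypothetical interface, not the released one.

```python
import torch

def cot_utilization_probe(model, obs, cot_frame):
    # Evaluate the planner under three conditions, holding `obs` fixed.
    conditions = {
        "generated": cot_frame,                   # the real imagined frame
        "zeroed":    torch.zeros_like(cot_frame), # ablated input
        "noised":    torch.randn_like(cot_frame), # uninformative input
    }
    results = {}
    for name, frame in conditions.items():
        with torch.no_grad():
            results[name] = model.plan(obs, cot_frame=frame)
    # A planner that truly conditions on the visual CoT should degrade
    # sharply under "zeroed" and "noised" relative to "generated".
    return results
```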
- Referee: [Experiments] Experiments: the potential confound that gains on nuScenes and NAVSIM arise from the autoregressive pre-training or curriculum rather than the specific visual CoT representation is not addressed, undermining the central claim that the visual CoT bridges the perception-planning gap.
  Authors: We agree that ruling out this confound is necessary. We will add two new baselines in the revision: (1) the identical autoregressive architecture and curriculum but with textual CoT instead of the visual future frame, and (2) the same joint pre-training without the progressive structural-prior curriculum. These comparisons will be presented alongside the original results to show that the reported gains are attributable to the visual spatio-temporal CoT.
  revision: yes
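One way to organize those baselines is an explicit ablation grid over the two toggled factors, with backbone, data, and compute held identical across cells; the field names below are illustrative, not the paper's config schema.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class AblationConfig:
    cot_modality: str              # "visual" (future frame) or "textual"
    progressive_curriculum: bool   # structural priors first, then full scenes

# Four cells: the original system plus the two confound-isolating baselines
# (and the fourth cell for completeness).
grid = [AblationConfig(m, c) for m, c in product(["visual", "textual"], [True, False])]
for cfg in grid:
    print(cfg)  # each cell trains and evaluates under an identical budget
```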
Circularity Check
No circularity; visual CoT is an additive architectural proposal validated empirically
full rationale
The paper introduces FSDrive as a new VLA framework that generates a visual spatio-temporal CoT (future frame with lane and box priors) via world-model pre-training and then conditions inverse-dynamics planning on it. No equations, fitted parameters, or self-citations are shown that reduce the claimed trajectory gains or perception-planning bridge to the inputs by construction. The progressive curriculum and joint VQA/future-frame optimization are presented as enabling steps, but performance is reported via external benchmarks (nuScenes, NAVSIM, DriveLM) rather than tautological re-derivation. This is a standard empirical systems paper with no load-bearing self-referential loops.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The predicted future frame with lane dividers and 3D boxes is sufficiently accurate and physically plausible to serve as effective reasoning input for trajectory planning.
invented entities (1)
- visual spatio-temporal CoT (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "progressive, easy-to-hard generation method... first infer drivable regions and key object positions... generating coarse-grained future perception images (e.g., lane dividers and 3D detection)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
  ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via ...
- Learning Vision-Language-Action World Models for Autonomous Driving
  VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
- The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
  Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution
  ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.
- Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
  Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
  A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.
- Evaluation as Evolution: Transforming Adversarial Diffusion into Closed-Loop Curricula for Autonomous Vehicles
  E² uses transport-regularized sparse control on learned reverse-time SDEs with topology-driven selection and Topological Anchoring to generate realistic adversarial scenarios, improving collision discovery by 9.01% on...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
  Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
- SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
  SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
- ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
  ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
- Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training
  Sol2Vy transfers vulnerability detection from Solidity to Vyper in zero-shot fashion, outperforming prior methods on reentrancy, weak randomness, and unchecked transfers.
- EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation
  EvoDriveVLA uses collaborative perception-planning distillation with self-anchor and future-aware teachers to fix perception degradation and long-term instability in driving VLA models, reaching SOTA on nuScenes and NAVSIM.
- FAST: A Synergistic Framework of Attention and State-space Models for Spatiotemporal Traffic Prediction
  FAST uses a Temporal-Spatial-Temporal structure with attention and Mamba modules plus learnable embeddings to achieve better accuracy on traffic prediction tasks than previous models.
- FedNSAM: Consistency of Local and Global Flatness for Federated Learning
  FedNSAM uses global Nesterov momentum to make local flatness consistent with global flatness in federated learning, yielding tighter convergence than FedSAM and better empirical performance.
- Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
  Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.