Recognition: 2 theorem links
FutureSightDrive: Thinking Visually with Spatio-Temporal CoT for Autonomous Driving
Pith reviewed 2026-05-15 19:15 UTC · model grok-4.3
The pith
Generating one future visual frame lets driving models plan trajectories by preserving spatial and temporal details that text chains of thought discard.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
FSDrive operates first as a world model to produce a unified future frame that merges a predicted background with explicit future lane dividers and 3D object boxes; this single imagined scene serves as the visual spatio-temporal CoT. The identical model then switches to an inverse-dynamics role and outputs trajectories conditioned on the current observation plus this visual CoT. A unified pre-training stage adds visual tokens to the vocabulary and jointly trains semantic VQA with future-frame generation, using a progressive curriculum that first enforces structural priors before full scene rendering.
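To make the two-role pipeline concrete, here is a minimal PyTorch-style sketch of the inference loop the claim describes. The model object, its generate method, and the num_visual_tokens_per_frame attribute are hypothetical stand-ins for illustration, not FSDrive's released API.

```python
import torch

@torch.no_grad()
def plan_with_visual_cot(model, obs_tokens):
    # Role 1: world model. Autoregressively emit the visual tokens of one
    # unified future frame (predicted background + lane dividers + 3D boxes).
    cot_tokens = model.generate(
        input_ids=obs_tokens,
        max_new_tokens=model.num_visual_tokens_per_frame,
    )

    # Role 2: inverse dynamics. The *same* weights now condition on the
    # current observation plus the imagined frame and decode a trajectory.
    planner_input = torch.cat([obs_tokens, cot_tokens], dim=-1)
    traj_tokens = model.generate(input_ids=planner_input, max_new_tokens=32)
    return traj_tokens
```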
What carries the argument
The visual spatio-temporal CoT: one generated future frame that encodes both spatial layout and temporal change in a single image for subsequent planning.
If this is right
- Trajectory prediction accuracy rises on nuScenes and NAVSIM benchmarks.
- Collision rates drop when planning uses the generated visual frame instead of text-only reasoning.
- The same lightweight autoregressive model reaches competitive FID scores for future-frame generation.
- Scene-understanding performance improves on DriveLM question-answering tasks.
Where Pith is reading between the lines
- The same visual-CoT pattern could be tested in other embodied tasks such as robotic manipulation where text reasoning discards spatial layout.
- If generation errors are the main failure mode, adding an explicit physics-loss term during pre-training might reduce inherited planning mistakes.
- Replacing the single-frame CoT with a short sequence of frames might further improve long-horizon anticipation without returning to pure text.
Load-bearing premise
The predicted future frame must be physically accurate enough to supply the precise lane and object positions that the planning step actually needs.
What would settle it
Train the model with deliberately inaccurate future frames (shifted lanes or wrong box positions) and check whether trajectory quality and collision rates remain unchanged from the accurate-frame case.
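A sketch of that falsification test, assuming lane dividers are stored as point arrays and boxes as rows whose first two columns are x, y centers; the renderer, planner, and metric are injected placeholders rather than the paper's code.

```python
import numpy as np

def perturb_priors(lanes, boxes, lane_sigma_m=0.5, box_sigma_m=1.0, seed=0):
    """Shift lane-divider points and 3D-box centers by Gaussian noise."""
    rng = np.random.default_rng(seed)
    lanes_bad = lanes + rng.normal(0.0, lane_sigma_m, size=lanes.shape)
    boxes_bad = boxes.copy()
    boxes_bad[:, :2] += rng.normal(0.0, box_sigma_m, size=(len(boxes), 2))
    return lanes_bad, boxes_bad

def prior_sensitivity(scenes, render_cot_frame, plan, l2_error):
    """Mean change in trajectory error when priors are corrupted.

    render_cot_frame, plan, and l2_error are injected callables standing in
    for the model's frame renderer, planner, and metric.
    """
    deltas = []
    for obs, lanes, boxes, gt_traj in scenes:
        clean = l2_error(plan(obs, render_cot_frame(lanes, boxes)), gt_traj)
        lanes_bad, boxes_bad = perturb_priors(lanes, boxes)
        noisy = l2_error(plan(obs, render_cot_frame(lanes_bad, boxes_bad)), gt_traj)
        deltas.append(noisy - clean)
    # Near-zero deltas would mean the planner ignores the priors it is given,
    # undercutting the load-bearing premise above.
    return float(np.mean(deltas))
```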
Original abstract
Vision-Language-Action (VLA) models offer significant potential for end-to-end driving, yet their reasoning is often constrained by textual Chains-of-Thought (CoT). This symbolic compression of visual information creates a modality gap between perception and planning by blurring spatio-temporal relations and discarding fine-grained cues. We introduce FSDrive, a framework that empowers VLAs to "think visually" using a novel visual spatio-temporal CoT. FSDrive first operates as a world model, generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT, capturing both spatial structure and temporal evolution in a single representation. The same VLA then functions as an inverse-dynamics model to plan trajectories conditioned on current observations and this visual CoT. We enable this with a unified pre-training paradigm that expands the model's vocabulary with visual tokens and jointly optimizes for semantic understanding (VQA) and future-frame prediction. A progressive curriculum first generates structural priors to enforce physical laws before rendering the full scene. Evaluations on nuScenes and NAVSIM show FSDrive improves trajectory accuracy and reduces collisions, while also achieving competitive FID for video generation with a lightweight autoregressive model and advancing scene understanding on DriveLM. These results confirm that our visual spatio-temporal CoT bridges the perception-planning gap, enabling safer, more anticipatory autonomous driving. Code is available at https://github.com/MIV-XJTU/FSDrive.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces FSDrive, a Vision-Language-Action (VLA) framework for autonomous driving that replaces textual Chains-of-Thought with a visual spatio-temporal CoT. The model first acts as a world model to generate a unified future frame containing a predicted background plus explicit priors (future lane dividers and 3D object boxes); this frame is then used to condition the same VLA acting as an inverse-dynamics planner. A unified pre-training regime expands the vocabulary with visual tokens and jointly optimizes VQA and future-frame prediction via a progressive curriculum. Experiments on nuScenes and NAVSIM report gains in trajectory accuracy and collision reduction together with competitive FID scores for the generated frames.
Significance. If the performance gains can be isolated to the visual CoT representation rather than the joint pre-training or autoregressive architecture, the approach would offer a concrete mechanism for preserving fine-grained spatio-temporal information that textual CoT discards, potentially improving safety and anticipation in end-to-end driving systems. The public code release and multi-benchmark evaluation are positive factors.
major comments (3)
- [Abstract] Abstract and Experiments: the reported improvements in trajectory accuracy and collision reduction are presented without error bars, statistical tests, or ablation studies that isolate the contribution of the visual spatio-temporal CoT from the joint pre-training regime and expanded visual-token vocabulary.
- [Method] Method: the claim that the generated future frame functions as an effective visual CoT for inverse-dynamics planning rests on the unverified assumption that the planner actually conditions on and utilizes the generated frame (with its lane and box priors) rather than ignoring it; no control experiments comparing planning with versus without the generated frame are described.
- [Experiments] Experiments: the potential confound that gains on nuScenes and NAVSIM arise from the autoregressive pre-training or curriculum rather than the specific visual CoT representation is not addressed, undermining the central claim that the visual CoT bridges the perception-planning gap.
minor comments (2)
- [Abstract] Abstract: the acronym FSDrive is introduced without expansion on first use.
- [Abstract] Abstract: FID scores are described as competitive but no numerical values or direct baseline comparisons are supplied.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important aspects of empirical validation. We address each major comment below and will revise the manuscript to incorporate additional analyses and experiments as outlined.
Point-by-point responses
- Referee: [Abstract] Abstract and Experiments: the reported improvements in trajectory accuracy and collision reduction are presented without error bars, statistical tests, or ablation studies that isolate the contribution of the visual spatio-temporal CoT from the joint pre-training regime and expanded visual-token vocabulary.
  Authors: We agree that error bars, statistical tests, and isolating ablations would strengthen the presentation. In the revised version we will report mean and standard deviation over at least three random seeds for all nuScenes and NAVSIM metrics, include paired statistical significance tests (Wilcoxon signed-rank), and add an ablation that holds the joint pre-training and visual vocabulary fixed while replacing the visual CoT with a textual CoT baseline. This directly isolates the contribution of the visual representation.
  revision: yes
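For concreteness, a minimal sketch of the promised statistical protocol using scipy.stats.wilcoxon on matched per-scene errors; every number below is a hypothetical placeholder, not a reported result.

```python
import numpy as np
from scipy.stats import wilcoxon

def mean_std(values):
    arr = np.asarray(values, dtype=float)
    return f"{arr.mean():.3f} ± {arr.std(ddof=1):.3f} (n={len(arr)})"

# Paired per-scene trajectory L2 errors (metres) for the two CoT variants,
# evaluated on the same scenes with everything else held fixed.
visual_cot  = np.array([0.58, 0.62, 0.55, 0.60, 0.66, 0.57, 0.63, 0.59])
textual_cot = np.array([0.67, 0.70, 0.64, 0.69, 0.72, 0.66, 0.71, 0.68])

print("visual CoT :", mean_std(visual_cot))
print("textual CoT:", mean_std(textual_cot))

# Paired Wilcoxon signed-rank test on the per-scene differences.
stat, p = wilcoxon(visual_cot, textual_cot)
print(f"Wilcoxon signed-rank: W={stat:.1f}, p={p:.4f}")
```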
- Referee: [Method] Method: the claim that the generated future frame functions as an effective visual CoT for inverse-dynamics planning rests on the unverified assumption that the planner actually conditions on and utilizes the generated frame (with its lane and box priors) rather than ignoring it; no control experiments comparing planning with versus without the generated frame are described.
  Authors: We accept this criticism. The revised manuscript will include explicit control experiments: the inverse-dynamics planner will be evaluated once with the full generated future frame (background + lane dividers + 3D boxes) and once with the future-frame input replaced by a zeroed or randomly noised tensor while keeping all other inputs identical. Performance degradation in the control condition will be reported to confirm that the planner conditions on the visual priors.
  revision: yes
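A sketch of that control condition, assuming the planner exposes a single entry point taking the CoT frame as a tensor; model.plan is a hypothetical interface, not the released one.

```python
import torch

def cot_utilization_probe(model, obs, cot_frame):
    # Evaluate the planner under three conditions, holding `obs` fixed.
    conditions = {
        "generated": cot_frame,                   # the real imagined frame
        "zeroed":    torch.zeros_like(cot_frame), # ablated input
        "noised":    torch.randn_like(cot_frame), # uninformative input
    }
    results = {}
    for name, frame in conditions.items():
        with torch.no_grad():
            results[name] = model.plan(obs, cot_frame=frame)
    # A planner that truly conditions on the visual CoT should degrade
    # sharply under "zeroed" and "noised" relative to "generated".
    return results
```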
- Referee: [Experiments] Experiments: the potential confound that gains on nuScenes and NAVSIM arise from the autoregressive pre-training or curriculum rather than the specific visual CoT representation is not addressed, undermining the central claim that the visual CoT bridges the perception-planning gap.
  Authors: We agree that ruling out this confound is necessary. We will add two new baselines in the revision: (1) the identical autoregressive architecture and curriculum but with textual CoT instead of the visual future frame, and (2) the same joint pre-training without the progressive structural-prior curriculum. These comparisons will be presented alongside the original results to show that the reported gains are attributable to the visual spatio-temporal CoT.
  revision: yes
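One way to organize those baselines is an explicit ablation grid over the two toggled factors, with backbone, data, and compute held identical across cells; the field names below are illustrative, not the paper's config schema.

```python
from dataclasses import dataclass
from itertools import product

@dataclass
class AblationConfig:
    cot_modality: str              # "visual" (future frame) or "textual"
    progressive_curriculum: bool   # structural priors first, then full scenes

# Four cells: the original system plus the two confound-isolating baselines
# (and the fourth cell for completeness).
grid = [AblationConfig(m, c) for m, c in product(["visual", "textual"], [True, False])]
for cfg in grid:
    print(cfg)  # each cell trains and evaluates under an identical budget
```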
Circularity Check
No circularity; visual CoT is an additive architectural proposal validated empirically
full rationale
The paper introduces FSDrive as a new VLA framework that generates a visual spatio-temporal CoT (future frame with lane and box priors) via world-model pre-training and then conditions inverse-dynamics planning on it. No equations, fitted parameters, or self-citations are shown that reduce the claimed trajectory gains or perception-planning bridge to the inputs by construction. The progressive curriculum and joint VQA/future-frame optimization are presented as enabling steps, but performance is reported via external benchmarks (nuScenes, NAVSIM, DriveLM) rather than tautological re-derivation. This is a standard empirical systems paper with no load-bearing self-referential loops.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: The predicted future frame with lane dividers and 3D boxes is sufficiently accurate and physically plausible to serve as effective reasoning input for trajectory planning.
invented entities (1)
- visual spatio-temporal CoT (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "generating a unified future frame that combines a predicted background with explicit, physically-plausible priors like future lane dividers and 3D object boxes. This imagined scene serves as the visual spatio-temporal CoT"
- IndisputableMonolith/Foundation/AlexanderDuality.lean, theorem alexander_duality_circle_linking (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "progressive, easy-to-hard generation method... first infer drivable regions and key object positions... generating coarse-grained future perception images (e.g., lane dividers and 3D detection)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 22 Pith papers
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 introduces a unified streaming VLA with shared backbone, framewise memory, and language-guided action diffusion that surpasses human drivers on WOD-E2E planning metrics.
- ConeSep: Cone-based Robust Noise-Unlearning Compositional Network for Composed Image Retrieval
  ConeSep tackles noisy triplet correspondences in composed image retrieval by introducing geometric fidelity quantization to locate noise, negative boundary learning for semantic opposites, and targeted unlearning via ...
- Learning Vision-Language-Action World Models for Autonomous Driving
  VLA-World improves autonomous driving by using action-guided future image generation followed by reflective reasoning over the imagined scene to refine trajectories.
- The Blind Spot of Adaptation: Quantifying and Mitigating Forgetting in Fine-tuned Driving Models
  Fine-tuning VLMs for driving erodes pre-trained world knowledge, but shifting adaptation to prompt space via the Drive Expert Adapter preserves generalization while improving task performance.
- MindVLA-U1: VLA Beats VA with Unified Streaming Architecture for Autonomous Driving
  MindVLA-U1 is the first unified streaming VLA architecture that surpasses human drivers on WOD-E2E planning metrics while matching VA latency and preserving language interfaces.
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA extracts semantic, geometric, dynamic, and trajectory expert tokens from multi-source supervision and feeds them into a diffusion-based hierarchical planner, achieving competitive collision avoidance and t...
- CoWorld-VLA: Thinking in a Multi-Expert World Model for Autonomous Driving
  CoWorld-VLA encodes world information into four expert tokens that condition a diffusion-based planner, yielding competitive collision avoidance and trajectory accuracy on the NAVSIM benchmark.
- ProDrive: Proactive Planning for Autonomous Driving via Ego-Environment Co-Evolution
  ProDrive couples a query-centric planner with a BEV world model for end-to-end ego-environment co-evolution, enabling future-outcome assessment that improves safety and efficiency over reactive baselines on NAVSIM v1.
- Air-Know: Arbiter-Calibrated Knowledge-Internalizing Robust Network for Composed Image Retrieval
  Air-Know decouples MLLM-based external arbitration from proxy learning via knowledge internalization and dual-stream training to overcome noisy triplet correspondence in composed image retrieval.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL is the first latent CoT method to exceed explicit CoT accuracy on four driving benchmarks while running at answer-only speed, by supervising latent tokens with a visual world model decoder.
- Xiaomi OneVL: One-Step Latent Reasoning and Planning with Vision-Language Explanation
  OneVL achieves superior accuracy to explicit chain-of-thought reasoning at answer-only latency by supervising latent tokens with a visual world model decoder that predicts future frames.
- Think in Latent Thoughts: A New Paradigm for Gloss-Free Sign Language Translation
  A new SLT framework uses latent thoughts as a middle reasoning layer and plan-then-ground decoding to improve coherence and faithfulness in gloss-free sign language translation.
- Evaluation as Evolution: Transforming Adversarial Diffusion into Closed-Loop Curricula for Autonomous Vehicles
  E² uses transport-regularized sparse control on learned reverse-time SDEs with topology-driven selection and Topological Anchoring to generate realistic adversarial scenarios, improving collision discovery by 9.01% on...
- EponaV2: Driving World Model with Comprehensive Future Reasoning
  EponaV2 advances perception-free driving world models by forecasting comprehensive future 3D geometry and semantic representations, achieving SOTA planning performance on NAVSIM benchmarks.
- Latency Analysis and Optimization of Alpamayo 1 via Efficient Trajectory Generation
  Redesigning Alpamayo 1 to single-reasoning and optimizing diffusion action generation cuts inference latency by 69.23% while preserving trajectory diversity and prediction quality.
- SpanVLA: Efficient Action Bridging and Learning from Negative-Recovery Samples for Vision-Language-Action Model
  SpanVLA reduces action generation latency via flow-matching conditioned on history and improves robustness by training on negative-recovery samples with GRPO and a dedicated reasoning dataset.
- ActorMind: Emulating Human Actor Reasoning for Speech Role-Playing
  ActorMind is a four-agent chain-of-thought framework that emulates human actors to produce spontaneous, emotion-infused speech responses for role-playing scenarios.
- Zero-Shot Vulnerability Detection in Low-Resource Smart Contracts Through Solidity-Only Training
  Sol2Vy transfers vulnerability detection from Solidity to Vyper in zero-shot fashion, outperforming prior methods on reentrancy, weak randomness, and unchecked transfers.
- EvoDriveVLA: Evolving Driving VLA Models via Collaborative Perception-Planning Distillation
  EvoDriveVLA uses collaborative perception-planning distillation with self-anchor and future-aware teachers to fix perception degradation and long-term instability in driving VLA models, reaching SOTA on nuScenes and NAVSIM.
- FAST: A Synergistic Framework of Attention and State-space Models for Spatiotemporal Traffic Prediction
  FAST uses a Temporal-Spatial-Temporal structure with attention and Mamba modules plus learnable embeddings to achieve better accuracy on traffic prediction tasks than previous models.
- FedNSAM: Consistency of Local and Global Flatness for Federated Learning
  FedNSAM uses global Nesterov momentum to make local flatness consistent with global flatness in federated learning, yielding tighter convergence than FedSAM and better empirical performance.
- Reinforcement Learning for Scalable and Trustworthy Intelligent Systems
  Reinforcement learning is advanced for communication-efficient federated optimization and for preference-aligned, contextually safe policies in large language models.