HERMES++: Toward a Unified Driving World Model for 3D Scene Understanding and Generation
Pith reviewed 2026-05-07 05:27 UTC · model grok-4.3
The pith
HERMES++ unifies 3D scene understanding and future geometry prediction for driving environments in one model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
HERMES++ is a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. It uses a bird's-eye-view (BEV) representation to consolidate multi-view spatial information into a structure compatible with LLMs, introduces LLM-enhanced world queries to transfer knowledge from the understanding branch, designs a Current-to-Future Link that conditions geometric evolution on semantic context, and employs a Joint Geometric Optimization strategy combining explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks show that the model outperforms specialist approaches in both future point cloud prediction and 3D scene understanding.
What carries the argument
Four synergistic designs: BEV representation consolidation for multi-view spatial data, LLM-enhanced world queries for semantic knowledge transfer, Current-to-Future Link for temporal conditioning, and Joint Geometric Optimization combining explicit constraints with latent regularization to maintain structural integrity.
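The review does not spell out the Joint Geometric Optimization loss; a minimal sketch of what combining an explicit geometric constraint with implicit latent regularization could look like, assuming symmetric Chamfer distance as the explicit term, mean-squared error against a geometry-aware prior as the implicit term, and a hypothetical weight `lam`:

```python
import numpy as np

def chamfer_distance(pred, target):
    # Symmetric Chamfer distance between two point sets, shapes (N, 3) and (M, 3):
    # average nearest-neighbor distance in both directions.
    d = np.linalg.norm(pred[:, None, :] - target[None, :, :], axis=-1)
    return d.min(axis=1).mean() + d.min(axis=0).mean()

def joint_geometric_loss(pred_pts, gt_pts, latent, prior_latent, lam=0.1):
    # Explicit geometric constraint on predicted points plus implicit
    # regularization pulling the latent toward a geometry-aware prior.
    explicit = chamfer_distance(pred_pts, gt_pts)
    implicit = np.mean((latent - prior_latent) ** 2)
    return explicit + lam * implicit
```

The weighting and the choice of Chamfer/MSE are illustrative stand-ins for whatever terms the paper actually uses.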
If this is right
- The unified model outperforms specialist approaches in future point cloud prediction.
- The model also outperforms specialists in 3D scene understanding tasks.
- The approach enables integrated simulation of environmental dynamics that incorporates both semantic interpretation and geometric forecasting.
- The designs bridge the gap between LLM-based reasoning and physical geometry evolution in driving scenes.
Where Pith is reading between the lines
- A single consistent model could let autonomous driving planners generate future scenarios that respect both understood scene semantics and physical geometry rules.
- Reducing the need for separate understanding and generation modules might lower overall system complexity in deployed vehicles.
- The integration pattern could extend to other robotics tasks that require aligned scene comprehension and forward simulation, such as manipulation planning.
Load-bearing premise
The four designs successfully transfer semantic knowledge to geometric prediction and enforce structural integrity without introducing new errors or losing critical information.
What would settle it
Evaluate HERMES++ on standard driving benchmarks such as nuScenes against separate specialist models for 3D understanding and future point cloud prediction; if the unified model underperforms the specialists on either task, the claimed synergy of the designs does not hold.
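The decision rule above can be made concrete: the synergy claim survives only if the unified model matches or beats each specialist on its own metric. A minimal sketch; the task names and the lower-is-better convention for geometric metrics are illustrative, not from the paper:

```python
def unified_model_holds(unified_scores, specialist_scores, lower_is_better):
    # The synergy claim holds only if the unified model matches or beats
    # the specialist baseline on every compared task.
    for task, u in unified_scores.items():
        s = specialist_scores[task]
        ok = u <= s if lower_is_better[task] else u >= s
        if not ok:
            return False
    return True
```

A single task where the unified model loses to its specialist is enough to falsify the claim under this rule.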
Original abstract
Driving world models serve as a pivotal technology for autonomous driving by simulating environmental dynamics. However, existing approaches predominantly focus on future scene generation, often overlooking comprehensive 3D scene understanding. Conversely, while Large Language Models (LLMs) demonstrate impressive reasoning capabilities, they lack the capacity to predict future geometric evolution, creating a significant disparity between semantic interpretation and physical simulation. To bridge this gap, we propose HERMES++, a unified driving world model that integrates 3D scene understanding and future geometry prediction within a single framework. Our approach addresses the distinct requirements of these tasks through synergistic designs. First, a BEV representation consolidates multi-view spatial information into a structure compatible with LLMs. Second, we introduce LLM-enhanced world queries to facilitate knowledge transfer from the understanding branch. Third, a Current-to-Future Link is designed to bridge the temporal gap, conditioning geometric evolution on semantic context. Finally, to enforce structural integrity, we employ a Joint Geometric Optimization strategy that integrates explicit geometric constraints with implicit latent regularization to align internal representations with geometry-aware priors. Extensive evaluations on multiple benchmarks validate the effectiveness of our method. HERMES++ achieves strong performance, outperforming specialist approaches in both future point cloud prediction and 3D scene understanding tasks. The model and code will be publicly released at https://github.com/H-EmbodVis/HERMESV2.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes HERMES++, a unified driving world model integrating 3D scene understanding and future geometry prediction. It introduces four synergistic designs: BEV consolidation to aggregate multi-view information, LLM-enhanced world queries to transfer semantic knowledge, a Current-to-Future Link to condition geometric evolution on semantic context, and a Joint Geometric Optimization strategy combining explicit constraints with latent regularization. The authors present ablation studies isolating each component and report results on nuScenes and Waymo, claiming outperformance over specialist baselines in both 3D detection/segmentation metrics and future point-cloud metrics (CD, EMD). The code and model are to be released publicly.
Significance. If the reported results hold, the work is significant for autonomous driving and 3D vision by addressing the gap between LLM semantic reasoning and geometric simulation in one framework. Strengths include the ablation tables that quantify each design's contribution, direct comparisons against specialist methods on standard benchmarks, and the commitment to public code release, which supports reproducibility.
Minor comments (4)
- Abstract: the claim of 'strong performance' and 'outperforming specialist approaches' is not supported by any quantitative metrics or specific baseline names. Adding one or two key numbers (e.g., CD reduction on nuScenes) would make the summary self-contained.
- Section 3 (Method): the exact injection mechanism for LLM-enhanced world queries into the geometric branch is described at a high level; a short equation or diagram annotation showing how semantic features are fused would improve clarity.
- Section 4 (Experiments): the main results tables compare against baselines, but the paper should explicitly state whether baselines were re-implemented with identical training protocols or taken from original reports, to allow readers to judge fairness.
- Figure captions: several qualitative figures lack dataset name, task (detection vs. prediction), and camera/viewpoint information, making it harder to interpret the visualizations without cross-referencing the text.
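The second comment asks how LLM-enhanced world queries are injected into the geometric branch. The paper's exact mechanism is not specified in this review, but a single-head cross-attention with a residual connection is one plausible reading of "semantic features are fused"; all shapes, the single head, and the residual fusion are assumptions:

```python
import numpy as np

def cross_attend(queries, bev_feats, Wq, Wk, Wv):
    # queries: (Q, d) world queries from the understanding branch;
    # bev_feats: (H*W, d) flattened BEV grid from the geometric branch.
    q = queries @ Wq
    k = bev_feats @ Wk
    v = bev_feats @ Wv
    logits = q @ k.T / np.sqrt(q.shape[-1])
    # Numerically stable softmax over BEV locations.
    attn = np.exp(logits - logits.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)
    # Residual fusion: each query absorbs attended semantic context.
    return queries + attn @ v
```

In a real model the projections would be learned and the attention multi-head; the sketch only fixes the data flow the comment asks the authors to write out.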
Simulated Author's Rebuttal
We thank the referee for the positive summary of HERMES++ and the recommendation for minor revision. The referee accurately captures the paper's core contribution: a unified framework that integrates 3D scene understanding and future geometry prediction through BEV consolidation, LLM-enhanced world queries, the Current-to-Future Link, and Joint Geometric Optimization. We are pleased that the ablation studies, the benchmark comparisons on nuScenes and Waymo, and the commitment to public code release were noted as strengths. We will address each of the four minor comments in revision: adding key quantitative results to the abstract, specifying the injection mechanism for world queries with an equation, stating explicitly whether baselines were re-implemented or taken from original reports, and completing the figure captions with dataset, task, and viewpoint information.
Circularity Check
No significant circularity detected
Full rationale
The paper proposes HERMES++ as an engineering integration of existing components (BEV consolidation, LLM queries, temporal linking, and joint optimization) for unifying scene understanding and future point cloud prediction. No equations, derivations, or first-principles results appear that reduce any claimed prediction or performance metric to quantities defined by the model's own fitted parameters or self-referential definitions. Claims rest on empirical benchmark comparisons (nuScenes, Waymo) and ablation tables that isolate additive contributions, with no load-bearing self-citations, uniqueness theorems, or ansatz smuggling from prior author work. The derivation chain is therefore self-contained as a set of architectural choices validated externally rather than tautological.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: multi-view camera images can be consolidated into a BEV representation that remains compatible with LLM processing.
Invented entities (3)
- LLM-enhanced world queries (no independent evidence)
- Current-to-Future Link (no independent evidence)
- Joint Geometric Optimization strategy (no independent evidence)