Pith · machine review for the scientific record

arxiv: 2605.05092 · v1 · submitted 2026-05-06 · 💻 cs.RO · cs.AI · cs.CV

Recognition: unknown

Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:47 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords driver-centric world model · latent world model · in-cabin dynamics rollout · traffic-conditioned forecasting · gated causal injection · vision-language features · shared-control driving · autonomous vehicle safety

The pith

Driver-WM forecasts in-cabin driver dynamics by causally conditioning on out-cabin traffic in a compact latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Driver-WM as a driver-centric latent world model that rolls out future in-cabin dynamics while conditioning them on surrounding traffic states. Existing systems either forecast only the external road scene or perform single-step recognition of driver behavior, but this work unifies multi-step physical kinematics prediction with semantic recognition of actions and emotions. It builds a shared compact latent space from frozen vision-language features, processes traffic and driver streams separately, and links them through a gated mechanism that injects external context into internal state predictions while preserving time order. A sympathetic reader would care because safe shared-control automation depends on anticipating how a human driver will react when control passes between person and machine.
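
As a concrete reading of that pipeline, here is a minimal sketch of how a compact latent space might be built from frozen vision-language features; the backbone interface, feature dimensions, and mean-pooling below are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class FrozenVLMLatentEncoder(nn.Module):
    """Project frozen per-frame VLM features into a compact latent stream.

    `vlm_dim` and `latent_dim` are hypothetical values; the abstract only
    states that the latent space is compact and the VLM stays frozen.
    """

    def __init__(self, vlm_backbone: nn.Module, vlm_dim: int = 1024, latent_dim: int = 128):
        super().__init__()
        self.vlm = vlm_backbone
        for p in self.vlm.parameters():      # freeze the backbone; only the projection trains
            p.requires_grad = False
        self.proj = nn.Linear(vlm_dim, latent_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); assume the backbone returns token features
        # of shape (B*T, N_tokens, vlm_dim) for a batch of frames.
        B, T = frames.shape[:2]
        with torch.no_grad():
            feats = self.vlm(frames.flatten(0, 1))
        pooled = feats.mean(dim=1)                 # mean-pool tokens per frame
        return self.proj(pooled).view(B, T, -1)    # (B, T, latent_dim)
```

The same encoder would serve both streams: one instance over out-cabin video for traffic latents and one over in-cabin video for driver latents.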

Core claim

Driver-WM is a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality.

What carries the argument

Dual-stream architecture with gated causal injection, which separately encodes traffic and driver states then directionally couples them through a learned vector gate that modulates external context into internal predictions while enforcing temporal causality.
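
A minimal sketch of what such a gated causal injection could look like, assuming a sigmoid vector gate and an additive perturbation of the internal transition; the abstract names the components but not this exact parameterization, so the shapes and the GRU-style transition below are assumptions.

```python
import torch
import torch.nn as nn

class GatedCausalInjection(nn.Module):
    """Directionally couple pooled external history into the internal transition.

    Temporal causality is enforced by pooling only external latents up to
    time t before predicting the internal latent at t+1.
    """

    def __init__(self, latent_dim: int):
        super().__init__()
        self.transition = nn.GRUCell(latent_dim, latent_dim)  # internal dynamics
        self.ctx_proj = nn.Linear(latent_dim, latent_dim)     # external perturbation
        self.gate = nn.Linear(2 * latent_dim, latent_dim)     # learned vector gate

    def forward(self, z_int_t: torch.Tensor, z_ext_hist: torch.Tensor) -> torch.Tensor:
        # z_int_t: (B, D) internal latent at t; z_ext_hist: (B, t+1, D) external latents <= t
        ctx = z_ext_hist.mean(dim=1)                          # pooled external history
        g_t = torch.sigmoid(self.gate(torch.cat([z_int_t, ctx], dim=-1)))
        perturbation = g_t * self.ctx_proj(ctx)               # gated external context
        return self.transition(perturbation, z_int_t)         # predicted z_int at t+1
```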

If this is right

  • Enables robust long-horizon geometric forecasting for reactive high-motion maneuvers.
  • Improves semantic alignment for both driver and traffic states.
  • Supports controlled test-time interventions that systematically analyze how external context affects internal state predictions (see the intervention sketch after this list).
  • Unifies physical kinematics rollout with behavioral and emotional semantic recognition in one model.
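
The intervention bullet above implies a simple test-time protocol. The sketch below shows what that could look like around the injection module sketched earlier; `injection_step`, the `gate_scale` knob, and the clip variables are hypothetical interface assumptions, with `gate_scale=0.0` playing the role of the λ_CA = 0 setting in Figure 3.

```python
import torch

def rollout(model, z_int0, z_ext_hist, horizon, gate_scale=1.0):
    """Autoregressive rollout of internal latents; gate_scale=0.0 disables injection."""
    z, out = z_int0, []
    for t in range(horizon):
        ctx = z_ext_hist[:, : t + 1]                    # external history <= t only
        z = model.injection_step(z, ctx, gate_scale)    # hypothetical method name
        out.append(z)
    return torch.stack(out, dim=1)                      # (B, horizon, D)

# `model`, `z_int0`, and `z_ext_clip_*` are assumed from the earlier sketches.
# Intervention A: same driver state, swapped out-cabin context between two clips.
pred_a = rollout(model, z_int0, z_ext_clip_a, horizon=12)
pred_b = rollout(model, z_int0, z_ext_clip_b, horizon=12)

# Intervention B: disable injection entirely and compare against pred_a.
pred_off = rollout(model, z_int0, z_ext_clip_a, horizon=12, gate_scale=0.0)
```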

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit external-to-internal conditioning could support simulation of driver responses under hypothetical traffic changes not seen in training.
  • Because the architecture separates the two streams before injection, it may allow independent updates to the traffic encoder without retraining the entire driver dynamics component.
  • The approach suggests that similar causal injection patterns could be tested in other in-cabin monitoring domains where external scene context influences human state.

Load-bearing premise

A compact latent space built from frozen vision-language features plus the dual-stream gated injection is sufficient to capture and causally link external traffic context to internal driver dynamics without critical information loss.
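
One cheap way to stress that premise is a linear probe: if the compact latent truly retains kinematic detail, a ridge-regression map from latents to joint coordinates should score low held-out error. The sketch below runs the procedure on synthetic stand-in arrays; real latents and poses would replace them.

```python
import numpy as np

# Synthetic stand-ins: Z would be the compact latents, Y the ground-truth joints.
rng = np.random.default_rng(0)
Z = rng.normal(size=(2048, 128))       # (N, latent_dim)
Y = rng.normal(size=(2048, 23 * 3))    # (N, joints * xyz); joint count is illustrative

# Ridge-regression probe on a train/held-out split.
lam = 1e-2
Ztr, Zte, Ytr, Yte = Z[:1536], Z[1536:], Y[:1536], Y[1536:]
W = np.linalg.solve(Ztr.T @ Ztr + lam * np.eye(Ztr.shape[1]), Ztr.T @ Ytr)
err = np.linalg.norm(Zte @ W - Yte, axis=1).mean()
print(f"held-out probe error: {err:.3f}")  # high error would signal information loss
```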

What would settle it

Ablation experiments on the multi-task assistive driving benchmark in which removing the gated causal injection or unfreezing the vision-language features produces measurable drops in long-horizon geometric accuracy or semantic alignment scores.
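
The geometric half of that test needs a concrete metric. Below is a minimal sketch of horizon-wise MPJPE plus the zero-velocity reference that Figure 3(b) uses; these are the standard definitions, not the paper's evaluation code.

```python
import numpy as np

def mpjpe_per_horizon(pred: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Mean Per-Joint Position Error at each forecast step.

    pred, target: (B, H, J, 3) predicted vs. ground-truth joint positions.
    Returns (H,): mean Euclidean joint error per horizon, so an ablation's
    long-horizon degradation shows up as a widening per-step gap.
    """
    return np.linalg.norm(pred - target, axis=-1).mean(axis=(0, 2))

def zero_velocity_baseline(last_pose: np.ndarray, horizon: int) -> np.ndarray:
    """Repeat the last observed pose: (B, J, 3) -> (B, H, J, 3)."""
    return np.repeat(last_pose[:, None], horizon, axis=1)
```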

Figures

Figures reproduced from arXiv: 2605.05092 by Chen Lv, Daosheng Qiu, Haochen Liu, Haoruo Zhang, Hao Su, Haozhuang Chi, Zirui Li.

Figure 1
Figure 1: The comparison of three paradigms: (a) Regular driver monitoring systems (DMS) for driver-state recognition. (b) Standard world models for future environment forecasting. (c) Driver-WM (ours) that performs multi-step rollout of internal driver dynamics explicitly conditioned on synchronized external traffic observations.
Figure 2
Figure 2: Overall architecture of Driver-WM. From synchronized in/out-cabin videos, a frozen Qwen3-VL extracts dual-stream latent features. Pooled external history $\hat{\bar{Z}}^{\mathrm{ext}}_{\le t}$ perturbs the internal transition via a directed Gated Causal Injection with a vector gate $g_t$, yielding an updated internal latent $\hat{z}^{\mathrm{int}}_{t+1}$. Internal latents are autoregressively rolled out to forecast future states, decoded into skeleton tra…
Figure 3
Figure 3: Mechanism and dynamics. (a) Controlled interventions: on the same clip, swapping the out-cabin context or disabling injection ($\lambda_{\mathrm{CA}} = 0$) alters reactive hand motion; frames are aligned to the maximal injection step. (b) High-motion tail: horizon-wise MPJPE shows the zero-velocity baseline degrades with horizon, while Driver-WM substantially reduces long-horizon error compared to motion-only baselines. Pathway…
Figure 4
Figure 4: Qualitative results of driver dynamics rollout and causal interventions.
Figure 5
Figure 5: Additional post-hoc visualizations with the optional frozen renderer.
Original abstract

Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate that Driver-WM yields robust long-horizon geometric forecasting for reactive high-motion maneuvers and improves semantic alignment for both driver and traffic states. Finally, the explicit external-to-internal conditioning allows for controlled test-time interventions to systematically analyze mechanism responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Driver-WM, a driver-centric latent world model for rolling out in-cabin dynamics causally conditioned on external traffic context. It uses a dual-stream architecture operating in a compact latent space derived from frozen vision-language model features, with the streams coupled directionally via a gated causal injection mechanism employing a learned vector gate to modulate external perturbations while enforcing temporal causality. The approach unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition tasks. Evaluations on a multi-task assistive driving benchmark are claimed to demonstrate robust long-horizon geometric forecasting for reactive high-motion maneuvers, improved semantic alignment for driver and traffic states, and support for controlled test-time interventions to analyze mechanism responses.

Significance. If the central claims hold with supporting evidence, the work would advance in-cabin intelligence for shared-control L2/L3 automation by extending world models beyond external forecasting to include multi-step driver dynamics rollout. The explicit external-to-internal conditioning and test-time intervention capability offer interpretability benefits, while the frozen VLM features and causal gating promote efficiency. This addresses a recognized gap in driver-centric modeling, though the significance hinges on demonstrating that the architecture preserves necessary dynamic information without critical loss.

major comments (2)
  1. Abstract: The central claims that Driver-WM 'yields robust long-horizon geometric forecasting for reactive high-motion maneuvers' and 'improves semantic alignment for both driver and traffic states' are presented without any quantitative metrics, error bars, baseline comparisons, ablation results, or details on the multi-task benchmark. This absence prevents verification of whether the dual-stream architecture and gated causal injection deliver the stated performance gains.
  2. Abstract and architecture description: The load-bearing assumption that a compact latent space constructed from frozen vision-language features, combined with gated causal injection, preserves sufficient fine-grained kinematic and geometric information for accurate long-horizon rollout of high-motion driver maneuvers is not justified. VLMs are pretrained on semantic alignment rather than physics-consistent motion prediction, and no explicit recovery mechanism for high-frequency dynamic cues discarded at encoding is described, raising a correctness risk for the robustness claims even if the gating functions as intended.
minor comments (2)
  1. The 'multi-task assistive driving benchmark' is referenced without naming the dataset, specifying the constituent tasks, metrics, data splits, or exclusion rules. These details are required in the experiments section for reproducibility and to allow assessment of the evaluation protocol.
  2. The abstract would be strengthened by including at least one key quantitative result or comparison to ground the performance claims, rather than relying solely on qualitative descriptors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our work. We have carefully considered the major comments and made revisions to the manuscript to improve clarity and provide additional supporting evidence for our claims. Our point-by-point responses are as follows.

Point-by-point responses
  1. Referee: Abstract: The central claims that Driver-WM 'yields robust long-horizon geometric forecasting for reactive high-motion maneuvers' and 'improves semantic alignment for both driver and traffic states' are presented without any quantitative metrics, error bars, baseline comparisons, ablation results, or details on the multi-task benchmark. This absence prevents verification of whether the dual-stream architecture and gated causal injection deliver the stated performance gains.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised manuscript, we have updated the abstract to incorporate key quantitative metrics from our evaluations on the multi-task assistive driving benchmark, including specific improvements in forecasting error for long-horizon rollouts and semantic alignment scores, along with comparisons to relevant baselines. This revision ensures that the performance gains are verifiable from the abstract itself. revision: yes

  2. Referee: Abstract and architecture description: The load-bearing assumption that a compact latent space constructed from frozen vision-language features, combined with gated causal injection, preserves sufficient fine-grained kinematic and geometric information for accurate long-horizon rollout of high-motion driver maneuvers is not justified. VLMs are pretrained on semantic alignment rather than physics-consistent motion prediction, and no explicit recovery mechanism for high-frequency dynamic cues discarded at encoding is described, raising a correctness risk for the robustness claims even if the gating functions as intended.

    Authors: This is a valid concern regarding the information preservation in the latent space. Although VLMs are pretrained primarily for semantic tasks, our empirical results on the benchmark demonstrate that the encoded features, when processed through the dual-stream architecture and gated causal injection, enable accurate long-horizon geometric forecasting even for high-motion maneuvers. We have revised the architecture description section to provide a more detailed justification, including references to how VLM features capture motion-related information and how the causal gating mechanism helps maintain temporal dynamics. Additionally, we have included ablation studies on the latent space dimensionality to show that the compact representation retains necessary kinematic details without significant loss. While we do not introduce an explicit recovery mechanism for discarded cues, the design choices and experimental validation support the robustness claims. revision: partial

Circularity Check

0 steps flagged

No significant circularity; architecture presented as novel design

Full rationale

The paper describes Driver-WM as a new driver-centric latent world model using frozen vision-language features, a dual-stream architecture, and gated causal injection for external-to-internal conditioning. No equations, derivations, or fitted parameters are shown that reduce any claimed prediction or rollout to its own inputs by construction. The central claims rest on the proposed architecture and benchmark evaluations rather than self-definitional loops, self-citation chains, or renamed known results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the sufficiency of frozen vision-language features for driver and traffic semantics and on the effectiveness of the introduced gated causal injection for enforcing causality; these are introduced without independent prior validation in the abstract.

axioms (1)
  • domain assumption: Frozen vision-language features capture sufficient semantic information for both external traffic and internal driver states
    Model operates in a compact latent space constructed from frozen vision-language features as stated in the abstract.
invented entities (1)
  • Gated causal injection mechanism with learned vector gate (no independent evidence)
    purpose: To directionally couple external traffic stream to internal driver stream while modulating perturbations and strictly enforcing temporal causality
    Described as the key coupling component in the dual-stream architecture

pith-pipeline@v0.9.0 · 5511 in / 1305 out tokens · 47580 ms · 2026-05-08T16:47:19.544373+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1] Adeli, V., Ehsanpour, M., Reid, I., Niebles, J.C., Savarese, S., Adeli, E., Rezatofighi, H.: TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13370–13380. IEEE Computer Society, Los Alamitos, CA, USA (Oct 2021). https://doi.org/10.1109/ICCV48922.2021.01314

  2. [2] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond (2024). https://openreview.net/forum?id=qrGjFJVl3m

  3. [3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, …

  4. [4] Chen, L., Wu, P., Chitta, K., Jaeger, B., Geiger, A., Li, H.: End-to-end autonomous driving: Challenges and frontiers. IEEE Trans. Pattern Anal. Mach. Intell. 46(12), 10164–10183 (Dec 2024). https://doi.org/10.1109/TPAMI.2024.3435937

  5. [5] Chen, L., Li, Y., Huang, C., Li, B., Xing, Y., Tian, D., Li, L., Hu, Z., Na, X., Li, Z., Teng, S., Lv, C., Wang, J., Cao, D., Zheng, N., Wang, F.Y.: Milestones in autonomous driving and intelligent vehicles: Survey of surveys. IEEE Transactions on Intelligent Vehicles 8(2), 1046–1056 (2023). https://doi.org/10.1109/TIV.2022.3223131

  6. [6] Chi, H., Yang, H., Yang, L., Lv, C.: VLM-DM: Visual language models for multitask domain adaptation in driver monitoring. In: 2025 IEEE Intelligent Vehicles Symposium (IV), pp. 1280–1285 (2025). https://doi.org/10.1109/IV64158.2025.11097620

  7. [7] Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., Florence, P.: PaLM-E: An embodied multimodal language model. In: Proceedings of the 40th International Conference on Machine Learning. ICML'23, JMLR.org (2023)

  8. [8] Gao, S., Yang, J., Chen, L., Chitta, K., Qiu, Y., Geiger, A., Zhang, J., Li, H.: Vista: A generalizable driving world model with high fidelity and versatile controllability. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems, vol. 37, pp. 91560–91596. Curran …

  9. [9] Guo, W., Du, Y., Shen, X., Lepetit, V., Alameda-Pineda, X., Moreno-Noguer, F.: Back to MLP: A simple baseline for human motion prediction. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 4798–4808 (2023). https://doi.org/10.1109/WACV56688.2023.00479

  10. [10] Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: GAIA-1: A Generative World Model for Autonomous Driving. arXiv e-prints arXiv:2309.17080 (Sep 2023). https://doi.org/10.48550/arXiv.2309.17080

  11. [11] Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., Lu, L., Jia, X., Liu, Q., Dai, J., Qiao, Y., Li, H.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17853–17862 (June 2023)

  12. [12] Jain, A., Koppula, H.S., Soh, S., Raghavan, B., Singh, A., Saxena, A.: Brain4Cars: Car that knows before you do via sensory-fusion deep learning architecture (2016). https://arxiv.org/abs/1601.00740

  13. [13] Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang, C., Wang, X.: VAD: Vectorized scene representation for efficient autonomous driving. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8306–8316 (2023). https://doi.org/10.1109/ICCV51070.2023.00766

  14. [14] Kong, L., Yang, W., Mei, J., Liu, Y., Liang, A., Zhu, D., Lu, D., Yin, W., Hu, X., Jia, M., Deng, J., Zhang, K., Wu, Y., Yan, T., Gao, S., Wang, S., Li, L., Pan, L., Liu, Y., Zhu, J., Tsang Ooi, W., Hoi, S.C.H., Liu, Z.: 3D and 4D World Modeling: A Survey. arXiv e-prints arXiv:2509.07996 (Sep 2025). https://doi.org/10.48550/arXiv.2509.07996

  15. [15] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning. ICML'23, JMLR.org (2023)

  16. [16] Li, Y., Fan, L., He, J., Wang, Y., Chen, Y., Zhang, Z., Tan, T.: Enhancing End-to-End Autonomous Driving with Latent World Model. arXiv e-prints arXiv:2406.08481 (Jun 2024). https://doi.org/10.48550/arXiv.2406.08481

  17. [17] Liu, H., Huang, Z., Huang, W., Yang, H., Mo, X., Lv, C.: Hybrid-prediction integrated planning for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(4), 2597–2614 (2025). https://doi.org/10.1109/TPAMI.2025.3526936

  18. [18] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS '23, Curran Associates Inc., Red Hook, NY, USA (2023)

  19. [19] Liu, W., Guo, Q., Wang, Z., Wang, W., Yang, L., Qiao, Y., Wang, L., Li, Z., Lv, C., Zhang, S., Xi, J., Liu, H.: UV-M3TL: A unified and versatile multimodal multi-task learning framework for assistive driving perception (2026). https://arxiv.org/abs/2602.01594

  20. [20] Liu, W., Qiao, Y., Wang, Z., Guo, Q., Chen, Z., Zhou, M., Li, X., Wang, L., Li, Z., Liu, H., Wang, W.: TEM3-Learning: Time-efficient multimodal multi-task learning for advanced assistive driving. In: Laugier, C., Renzaglia, A., Atanasov, N., Birchfield, S., Cielniak, G., De Mattos, L., Fiorini, L., Giguere, P., Hashimoto, K., Ibanez-Guzman, …

  21. [21] Liu, W., Wang, W., Qiao, Y., Guo, Q., Zhu, J., Li, P., Chen, Z., Yang, H., Li, Z., Wang, L., Tan, T., Liu, H.: MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6864–6874. IEEE Computer Society, Los Alamitos, CA, USA (…

  22. [22] Martin, M., Roitberg, A., Haurilet, M., Horne, M., Reiß, S., Voit, M., Stiefelhagen, R.: Drive&Act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2801–2810 (2019). https://doi.org/10.1109/ICCV.2019.00289

  23. [23] Martinez, J., Black, M.J., Romero, J.: On Human Motion Prediction Using Recurrent Neural Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4674–4683. IEEE Computer Society, Los Alamitos, CA, USA (Jul 2017). https://doi.org/10.1109/CVPR.2017.497

  24. [24] Rahimi, A., Gerard, V., Zablocki, E., Cord, M., Alahi, A.: MAD: Motion appearance decoupling for efficient driving world models (2026). https://arxiv.org/abs/2601.09452

  25. [25] SAE On-Road Automated Vehicle Standards Committee: Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems. SAE Standard J3016 (2014)

  26. [26] Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: DriveLM: Driving with graph visual question answering. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LII, pp. 256–274. Springer-Verlag, Berlin, Heidelberg (2024). https://doi.o…

  27. [27] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution (2024). https://arxiv.org/abs/2409.12191

  28. [28] Wang, W., Wang, L., Zhang, C., Liu, C., Sun, L.: Social interactions for autonomous driving: A review and perspectives. Found. Trends Robot 10(3–4), 198–376 (Nov 2022). https://doi.org/10.1561/2300000078

  29. [29] Wang, X., Zhu, Z., Huang, G., Chen, X., Zhu, J., Lu, J.: DriveDreamer: Towards real-world-drive world models for autonomous driving. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLVIII, pp. 55–72. Springer-Verlag, Berlin, Heidelberg (2024). https://doi.org/10.1007/978-3-031-73195-…

  30. [30] Weaver, B.W., DeLucia, P.R.: A systematic review and meta-analysis of takeover performance during conditionally automated driving. Human Factors 64(7), 1227–1260 (2022)

  31. [31] Xing, Y., Lv, C., Cao, D., Hang, P.: Toward human-vehicle collaboration: Review and perspectives on human-centered collaborative automated driving. Transportation Research Part C: Emerging Technologies 128, 103199 (2021). https://doi.org/10.1016/j.trc.2021.103199

  32. [32] Xing, Y., Lv, C., Wang, H., Cao, D., Velenis, E., Wang, F.Y.: Driver activity recognition for intelligent vehicles: A deep learning approach. IEEE Transactions on Vehicular Technology 68(6), 5379–5390 (2019). https://doi.org/10.1109/TVT.2019.2908425

  33. [33] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI'18/I…

  34. [34] Yang, D., Huang, S., Xu, Z., Li, Z., Wang, S., Li, M., Wang, Y., Liu, Y., Yang, K., Chen, Z., Wang, Y., Liu, J., Zhang, P., Zhai, P., Zhang, L.: AIDE: A vision-driven multi-view, multi-modal, multi-tasking dataset for assistive driving perception. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20402–20413 (2023). https://doi.org/…

  35. [35] Yang, H., Liu, H., Hu, Z., Nguyen, A.T., Guerra, T.M., Lv, C.: Quantitative identification of driver distraction: A weakly supervised contrastive learning approach. IEEE Transactions on Intelligent Transportation Systems 25(2), 2034–2045 (2024). https://doi.org/10.1109/TITS.2023.3316203

  36. [36] Zhao, G., Ni, C., Wang, X., Zhu, Z., Zhang, X., Wang, Y., Huang, G., Chen, X., Wang, B., Zhang, Y., Mei, W., Wang, X.: DriveDreamer4D: World models are effective data machines for 4D driving scene representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12015–12026 (June 2025)

  37. [37] Zhao, G., Wang, X., Zhu, Z., Chen, X., Huang, G., Bao, X., Wang, X.: DriveDreamer-2: LLM-enhanced world models for diverse driving video generation (2024). https://arxiv.org/abs/2403.06845

  38. [38] Zheng, Y., Yang, P., Xing, Z., Zhang, Q., Zheng, Y., Gao, Y., Li, P., Zhang, T., Xia, Z., Jia, P., Lang, X., Zhao, D.: World4Drive: End-to-end autonomous driving via intention-aware physical latent world model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 28632–28642 (October 2025)

  39. [39] Zhu, G., Fan, S., Dai, H., Ho, E.S.L.: Waymo-3DSkelMo: A multi-agent 3D skeletal motion dataset for pedestrian interaction modeling in autonomous driving. In: Proceedings of the 33rd ACM International Conference on Multimedia, pp. 13184–13190. MM '25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3746027.3758273

  40. [40] Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: MotionBERT: A Unified Perspective on Learning Human Motion Representations. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15039–15053. IEEE Computer Society, Los Alamitos, CA, USA (Oct 2023). https://doi.org/10.1109/ICCV51070.2023.01385

  41. [41] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.W.E., Leal, I., Kuang, Y., …