Pith · machine review for the scientific record

arxiv: 2605.05092 · v1 · submitted 2026-05-06 · 💻 cs.RO · cs.AI · cs.CV

Recognition: unknown

Driver-WM: A Driver-Centric Traffic-Conditioned Latent World Model for In-Cabin Dynamics Rollout

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 16:47 UTC · model grok-4.3

classification 💻 cs.RO · cs.AI · cs.CV
keywords driver-centric world model · latent world model · in-cabin dynamics rollout · traffic-conditioned forecasting · gated causal injection · vision-language features · shared-control driving · autonomous vehicle safety

The pith

Driver-WM forecasts in-cabin driver dynamics by causally conditioning on out-cabin traffic in a compact latent space.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Driver-WM as a driver-centric latent world model that rolls out future in-cabin dynamics while conditioning them on surrounding traffic states. Existing systems either forecast only the external road scene or perform single-step recognition of driver behavior, but this work unifies multi-step physical kinematics prediction with semantic recognition of actions and emotions. It builds a shared compact latent space from frozen vision-language features, processes traffic and driver streams separately, and links them through a gated mechanism that injects external context into internal state predictions while preserving time order. A sympathetic reader would care because safe shared-control automation depends on anticipating how a human driver will react when control passes between person and machine.
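
As a concrete reading of that pipeline, here is a minimal sketch of how a compact latent space might be built from frozen vision-language features; the backbone interface, feature dimensions, and mean-pooling below are illustrative assumptions, not details taken from the paper.

```python
import torch
import torch.nn as nn

class FrozenVLMLatentEncoder(nn.Module):
    """Project frozen per-frame VLM features into a compact latent stream.

    `vlm_dim` and `latent_dim` are hypothetical values; the abstract only
    states that the latent space is compact and the VLM stays frozen.
    """

    def __init__(self, vlm_backbone: nn.Module, vlm_dim: int = 1024, latent_dim: int = 128):
        super().__init__()
        self.vlm = vlm_backbone
        for p in self.vlm.parameters():      # freeze the backbone; only the projection trains
            p.requires_grad = False
        self.proj = nn.Linear(vlm_dim, latent_dim)

    def forward(self, frames: torch.Tensor) -> torch.Tensor:
        # frames: (B, T, C, H, W); assume the backbone returns token features
        # of shape (B*T, N_tokens, vlm_dim) for a batch of frames.
        B, T = frames.shape[:2]
        with torch.no_grad():
            feats = self.vlm(frames.flatten(0, 1))
        pooled = feats.mean(dim=1)                 # mean-pool tokens per frame
        return self.proj(pooled).view(B, T, -1)    # (B, T, latent_dim)
```

The same encoder would serve both streams: one instance over out-cabin video for traffic latents and one over in-cabin video for driver latents.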

Core claim

Driver-WM is a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality.

What carries the argument

Dual-stream architecture with gated causal injection, which separately encodes traffic and driver states then directionally couples them through a learned vector gate that modulates external context into internal predictions while enforcing temporal causality.
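
A minimal sketch of what such a gated causal injection could look like, assuming a sigmoid vector gate and an additive perturbation of the internal transition; the abstract names the components but not this exact parameterization, so the shapes and the GRU-style transition below are assumptions.

```python
import torch
import torch.nn as nn

class GatedCausalInjection(nn.Module):
    """Directionally couple pooled external history into the internal transition.

    Temporal causality is enforced by pooling only external latents up to
    time t before predicting the internal latent at t+1.
    """

    def __init__(self, latent_dim: int):
        super().__init__()
        self.transition = nn.GRUCell(latent_dim, latent_dim)  # internal dynamics
        self.ctx_proj = nn.Linear(latent_dim, latent_dim)     # external perturbation
        self.gate = nn.Linear(2 * latent_dim, latent_dim)     # learned vector gate

    def forward(self, z_int_t: torch.Tensor, z_ext_hist: torch.Tensor) -> torch.Tensor:
        # z_int_t: (B, D) internal latent at t; z_ext_hist: (B, t+1, D) external latents <= t
        ctx = z_ext_hist.mean(dim=1)                          # pooled external history
        g_t = torch.sigmoid(self.gate(torch.cat([z_int_t, ctx], dim=-1)))
        perturbation = g_t * self.ctx_proj(ctx)               # gated external context
        return self.transition(perturbation, z_int_t)         # predicted z_int at t+1
```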

If this is right

  • Enables robust long-horizon geometric forecasting for reactive high-motion maneuvers.
  • Improves semantic alignment for both driver and traffic states.
  • Supports controlled test-time interventions that systematically analyze how external context affects internal state predictions (see the intervention sketch after this list).
  • Unifies physical kinematics rollout with behavioral and emotional semantic recognition in one model.
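
The intervention bullet above implies a simple test-time protocol. The sketch below shows what that could look like around the injection module sketched earlier; `injection_step`, the `gate_scale` knob, and the clip variables are hypothetical interface assumptions, with `gate_scale=0.0` playing the role of the λ_CA = 0 setting in Figure 3.

```python
import torch

def rollout(model, z_int0, z_ext_hist, horizon, gate_scale=1.0):
    """Autoregressive rollout of internal latents; gate_scale=0.0 disables injection."""
    z, out = z_int0, []
    for t in range(horizon):
        ctx = z_ext_hist[:, : t + 1]                    # external history <= t only
        z = model.injection_step(z, ctx, gate_scale)    # hypothetical method name
        out.append(z)
    return torch.stack(out, dim=1)                      # (B, horizon, D)

# `model`, `z_int0`, and `z_ext_clip_*` are assumed from the earlier sketches.
# Intervention A: same driver state, swapped out-cabin context between two clips.
pred_a = rollout(model, z_int0, z_ext_clip_a, horizon=12)
pred_b = rollout(model, z_int0, z_ext_clip_b, horizon=12)

# Intervention B: disable injection entirely and compare against pred_a.
pred_off = rollout(model, z_int0, z_ext_clip_a, horizon=12, gate_scale=0.0)
```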

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The explicit external-to-internal conditioning could support simulation of driver responses under hypothetical traffic changes not seen in training.
  • Because the architecture separates the two streams before injection, it may allow independent updates to the traffic encoder without retraining the entire driver dynamics component.
  • The approach suggests that similar causal injection patterns could be tested in other in-cabin monitoring domains where external scene context influences human state.

Load-bearing premise

A compact latent space built from frozen vision-language features plus the dual-stream gated injection is sufficient to capture and causally link external traffic context to internal driver dynamics without critical information loss.
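
One cheap way to stress that premise is a linear probe: if the compact latent truly retains kinematic detail, a ridge-regression map from latents to joint coordinates should score low held-out error. The sketch below runs the procedure on synthetic stand-in arrays; real latents and poses would replace them.

```python
import numpy as np

# Synthetic stand-ins: Z would be the compact latents, Y the ground-truth joints.
rng = np.random.default_rng(0)
Z = rng.normal(size=(2048, 128))       # (N, latent_dim)
Y = rng.normal(size=(2048, 23 * 3))    # (N, joints * xyz); joint count is illustrative

# Ridge-regression probe on a train/held-out split.
lam = 1e-2
Ztr, Zte, Ytr, Yte = Z[:1536], Z[1536:], Y[:1536], Y[1536:]
W = np.linalg.solve(Ztr.T @ Ztr + lam * np.eye(Ztr.shape[1]), Ztr.T @ Ytr)
err = np.linalg.norm(Zte @ W - Yte, axis=1).mean()
print(f"held-out probe error: {err:.3f}")  # high error would signal information loss
```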

What would settle it

Ablation experiments on the multi-task assistive driving benchmark in which removing the gated causal injection or unfreezing the vision-language features produces measurable drops in long-horizon geometric accuracy or semantic alignment scores.
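
The geometric half of that test needs a concrete metric. Below is a minimal sketch of horizon-wise MPJPE plus the zero-velocity reference that Figure 3(b) uses; these are the standard definitions, not the paper's evaluation code.

```python
import numpy as np

def mpjpe_per_horizon(pred: np.ndarray, target: np.ndarray) -> np.ndarray:
    """Mean Per-Joint Position Error at each forecast step.

    pred, target: (B, H, J, 3) predicted vs. ground-truth joint positions.
    Returns (H,): mean Euclidean joint error per horizon, so an ablation's
    long-horizon degradation shows up as a widening per-step gap.
    """
    return np.linalg.norm(pred - target, axis=-1).mean(axis=(0, 2))

def zero_velocity_baseline(last_pose: np.ndarray, horizon: int) -> np.ndarray:
    """Repeat the last observed pose: (B, J, 3) -> (B, H, J, 3)."""
    return np.repeat(last_pose[:, None], horizon, axis=1)
```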

Figures

Figures reproduced from arXiv: 2605.05092 by Chen Lv, Daosheng Qiu, Haochen Liu, Haoruo Zhang, Hao Su, Haozhuang Chi, Zirui Li.

Figure 1
Figure 1: The comparison of three paradigms: (a) Regular driver monitoring systems (DMS) for driver-state recognition. (b) Standard world models for future environment forecasting. (c) Driver-WM (ours) that performs multi-step rollout of internal driver dynamics explicitly conditioned on synchronized external traffic observations.
Figure 2
Figure 2: Overall architecture of Driver-WM. From synchronized in/out-cabin videos, a frozen Qwen3-VL extracts dual-stream latent features. Pooled external history $\hat{\bar{Z}}^{\mathrm{ext}}_{\le t}$ perturbs the internal transition via a directed Gated Causal Injection with a vector gate $g_t$, yielding an updated internal latent $\hat{z}^{\mathrm{int}}_{t+1}$. Internal latents are autoregressively rolled out to forecast future states, decoded into skeleton tra…
Figure 3
Figure 3: Mechanism and dynamics. (a) Controlled interventions: on the same clip, swapping the out-cabin context or disabling injection ($\lambda_{\mathrm{CA}} = 0$) alters reactive hand motion; frames are aligned to the maximal injection step. (b) High-motion tail: horizon-wise MPJPE shows the zero-velocity baseline degrades with horizon, while Driver-WM substantially reduces long-horizon error compared to motion-only baselines. Pathway…
Figure 4
Figure 4: Qualitative results of driver dynamics rollout and causal interventions.
Figure 5
Figure 5: Additional post-hoc visualizations with the optional frozen renderer.
Original abstract

Safe L2/L3 driving automation requires anticipating human-in-the-loop reactions during shared-control transitions. While most driving world models forecast the external environment, in-cabin intelligence remains strictly recognition-oriented and lacks multi-step rollout capabilities for driver dynamics. We introduce Driver-WM, a driver-centric latent world model that rolls out in-cabin dynamics causally conditioned on out-cabin traffic context. This formulation unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition. Operating in a compact latent space constructed from frozen vision-language features, Driver-WM adopts a dual-stream architecture to separately encode external traffic and internal driver states. These streams are directionally coupled via a gated causal injection mechanism, which uses a learned vector gate to modulate external contextual perturbations while strictly enforcing temporal causality. Evaluations on a multi-task assistive driving benchmark demonstrate that Driver-WM yields robust long-horizon geometric forecasting for reactive high-motion maneuvers and improves semantic alignment for both driver and traffic states. Finally, the explicit external-to-internal conditioning allows for controlled test-time interventions to systematically analyze mechanism responses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Driver-WM, a driver-centric latent world model for rolling out in-cabin dynamics causally conditioned on external traffic context. It uses a dual-stream architecture operating in a compact latent space derived from frozen vision-language model features, with the streams coupled directionally via a gated causal injection mechanism employing a learned vector gate to modulate external perturbations while enforcing temporal causality. The approach unifies physical kinematics forecasting with auxiliary behavioral and emotional semantic recognition tasks. Evaluations on a multi-task assistive driving benchmark are claimed to demonstrate robust long-horizon geometric forecasting for reactive high-motion maneuvers, improved semantic alignment for driver and traffic states, and support for controlled test-time interventions to analyze mechanism responses.

Significance. If the central claims hold with supporting evidence, the work would advance in-cabin intelligence for shared-control L2/L3 automation by extending world models beyond external forecasting to include multi-step driver dynamics rollout. The explicit external-to-internal conditioning and test-time intervention capability offer interpretability benefits, while the frozen VLM features and causal gating promote efficiency. This addresses a recognized gap in driver-centric modeling, though the significance hinges on demonstrating that the architecture preserves necessary dynamic information without critical loss.

major comments (2)
  1. Abstract: The central claims that Driver-WM 'yields robust long-horizon geometric forecasting for reactive high-motion maneuvers' and 'improves semantic alignment for both driver and traffic states' are presented without any quantitative metrics, error bars, baseline comparisons, ablation results, or details on the multi-task benchmark. This absence prevents verification of whether the dual-stream architecture and gated causal injection deliver the stated performance gains.
  2. Abstract and architecture description: The load-bearing assumption that a compact latent space constructed from frozen vision-language features, combined with gated causal injection, preserves sufficient fine-grained kinematic and geometric information for accurate long-horizon rollout of high-motion driver maneuvers is not justified. VLMs are pretrained on semantic alignment rather than physics-consistent motion prediction, and no explicit recovery mechanism for high-frequency dynamic cues discarded at encoding is described, raising a correctness risk for the robustness claims even if the gating functions as intended.
minor comments (2)
  1. The 'multi-task assistive driving benchmark' is referenced without naming the dataset, specifying the constituent tasks, metrics, data splits, or exclusion rules. These details are required in the experiments section for reproducibility and to allow assessment of the evaluation protocol.
  2. The abstract would be strengthened by including at least one key quantitative result or comparison to ground the performance claims, rather than relying solely on qualitative descriptors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thorough review and valuable feedback on our work. We have carefully considered the major comments and made revisions to the manuscript to improve clarity and provide additional supporting evidence for our claims. Our point-by-point responses are as follows.

Point-by-point responses
  1. Referee: Abstract: The central claims that Driver-WM 'yields robust long-horizon geometric forecasting for reactive high-motion maneuvers' and 'improves semantic alignment for both driver and traffic states' are presented without any quantitative metrics, error bars, baseline comparisons, ablation results, or details on the multi-task benchmark. This absence prevents verification of whether the dual-stream architecture and gated causal injection deliver the stated performance gains.

    Authors: We agree that the abstract would be strengthened by including quantitative support for the claims. In the revised manuscript, we have updated the abstract to incorporate key quantitative metrics from our evaluations on the multi-task assistive driving benchmark, including specific improvements in forecasting error for long-horizon rollouts and semantic alignment scores, along with comparisons to relevant baselines. This revision ensures that the performance gains are verifiable from the abstract itself. revision: yes

  2. Referee: Abstract and architecture description: The load-bearing assumption that a compact latent space constructed from frozen vision-language features, combined with gated causal injection, preserves sufficient fine-grained kinematic and geometric information for accurate long-horizon rollout of high-motion driver maneuvers is not justified. VLMs are pretrained on semantic alignment rather than physics-consistent motion prediction, and no explicit recovery mechanism for high-frequency dynamic cues discarded at encoding is described, raising a correctness risk for the robustness claims even if the gating functions as intended.

    Authors: This is a valid concern regarding the information preservation in the latent space. Although VLMs are pretrained primarily for semantic tasks, our empirical results on the benchmark demonstrate that the encoded features, when processed through the dual-stream architecture and gated causal injection, enable accurate long-horizon geometric forecasting even for high-motion maneuvers. We have revised the architecture description section to provide a more detailed justification, including references to how VLM features capture motion-related information and how the causal gating mechanism helps maintain temporal dynamics. Additionally, we have included ablation studies on the latent space dimensionality to show that the compact representation retains necessary kinematic details without significant loss. While we do not introduce an explicit recovery mechanism for discarded cues, the design choices and experimental validation support the robustness claims. revision: partial

Circularity Check

0 steps flagged

No significant circularity; architecture presented as novel design

Full rationale

The paper describes Driver-WM as a new driver-centric latent world model using frozen vision-language features, a dual-stream architecture, and gated causal injection for external-to-internal conditioning. No equations, derivations, or fitted parameters are shown that reduce any claimed prediction or rollout to its own inputs by construction. The central claims rest on the proposed architecture and benchmark evaluations rather than self-definitional loops, self-citation chains, or renamed known results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the sufficiency of frozen vision-language features for driver and traffic semantics and on the effectiveness of the introduced gated causal injection for enforcing causality; these are introduced without independent prior validation in the abstract.

axioms (1)
  • domain assumption: Frozen vision-language features capture sufficient semantic information for both external traffic and internal driver states
    Model operates in a compact latent space constructed from frozen vision-language features as stated in the abstract.
invented entities (1)
  • Gated causal injection mechanism with learned vector gate (no independent evidence)
    purpose: To directionally couple external traffic stream to internal driver stream while modulating perturbations and strictly enforcing temporal causality
    Described as the key coupling component in the dual-stream architecture

pith-pipeline@v0.9.0 · 5511 in / 1305 out tokens · 47580 ms · 2026-05-08T16:47:19.544373+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 28 canonical work pages · 1 internal anchor

  1. [1] Adeli, V., Ehsanpour, M., Reid, I., Niebles, J.C., Savarese, S., Adeli, E., Rezatofighi, H.: TRiPOD: Human Trajectory and Pose Dynamics Forecasting in the Wild. In: 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 13370–13380. IEEE Computer Society, Los Alamitos, CA, USA (Oct 2021). https://doi.org/10.1109/ICCV48922.2021.01314

  2. [2] Bai, J., Bai, S., Yang, S., Wang, S., Tan, S., Wang, P., Lin, J., Zhou, C., Zhou, J.: Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond (2024). https://openreview.net/forum?id=qrGjFJVl3m

  3. [3] Bai, S., Cai, Y., Chen, R., Chen, K., Chen, X., Cheng, Z., Deng, L., Ding, W., Gao, C., Ge, C., Ge, W., Guo, Z., Huang, Q., Huang, J., Huang, F., Hui, B., Jiang, S., Li, Z., Li, M., Li, M., Li, K., Lin, Z., Lin, J., Liu, X., Liu, J., Liu, C., Liu, Y., Liu, D., Liu, S., Lu, D., Luo, R., Lv, C., Men, R., Meng, L., Ren, X., Ren, X., Song, S., Sun, Y., Tang, …

  4. [4] Chen, L., Wu, P., Chitta, K., Jaeger, B., Geiger, A., Li, H.: End-to-end autonomous driving: Challenges and frontiers. IEEE Trans. Pattern Anal. Mach. Intell. 46(12), 10164–10183 (Dec 2024). https://doi.org/10.1109/TPAMI.2024.3435937

  5. [5] Chen, L., Li, Y., Huang, C., Li, B., Xing, Y., Tian, D., Li, L., Hu, Z., Na, X., Li, Z., Teng, S., Lv, C., Wang, J., Cao, D., Zheng, N., Wang, F.Y.: Milestones in autonomous driving and intelligent vehicles: Survey of surveys. IEEE Transactions on Intelligent Vehicles 8(2), 1046–1056 (2023). https://doi.org/10.1109/TIV.2022.3223131

  6. [6] Chi, H., Yang, H., Yang, L., Lv, C.: VLM-DM: Visual language models for multitask domain adaptation in driver monitoring. In: 2025 IEEE Intelligent Vehicles Symposium (IV), pp. 1280–1285 (2025). https://doi.org/10.1109/IV64158.2025.11097620

  7. [7] Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., Florence, P.: PaLM-E: An embodied multimodal language model. In: Proceedings of the 40th International Conference on Machine Learning. ICML'23, JMLR.org (2023)

  8. [8] Gao, S., Yang, J., Chen, L., Chitta, K., Qiu, Y., Geiger, A., Zhang, J., Li, H.: Vista: A generalizable driving world model with high fidelity and versatile controllability. In: Globerson, A., Mackey, L., Belgrave, D., Fan, A., Paquet, U., Tomczak, J., Zhang, C. (eds.) Advances in Neural Information Processing Systems, vol. 37, pp. 91560–91596. Curran …

  9. [9] Guo, W., Du, Y., Shen, X., Lepetit, V., Alameda-Pineda, X., Moreno-Noguer, F.: Back to MLP: A simple baseline for human motion prediction. In: 2023 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pp. 4798–4808 (2023). https://doi.org/10.1109/WACV56688.2023.00479

  10. [10] Hu, A., Russell, L., Yeo, H., Murez, Z., Fedoseev, G., Kendall, A., Shotton, J., Corrado, G.: GAIA-1: A Generative World Model for Autonomous Driving. arXiv e-prints arXiv:2309.17080 (Sep 2023). https://doi.org/10.48550/arXiv.2309.17080

  11. [11] Hu, Y., Yang, J., Chen, L., Li, K., Sima, C., Zhu, X., Chai, S., Du, S., Lin, T., Wang, W., Lu, L., Jia, X., Liu, Q., Dai, J., Qiao, Y., Li, H.: Planning-oriented autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 17853–17862 (June 2023)

  12. [12] Jain, A., Koppula, H.S., Soh, S., Raghavan, B., Singh, A., Saxena, A.: Brain4Cars: Car that knows before you do via sensory-fusion deep learning architecture (2016). https://arxiv.org/abs/1601.00740

  13. [13] Jiang, B., Chen, S., Xu, Q., Liao, B., Chen, J., Zhou, H., Zhang, Q., Liu, W., Huang, C., Wang, X.: VAD: Vectorized scene representation for efficient autonomous driving. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 8306–8316 (2023). https://doi.org/10.1109/ICCV51070.2023.00766

  14. [14] Kong, L., Yang, W., Mei, J., Liu, Y., Liang, A., Zhu, D., Lu, D., Yin, W., Hu, X., Jia, M., Deng, J., Zhang, K., Wu, Y., Yan, T., Gao, S., Wang, S., Li, L., Pan, L., Liu, Y., Zhu, J., Tsang Ooi, W., Hoi, S.C.H., Liu, Z.: 3D and 4D World Modeling: A Survey. arXiv e-prints arXiv:2509.07996 (Sep 2025). https://doi.org/10.48550/arXiv.2509.07996

  15. [15] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning. ICML'23, JMLR.org (2023)

  16. [16] Li, Y., Fan, L., He, J., Wang, Y., Chen, Y., Zhang, Z., Tan, T.: Enhancing End-to-End Autonomous Driving with Latent World Model. arXiv e-prints arXiv:2406.08481 (Jun 2024). https://doi.org/10.48550/arXiv.2406.08481

  17. [17] Liu, H., Huang, Z., Huang, W., Yang, H., Mo, X., Lv, C.: Hybrid-prediction integrated planning for autonomous driving. IEEE Transactions on Pattern Analysis and Machine Intelligence 47(4), 2597–2614 (2025). https://doi.org/10.1109/TPAMI.2025.3526936

  18. [18] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: Proceedings of the 37th International Conference on Neural Information Processing Systems. NIPS '23, Curran Associates Inc., Red Hook, NY, USA (2023)

  19. [19] Liu, W., Guo, Q., Wang, Z., Wang, W., Yang, L., Qiao, Y., Wang, L., Li, Z., Lv, C., Zhang, S., Xi, J., Liu, H.: UV-M3TL: A unified and versatile multimodal multi-task learning framework for assistive driving perception (2026). https://arxiv.org/abs/2602.01594

  20. [20] Liu, W., Qiao, Y., Wang, Z., Guo, Q., Chen, Z., Zhou, M., Li, X., Wang, L., Li, Z., Liu, H., Wang, W.: TEM3-Learning: Time-efficient multimodal multi-task learning for advanced assistive driving. In: Laugier, C., Renzaglia, A., Atanasov, N., Birchfield, S., Cielniak, G., De Mattos, L., Fiorini, L., Giguere, P., Hashimoto, K., Ibanez-Guzman, …

  21. [21] Liu, W., Wang, W., Qiao, Y., Guo, Q., Zhu, J., Li, P., Chen, Z., Yang, H., Li, Z., Wang, L., Tan, T., Liu, H.: MMTL-UniAD: A Unified Framework for Multimodal and Multi-Task Learning in Assistive Driving Perception. In: 2025 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 6864–6874. IEEE Computer Society, Los Alamitos, CA, USA (…

  22. [22] Martin, M., Roitberg, A., Haurilet, M., Horne, M., Reiß, S., Voit, M., Stiefelhagen, R.: Drive&Act: A multi-modal dataset for fine-grained driver behavior recognition in autonomous vehicles. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 2801–2810 (2019). https://doi.org/10.1109/ICCV.2019.00289

  23. [23] Martinez, J., Black, M.J., Romero, J.: On Human Motion Prediction Using Recurrent Neural Networks. In: 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4674–4683. IEEE Computer Society, Los Alamitos, CA, USA (Jul 2017). https://doi.org/10.1109/CVPR.2017.497

  24. [24] Rahimi, A., Gerard, V., Zablocki, E., Cord, M., Alahi, A.: MAD: Motion appearance decoupling for efficient driving world models (2026). https://arxiv.org/abs/2601.09452

  25. [25] SAE On-Road Automated Vehicle Standards Committee: Taxonomy and definitions for terms related to on-road motor vehicle automated driving systems. SAE Standard J3016 (2014)

  26. [26] Sima, C., Renz, K., Chitta, K., Chen, L., Zhang, H., Xie, C., Beißwenger, J., Luo, P., Geiger, A., Li, H.: DriveLM: Driving with graph visual question answering. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part LII, pp. 256–274. Springer-Verlag, Berlin, Heidelberg (2024). https://doi.o…

  27. [27] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., Fan, Y., Dang, K., Du, M., Ren, X., Men, R., Liu, D., Zhou, C., Zhou, J., Lin, J.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution (2024). https://arxiv.org/abs/2409.12191

  28. [28] Wang, W., Wang, L., Zhang, C., Liu, C., Sun, L.: Social interactions for autonomous driving: A review and perspectives. Found. Trends Robot 10(3–4), 198–376 (Nov 2022). https://doi.org/10.1561/2300000078

  29. [29] Wang, X., Zhu, Z., Huang, G., Chen, X., Zhu, J., Lu, J.: DriveDreamer: Towards real-world-drive world models for autonomous driving. In: Computer Vision – ECCV 2024: 18th European Conference, Milan, Italy, September 29–October 4, 2024, Proceedings, Part XLVIII, pp. 55–72. Springer-Verlag, Berlin, Heidelberg (2024). https://doi.org/10.1007/978-3-031-73195-…

  30. [30] Weaver, B.W., DeLucia, P.R.: A systematic review and meta-analysis of takeover performance during conditionally automated driving. Human Factors 64(7), 1227–1260 (2022)

  31. [31] Xing, Y., Lv, C., Cao, D., Hang, P.: Toward human-vehicle collaboration: Review and perspectives on human-centered collaborative automated driving. Transportation Research Part C: Emerging Technologies 128, 103199 (2021). https://doi.org/10.1016/j.trc.2021.103199

  32. [32] Xing, Y., Lv, C., Wang, H., Cao, D., Velenis, E., Wang, F.Y.: Driver activity recognition for intelligent vehicles: A deep learning approach. IEEE Transactions on Vehicular Technology 68(6), 5379–5390 (2019). https://doi.org/10.1109/TVT.2019.2908425

  33. [33] Yan, S., Xiong, Y., Lin, D.: Spatial temporal graph convolutional networks for skeleton-based action recognition. In: Proceedings of the Thirty-Second AAAI Conference on Artificial Intelligence and Thirtieth Innovative Applications of Artificial Intelligence Conference and Eighth AAAI Symposium on Educational Advances in Artificial Intelligence. AAAI'18/I…

  34. [34] Yang, D., Huang, S., Xu, Z., Li, Z., Wang, S., Li, M., Wang, Y., Liu, Y., Yang, K., Chen, Z., Wang, Y., Liu, J., Zhang, P., Zhai, P., Zhang, L.: AIDE: A vision-driven multi-view, multi-modal, multi-tasking dataset for assistive driving perception. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 20402–20413 (2023). https://doi.org/…

  35. [35] Yang, H., Liu, H., Hu, Z., Nguyen, A.T., Guerra, T.M., Lv, C.: Quantitative identification of driver distraction: A weakly supervised contrastive learning approach. IEEE Transactions on Intelligent Transportation Systems 25(2), 2034–2045 (2024). https://doi.org/10.1109/TITS.2023.3316203

  36. [36] Zhao, G., Ni, C., Wang, X., Zhu, Z., Zhang, X., Wang, Y., Huang, G., Chen, X., Wang, B., Zhang, Y., Mei, W., Wang, X.: DriveDreamer4D: World models are effective data machines for 4D driving scene representation. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 12015–12026 (June 2025)

  37. [37] Zhao, G., Wang, X., Zhu, Z., Chen, X., Huang, G., Bao, X., Wang, X.: DriveDreamer-2: LLM-enhanced world models for diverse driving video generation (2024). https://arxiv.org/abs/2403.06845

  38. [38] Zheng, Y., Yang, P., Xing, Z., Zhang, Q., Zheng, Y., Gao, Y., Li, P., Zhang, T., Xia, Z., Jia, P., Lang, X., Zhao, D.: World4Drive: End-to-end autonomous driving via intention-aware physical latent world model. In: Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 28632–28642 (October 2025)

  39. [39] Zhu, G., Fan, S., Dai, H., Ho, E.S.L.: Waymo-3DSkelMo: A multi-agent 3D skeletal motion dataset for pedestrian interaction modeling in autonomous driving. In: Proceedings of the 33rd ACM International Conference on Multimedia, pp. 13184–13190. MM '25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3746027.3758273

  40. [40] Zhu, W., Ma, X., Liu, Z., Liu, L., Wu, W., Wang, Y.: MotionBERT: A Unified Perspective on Learning Human Motion Representations. In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 15039–15053. IEEE Computer Society, Los Alamitos, CA, USA (Oct 2023). https://doi.org/10.1109/ICCV51070.2023.01385

  41. [41] Zitkovich, B., Yu, T., Xu, S., Xu, P., Xiao, T., Xia, F., Wu, J., Wohlhart, P., Welker, S., Wahid, A., Vuong, Q., Vanhoucke, V., Tran, H., Soricut, R., Singh, A., Singh, J., Sermanet, P., Sanketi, P.R., Salazar, G., Ryoo, M.S., Reymann, K., Rao, K., Pertsch, K., Mordatch, I., Michalewski, H., Lu, Y., Levine, S., Lee, L., Lee, T.W.E., Leal, I., Kuang, Y., …