
arxiv: 2606.26922 · v1 · pith:KSWWEQNTnew · submitted 2026-06-25 · 💻 cs.RO · cs.AI

Risk-Aware Selective Multimodal Driver Monitoring with Driver-State World Modeling

Pith reviewed 2026-06-26 04:37 UTC · model grok-4.3

classification 💻 cs.RO cs.AI
keywords driver monitoring · selective inference · multimodal fusion · risk-aware control · automated vehicles · physiological signals · world modeling

The pith

A cost-aware gate lets a fast RGB-physiological model abstain on uncertain driver states, cutting unsafe false negatives from 17.37% to about 5% at deployment latency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper builds a selective inference system for in-cabin driver monitoring that pairs a lightweight multimodal student with a learned gate. The student fuses cabin video with heart-rate and electrodermal signals to classify driver demand. The gate uses per-sample scores to accept the fast output or trigger safety intervention instead of always running a slower large model. Experiments show the combination lowers missed unsafe states while preserving low latency. A separate world-modeling module is added to forecast future errors and action costs, though it reveals remaining calibration problems across driver groups.

Core claim

Cost-aware selective inference with an RGB-physiological student and a learned gate reduces unsafe false negatives from 17.37% under always-fast inference to approximately 5% across seeds while keeping latency at deployment level (3.08 ms for the student); the student itself reaches 0.7440 Macro-F1 on scenario-induced driver-demand recognition.

What carries the argument

The learned gate that decides per-sample whether to accept the fast RGB-physiological prediction or abstain for safety intervention, using scores that contain information beyond scenario priors.
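
One way to picture such a gate is as an expected-cost rule in the spirit of a classical reject option. The sketch below is illustrative only: the paper's gate is learned, and the `Costs` class, the `gate_decision` function, and every cost value here are hypothetical choices made for this example. It assumes a calibrated probability `p_high` that the driver is in a high-demand state; the action names `accept_fast` and `abstain_warn` follow the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Costs:
    # Hypothetical cost values, not taken from the paper.
    false_negative: float = 10.0  # accepting "low" when demand is high (unsafe)
    false_positive: float = 1.0   # accepting "high" when demand is low
    abstain: float = 0.5          # issuing a conservative warning instead


def gate_decision(p_high: float, costs: Costs = Costs()) -> Tuple[str, Optional[str]]:
    """Return (action, label) minimizing expected cost for one sample."""
    if not 0.0 <= p_high <= 1.0:
        raise ValueError("p_high must lie in [0, 1]")

    # Expected cost of committing to each label of the fast model.
    cost_accept_low = p_high * costs.false_negative
    cost_accept_high = (1.0 - p_high) * costs.false_positive

    if cost_accept_low <= cost_accept_high:
        label, cost_accept = "low", cost_accept_low
    else:
        label, cost_accept = "high", cost_accept_high

    # Abstain only when it is strictly cheaper than the best committed label.
    if costs.abstain < cost_accept:
        return "abstain_warn", None
    return "accept_fast", label
```

With these example costs, the rule accepts "low" only when `p_high` is at most 0.05 and accepts "high" only when `p_high` is at least 0.5, so the asymmetric false-negative cost widens the abstention band on the side where a miss would be unsafe.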

If this is right

  • The RGB-physiological student improves over single-modality baselines to 0.7440 Macro-F1 and 0.9099 balanced accuracy with 11.39 M parameters.
  • Cost-aware selection keeps overall system latency at deployment levels while lowering the unsafe error rate.
  • Driver-state world modeling supplies predictive signals for future model errors and counterfactual costs.
  • Worst-group evaluations still show operating-point calibration drift even with the added predictive module.
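
The two headline metrics in the first bullet measure different things, and a small worked example may help when reading them side by side. The sketch below uses the standard definitions (Macro-F1 as the unweighted mean of per-class F1, balanced accuracy as the unweighted mean of per-class recall); the function names and the toy labels are assumptions for illustration and do not come from the paper.

```python
def per_class_counts(y_true, y_pred):
    """Count true positives, false positives, and false negatives per class."""
    classes = sorted(set(y_true) | set(y_pred))
    counts = {c: {"tp": 0, "fp": 0, "fn": 0} for c in classes}
    for t, p in zip(y_true, y_pred):
        if t == p:
            counts[t]["tp"] += 1
        else:
            counts[p]["fp"] += 1
            counts[t]["fn"] += 1
    return counts


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 = 2TP / (2TP + FP + FN)."""
    f1s = []
    for c in per_class_counts(y_true, y_pred).values():
        denom = 2 * c["tp"] + c["fp"] + c["fn"]
        f1s.append(2 * c["tp"] / denom if denom else 0.0)
    return sum(f1s) / len(f1s)


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over classes present in y_true."""
    recalls = []
    for c in per_class_counts(y_true, y_pred).values():
        support = c["tp"] + c["fn"]
        if support:
            recalls.append(c["tp"] / support)
    return sum(recalls) / len(recalls)


# Toy imbalanced example: 8 low-demand (0) and 2 high-demand (1) windows.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 6 + [1] * 2 + [1] * 2
scores = (macro_f1(y_true, y_pred), balanced_accuracy(y_true, y_pred))
```

Because F1 includes precision and balanced accuracy does not, false alarms on the majority class pull Macro-F1 down while leaving the minority-class recall untouched, so the two numbers can differ noticeably on imbalanced data.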

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same gate-plus-student pattern could be tested in other latency-critical safety settings such as pedestrian detection or medical alarm systems.
  • Improving physiological signal alignment across sensors would directly raise the upper bound on student accuracy.
  • Group-robust calibration techniques would need to be added before the system can be deployed across varied driver populations.

Load-bearing premise

The gate can reliably read sample-level signals to choose abstention without creating new safety risks, and the physiological signals stay synchronized enough for the student model to work.

What would settle it

A controlled test on the same driver-demand scenarios in which turning on the learned gate either raises the overall unsafe false-negative rate above the always-fast baseline or produces new false positives that trigger unnecessary interventions at higher total cost.
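
A comparison of this kind could be scored roughly as follows. This is a minimal sketch under stated assumptions: labels are binary (1 = high demand), the unsafe false-negative rate is taken over truly high-demand samples (the summary does not state the paper's denominator), and the `evaluate` function and its cost values are hypothetical.

```python
def evaluate(samples, accept, fn_cost=10.0, fp_cost=1.0, abstain_cost=0.5):
    """Score a selective policy.

    samples: list of (y_true, y_fast) pairs, with 1 = high demand.
    accept:  list of bools; True means the fast prediction is accepted,
             False means the gate abstains and an intervention is triggered.
    Cost values are illustrative assumptions, not the paper's.
    """
    if not samples or len(samples) != len(accept):
        raise ValueError("samples and accept must be non-empty and equal length")

    n_high = sum(1 for y_true, _ in samples if y_true == 1)
    unsafe_fn = 0
    total_cost = 0.0
    for (y_true, y_fast), accepted in zip(samples, accept):
        if not accepted:
            total_cost += abstain_cost      # intervention, needed or not
        elif y_true == 1 and y_fast == 0:
            unsafe_fn += 1                  # accepted miss of a high-demand state
            total_cost += fn_cost
        elif y_true == 0 and y_fast == 1:
            total_cost += fp_cost           # accepted false alarm

    return {
        "unsafe_fn_rate": unsafe_fn / n_high if n_high else 0.0,
        "total_cost": total_cost,
        "coverage": sum(1 for a in accept if a) / len(samples),
    }


samples = [(1, 0), (1, 1), (0, 0), (0, 1), (1, 0), (0, 0)]
always_fast = evaluate(samples, [True] * len(samples))
gated = evaluate(samples, [True if i in (1, 2, 3) else False for i in range(6)])
```

The gate would fail the test described above if, on held-out data, `gated["unsafe_fn_rate"]` exceeded the always-fast value or `gated["total_cost"]` came out higher once unnecessary interventions are charged.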

Figures

Figures reproduced from arXiv: 2606.26922 by Daosheng Qiu, Hao Su, Haozhuang Chi, Shu Long, Wei Zhang, Xinyue Miao, Yongle Dong.

Figure 1: Overview of the proposed framework. The system continuously processes RGB frames and window-level HR/EDA signals through a lightweight fast student. Instead of forcing a mandatory classification, a learned cost-aware gate evaluates instantaneous reliability and predictive evidence from a compact driver-state world modeling module. The gate then decides whether to accept_fast, abstain_warn, slow_replace, or…
Figure 2: Selective confusion matrix. Comparison between always-fast inference and the learned cost-aware gate. By explicitly optimizing for asymmetric risk, the learned gate successfully redistributes safety-critical errors (Unsafe FNs, red) into conservative or positive abstentions (orange), while simultaneously increasing the number of correctly accepted high-demand states.
Figure 3: Qualitative comparison of modality contributions. Case A shows visual cues are sufficient. Case B demonstrates the necessity of multimodal fusion: when visual features are ambiguous (head down), physiological dynamics (EDA/HR drop) successfully recover the true High demand state. Case C shows a failure case where both modalities fail to capture the state change.
Figure 5: Calibration, deployment frontier, and matched-coverage safety be…
Original abstract

Continuous driver monitoring in automated vehicles requires low-latency inference while avoiding unsafe decisions under uncertain driver states. Large vision-language models provide broad multimodal priors, but their latency and limited reliability in this setting make them unsuitable as always-on in-cabin monitors. We propose a cost-aware selective inference framework for deployable multimodal driver monitoring. The core system is a lightweight RGB-physiological student that combines in-cabin visual observations with window-level HR/EDA signals, and a learned gate that decides when to accept the fast prediction or abstain for safety intervention. Additional controls show that the learned scores contain sample-level information beyond scenario priors, while exact physiological synchronization remains a limitation. To incorporate predictive evidence, we further study a compact driver-state world modeling module that rolls out latent driver-state features and estimates future fast-model errors and counterfactual system-level action costs. On scenario-induced driver-demand recognition, the RGB-physiological student improves over RGB-only and physiology-only baselines, reaching 0.7440 Macro-F1 and 0.9099 balanced accuracy with 11.39M parameters and 3.08ms inference latency. Cost-aware selective inference reduces unsafe false negatives from 17.37% under always-fast inference to approximately 5% across seeds, while maintaining deployment-level latency. While driver-state world modeling offers valuable predictive signals, worst-group evaluations highlight persistent operating-point calibration drift. Ultimately, reliable edge driver monitoring requires advancing not only perception backbones, but also risk-aware selective control and group-robust calibration.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes a cost-aware selective inference framework for low-latency multimodal driver monitoring in automated vehicles. A lightweight RGB-physiological student model fuses in-cabin RGB observations with window-level HR/EDA signals to reach 0.7440 Macro-F1 and 0.9099 balanced accuracy (11.39M parameters, 3.08ms latency). A learned gate decides between accepting the fast prediction or abstaining for safety intervention, reducing unsafe false negatives from 17.37% (always-fast inference) to approximately 5% across seeds while preserving deployment latency. A compact driver-state world modeling module is studied for rolling out latent features and estimating future errors and counterfactual costs. The work explicitly notes that exact physiological synchronization remains a limitation and that worst-group evaluations show persistent calibration drift.

Significance. If the reported safety gains hold under realistic synchronization conditions, the selective-inference approach could meaningfully improve the reliability of edge-deployed driver monitoring by trading off latency against risk without always invoking heavy models. The explicit discussion of synchronization as a limitation and the inclusion of world-modeling for predictive cost estimation are constructive contributions. The controls demonstrating sample-level information in the gate scores beyond scenario priors strengthen the case for learned abstention.

major comments (2)
  1. [Abstract] The central safety claim, that cost-aware selective inference reduces unsafe false negatives from 17.37% to ~5%, rests on the RGB-physiological student achieving 0.7440 Macro-F1. The same paragraph states that "exact physiological synchronization remains a limitation," which directly undermines confidence that the reported fusion performance (and therefore the gate's effectiveness) would be realized in deployment, where HR/EDA signals may exhibit temporal offsets relative to RGB frames.
  2. [Abstract] The manuscript reports that "additional controls show that the learned scores contain sample-level information beyond scenario priors," yet provides no quantitative details on the control experiments, ablation results, or statistical tests supporting this claim. This information is load-bearing for validating that the gate is not merely learning scenario-level priors.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential significance of the selective-inference approach. We address each major comment below.

Point-by-point responses
  1. Referee: [Abstract] The central safety claim, that cost-aware selective inference reduces unsafe false negatives from 17.37% to ~5%, rests on the RGB-physiological student achieving 0.7440 Macro-F1. The same paragraph states that "exact physiological synchronization remains a limitation," which directly undermines confidence that the reported fusion performance (and therefore the gate's effectiveness) would be realized in deployment, where HR/EDA signals may exhibit temporal offsets relative to RGB frames.

    Authors: The reported performance and safety gains are measured under the experimental condition in which physiological signals are aligned to RGB frames at the window level. The explicit statement that exact synchronization remains a limitation accurately flags that temporal offsets in real deployments could reduce fusion effectiveness and thereby weaken the gate. We will revise the abstract to qualify the results as holding for synchronized inputs and to note the implications for deployment. revision: yes
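
The synchronization concern can be probed by deliberately shifting the physiological window relative to the RGB timestamp and re-running the evaluation. The sketch below shows only the windowing step; the `window_mean` helper, its parameters, and the sample values are hypothetical and are not the paper's preprocessing.

```python
from bisect import bisect_right


def window_mean(timestamps, values, t_end, window_s, offset_s=0.0):
    """Mean of samples whose timestamp lies in (t_end + offset - window, t_end + offset].

    timestamps must be sorted ascending. offset_s simulates a clock offset
    between the physiological sensor and the camera. Returns None if the
    shifted window contains no samples.
    """
    hi = t_end + offset_s
    lo = hi - window_s
    i = bisect_right(timestamps, lo)  # first index with timestamp > lo
    j = bisect_right(timestamps, hi)  # first index with timestamp > hi
    if i == j:
        return None
    return sum(values[i:j]) / (j - i)


# Illustrative heart-rate samples at 1 Hz (values are made up).
ts = [0, 1, 2, 3, 4, 5]
hr = [60, 62, 64, 70, 80, 90]
aligned = window_mean(ts, hr, t_end=3, window_s=2)
shifted = window_mean(ts, hr, t_end=3, window_s=2, offset_s=2)
```

Sweeping `offset_s` over a plausible range and plotting student accuracy and gated unsafe false negatives against it would show how quickly the reported gains degrade as alignment slips.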

  2. Referee: [Abstract] The manuscript reports that "additional controls show that the learned scores contain sample-level information beyond scenario priors," yet provides no quantitative details on the control experiments, ablation results, or statistical tests supporting this claim. This information is load-bearing for validating that the gate is not merely learning scenario-level priors.

    Authors: Quantitative results from the control experiments (ablation of scenario conditioning and statistical tests) appear in Section 4.3. To make the supporting evidence immediately visible in the summary, we will insert a concise statement of the key quantitative findings into the abstract. revision: yes
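
One generic control for this question is to check whether gate scores still rank fast-model errors within a single scenario, where a scenario-level prior is constant by construction. The sketch below is an assumption about how such a control could look, not a description of the experiments in Section 4.3; the function names and toy data are hypothetical.

```python
from collections import defaultdict


def auroc(scores, labels):
    """Pairwise AUROC; ties count as 0.5. Returns None if a class is missing."""
    pos = [s for s, l in zip(scores, labels) if l == 1]
    neg = [s for s, l in zip(scores, labels) if l == 0]
    if not pos or not neg:
        return None
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def within_scenario_auroc(scores, errors, scenarios):
    """AUROC of scores for predicting fast-model errors, per scenario."""
    groups = defaultdict(list)
    for s, e, g in zip(scores, errors, scenarios):
        groups[g].append((s, e))
    return {
        g: auroc([s for s, _ in items], [e for _, e in items])
        for g, items in groups.items()
    }


# Toy case: scores depend only on the scenario, so they carry no
# sample-level information even though the pooled AUROC looks informative.
scores = [0.2, 0.2, 0.2, 0.8, 0.8, 0.8]
errors = [0, 0, 1, 0, 1, 1]          # 1 = fast model was wrong
scenarios = ["a", "a", "a", "b", "b", "b"]
pooled = auroc(scores, errors)
per_scenario = within_scenario_auroc(scores, errors, scenarios)
```

In this toy case the pooled AUROC is above chance while every within-scenario AUROC sits at 0.5, which is the signature of a score that has only learned scenario priors; within-scenario values clearly above 0.5 would support the sample-level claim.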

Circularity Check

0 steps flagged

No circularity: empirical ML results with no self-referential derivations

Full rationale

The paper presents an empirical framework for cost-aware selective multimodal driver monitoring, reporting experimental metrics such as 0.7440 Macro-F1 for the RGB-physiological student, reduction of unsafe false negatives from 17.37% to ~5%, and evaluations of a driver-state world modeling module. No equations, first-principles derivations, or predictions are described that reduce by construction to fitted inputs, self-definitions, or self-citation chains. Claims rest on benchmark comparisons and ablation studies rather than renaming known results or smuggling ansatzes via prior self-work. This matches the provided reader's assessment that no abstract-level derivations reduce to inputs by construction, confirming the derivation chain is self-contained and externally falsifiable via reported performance numbers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the gate and world model are described at high level without detailing their internal assumptions or fitted values.

pith-pipeline@v0.9.1-grok · 5847 in / 1052 out tokens · 32351 ms · 2026-06-26T04:37:28.800284+00:00 · methodology

