pith. machine review for the scientific record.

arxiv: 2604.02798 · v1 · submitted 2026-04-03 · 💻 cs.MM

Recognition: unknown

Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli

Jingjing Wu, Junyu Guo, Qiqi Zhao, Qi Wang, Richang Hong, Shijie Hao, Xiaowei Zhang, Yanrong Guo, Yuqi Chu, Zhibo Lei, Zhiyuan Zhou, Zhongcheng Yu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:02 UTC · model grok-4.3

classification 💻 cs.MM
keywords differential mental disorder detection · multimodal dataset · psychology-inspired stimuli · depression · anxiety · schizophrenia · prompt-guided learning

The pith

Psychology-inspired multimodal stimuli enable more accurate differential detection of overlapping mental disorders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of distinguishing mental disorders such as depression, anxiety, and schizophrenia when symptoms overlap in clinical practice. It proposes using stimuli drawn from experimental psychology to provoke emotional, cognitive, and behavioral responses that reveal disorder-specific patterns. From this approach the authors build a large multimodal dataset with labels verified by psychiatrists and introduce a framework that feeds inter-disorder prior knowledge into the model as prompt descriptions. Experiments indicate that the resulting representations outperform standard baselines, suggesting the stimulus design helps isolate the differences that matter for differential diagnosis.

Core claim

Psychology-inspired multimodal stimuli grounded in experimental psychology findings elicit heterogeneous signals across emotional, cognitive, and behavioral dimensions; a paradigm-aware multimodal framework then uses inter-disorder prior knowledge expressed as prompt-guided semantic descriptions to capture task-specific affective and interaction contexts, producing improved representations for differential detection across depression, anxiety, and schizophrenia in a clinically verified dataset.

What carries the argument

The paradigm-aware multimodal framework, which encodes prior knowledge of inter-disorder differences as prompt-guided semantic descriptions in order to model task-specific affective and interaction contexts.
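
How that prompt guidance might look in code is not spelled out above, so the following is a minimal sketch, not the authors' implementation: inter-disorder differences are written as prompt embeddings that act as attention queries over projected audio, video, and text tokens, and the prompt-conditioned representation feeds a four-class classifier. The prompt strings, dimensions, module names, and the cross-attention design are all illustrative assumptions.

```python
# Minimal sketch (assumed design, not the authors' architecture): inter-disorder
# prior knowledge phrased as prompts, used as queries that attend over fused
# audio/video/text tokens before a four-class diagnostic classifier.
import torch
import torch.nn as nn

# Hypothetical prompt descriptions of inter-disorder differences; a real system
# would encode such text with a pretrained language or vision-language encoder.
DISORDER_PROMPTS = [
    "blunted facial affect and slowed speech typical of depression",
    "heightened arousal and avoidance behavior typical of anxiety",
    "disorganized speech and flattened prosody typical of schizophrenia",
]

class PromptGuidedFusion(nn.Module):
    def __init__(self, feat_dim=256, prompt_dim=256, n_heads=4, n_classes=4):
        super().__init__()
        # Learnable embeddings standing in for encoded prompt descriptions.
        self.prompt_emb = nn.Parameter(torch.randn(len(DISORDER_PROMPTS), prompt_dim))
        # Per-modality projections into a shared token space.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(feat_dim, prompt_dim) for m in ("audio", "video", "text")}
        )
        # Prompts act as queries over the concatenated modality tokens.
        self.cross_attn = nn.MultiheadAttention(prompt_dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(prompt_dim * len(DISORDER_PROMPTS), n_classes)

    def forward(self, feats):
        # feats: dict of (batch, seq_len, feat_dim) tensors, one per modality.
        tokens = torch.cat([self.proj[m](x) for m, x in feats.items()], dim=1)
        queries = self.prompt_emb.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(queries, tokens, tokens)
        return self.classifier(fused.flatten(1))  # logits for the 4-class task

if __name__ == "__main__":
    model = PromptGuidedFusion()
    batch = {m: torch.randn(2, 10, 256) for m in ("audio", "video", "text")}
    print(model(batch).shape)  # torch.Size([2, 4])
```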

If this is right

  • The MMH dataset supplies a publicly usable resource of multimodal recordings with psychiatrist-confirmed labels across three disorders.
  • The prompt-guided incorporation of inter-disorder knowledge improves modeling of heterogeneous signals from varied elicitation tasks.
  • Consistent gains over baselines indicate that stimulus design choices affect the separability of disorder representations.
  • The same paradigm can be applied to additional elicitation tasks or modalities while retaining the prompt structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar stimulus sets could be adapted for clinical intake interviews to reduce misdiagnosis rates in comorbid cases.
  • The framework structure may transfer to other domains that require distinguishing conditions with shared surface features.
  • Longitudinal versions of the stimuli could support monitoring of treatment response rather than one-time diagnosis.

Load-bearing premise

Psychology-inspired multimodal stimuli reliably produce disorder-specific response patterns that remain distinguishable despite symptom overlap, and the prompt-guided prior knowledge successfully encodes those distinctions in the learned representations.

What would settle it

Re-running the experiments with non-psychology-inspired or neutral stimuli and obtaining comparable performance gains would show that the stimulus design itself is not required for the reported improvements.
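
As a concrete illustration of that test, a minimal sketch of the comparison under stated assumptions: the same off-the-shelf classifier is trained on features elicited under psychology-inspired versus neutral stimuli, and cross-validated accuracy is compared. The feature matrices, labels, and scikit-learn pipeline here are synthetic placeholders, not the authors' models or the MMH data.

```python
# Sketch of the settling experiment: identical classifier, two stimulus
# conditions. The arrays below are synthetic placeholders for features
# elicited under psychology-inspired vs. neutral stimuli.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 300, 64
y = rng.integers(0, 4, size=n)  # 4-class diagnostic labels (placeholder)
X_psych = rng.normal(size=(n, d)) + 0.5 * y[:, None]    # assumed: more separable
X_neutral = rng.normal(size=(n, d)) + 0.1 * y[:, None]  # assumed: less separable

clf = LogisticRegression(max_iter=1000)
for name, X in [("psychology-inspired", X_psych), ("neutral", X_neutral)]:
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name:>20s}: {acc.mean():.3f} +/- {acc.std():.3f}")
# Comparable accuracy under neutral stimuli would undercut the stimulus-design claim.
```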

Figures

Figures reproduced from arXiv: 2604.02798 by Jingjing Wu, Junyu Guo, Qiqi Zhao, Qi Wang, Richang Hong, Shijie Hao, Xiaowei Zhang, Yanrong Guo, Yuqi Chu, Zhibo Lei, Zhiyuan Zhou, Zhongcheng Yu.

Figure 1: Comparison between conventional single-disorder
Figure 2: Overview of the psychology-inspired multimodal stimulus paradigm.
Figure 3: The overview of the proposed paradigm-level prompt-guided learning framework, taking the four-class downstream task as
Figure 4: Modality-level ablation study.
Figure 5: Module-level ablation study on the paradigm-aware
Figure 6: t-SNE visualization for different diagnosis tasks.
Original abstract

Differential diagnosis of mental disorders remains a fundamental challenge in real-world clinical practice, where multiple conditions often exhibit overlapping symptoms. However, most existing public datasets are developed under single-disorder settings and rely on limited data elicitation paradigms, restricting their ability to capture disorder-specific patterns. In this work, we investigate differential mental disorder detection through psychology-inspired multimodal stimuli, designed to elicit diverse emotional, cognitive, and behavioral responses grounded in findings from experimental psychology. Based on this paradigm, we collect a large-scale multimodal mental health dataset (MMH) covering depression, anxiety, and schizophrenia, with all diagnostic labels clinically verified by licensed psychiatrists. To effectively model the heterogeneous signals induced by diverse elicitation tasks, we further propose a paradigm-aware multimodal framework that leverages inter-disorder differences prior knowledge as prompt-guided semantic descriptions to capture task-specific affective and interaction contexts for multimodal representation learning in the new differential mental disorder detection task. Extensive experiments show that our framework consistently outperforms existing baselines, underscoring the value of psychology-inspired stimulus design for differential mental disorder detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes psychology-inspired multimodal stimuli grounded in experimental psychology to elicit emotional, cognitive, and behavioral responses for differential detection of depression, anxiety, and schizophrenia. It introduces the MMH dataset with psychiatrist-verified labels and a paradigm-aware multimodal framework that incorporates inter-disorder prior knowledge via prompt-guided semantic descriptions for task-specific representation learning. Extensive experiments are reported to show consistent outperformance over baselines.

Significance. If the central claims hold after validation, the work could advance multimodal mental health AI by addressing symptom overlap through targeted elicitation paradigms and prior-informed prompting, while releasing a new clinically verified dataset that enables research on differential diagnosis rather than single-disorder settings.

major comments (2)
  1. [Abstract and §4 (Experiments)] Abstract and §4 (Experiments): the claim of consistent outperformance attributable to psychology-inspired stimuli and prompt-guided inter-disorder priors is not supported by intermediate evidence such as statistical tests on raw multimodal response distributions, ablations isolating stimulus design from fusion or scale effects, or comparisons against non-psychology baselines; without these, gains could stem from dataset construction or prompt engineering alone.
  2. [§3.1 (Stimuli Design)] §3.1 (Stimuli Design): the assertion that stimuli reliably elicit disorder-specific patterns separable from overlapping symptoms lacks direct validation (e.g., no reported metrics on response separability or psychiatrist verification of elicited differences beyond labels); this is load-bearing for the differential-detection contribution.
minor comments (2)
  1. [§3.2 (Dataset)] Provide participant demographics, exact stimulus protocols, and inter-rater reliability statistics for the MMH dataset to support reproducibility.
  2. [Tables in §4] Include error bars, statistical significance tests, and full hyperparameter details in all result tables.
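
For illustration, a minimal sketch of the reporting asked for in minor comment 2, with placeholder per-seed scores standing in for any real results from the paper:

```python
# Sketch of per-table reporting: mean +/- std over seeds and a paired
# significance test against the strongest baseline (placeholder scores).
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
ours = 0.74 + 0.01 * rng.standard_normal(5)      # hypothetical per-seed F1
baseline = 0.72 + 0.01 * rng.standard_normal(5)  # hypothetical per-seed F1

print(f"ours:     {ours.mean():.3f} +/- {ours.std(ddof=1):.3f}")
print(f"baseline: {baseline.mean():.3f} +/- {baseline.std(ddof=1):.3f}")
t, p = ttest_rel(ours, baseline)  # paired across identical seeds/splits
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```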

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] Abstract and §4 (Experiments): the claim of consistent outperformance attributable to psychology-inspired stimuli and prompt-guided inter-disorder priors is not supported by intermediate evidence such as statistical tests on raw multimodal response distributions, ablations isolating stimulus design from fusion or scale effects, or comparisons against non-psychology baselines; without these, gains could stem from dataset construction or prompt engineering alone.

    Authors: We agree that intermediate evidence is needed to attribute gains specifically to the psychology-inspired stimuli and prompt-guided priors. In the revised manuscript we will add (i) statistical tests (e.g., ANOVA with post-hoc comparisons) on the raw multimodal response distributions across disorders, (ii) systematic ablations that isolate stimulus design from fusion architecture and model scale, and (iii) additional baselines that use generic (non-psychology) elicitation paradigms. These additions will clarify whether performance improvements arise from the proposed paradigm or from dataset construction and prompt engineering alone. revision: yes

  2. Referee: [§3.1 (Stimuli Design)] §3.1 (Stimuli Design): the assertion that stimuli reliably elicit disorder-specific patterns separable from overlapping symptoms lacks direct validation (e.g., no reported metrics on response separability or psychiatrist verification of elicited differences beyond labels); this is load-bearing for the differential-detection contribution.

    Authors: The stimuli were selected from established experimental-psychology paradigms known to target distinct emotional, cognitive, and behavioral dimensions. All diagnostic labels were independently verified by licensed psychiatrists. We acknowledge, however, that explicit quantitative validation of elicited separability is currently absent. In the revision we will report response-separability metrics (e.g., silhouette scores and linear-discriminant separability on multimodal features) and will include any available psychiatrist commentary on the observed differences in elicited responses. If additional clinical verification is required, we will obtain it from the same psychiatrist panel. revision: yes
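
A hedged sketch of the analyses promised in both responses, per-feature one-way ANOVA across diagnostic groups together with silhouette and cross-validated linear-discriminant separability, run here on synthetic stand-in features rather than MMH recordings:

```python
# Sketch of the promised validation: per-feature one-way ANOVA across the
# diagnostic groups, plus separability metrics on the elicited features.
# All data here are synthetic stand-ins, not MMH recordings.
import numpy as np
from scipy.stats import f_oneway
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_group, d = 80, 32
groups = ["depression", "anxiety", "schizophrenia", "control"]
# Placeholder features with a small group-dependent shift.
X = np.vstack(
    [rng.normal(loc=0.3 * g, size=(n_per_group, d)) for g in range(len(groups))]
)
y = np.repeat(np.arange(len(groups)), n_per_group)

# (i) One-way ANOVA per feature dimension: does it differ across groups?
f_stats, p_vals = f_oneway(*[X[y == g] for g in range(len(groups))])
print(f"features with p < 0.01 (uncorrected): {(p_vals < 0.01).sum()} / {d}")

# (ii) Response separability: silhouette score and linear-discriminant accuracy.
print(f"silhouette score: {silhouette_score(X, y):.3f}")
lda_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
print(f"LDA accuracy: {lda_acc.mean():.3f} +/- {lda_acc.std():.3f}")
```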

Circularity Check

0 steps flagged

No circularity: empirical claims rest on independent benchmarking

full rationale

The paper advances an empirical contribution: a psychology-grounded stimulus protocol, a new multimodal dataset with psychiatrist-verified labels, and a prompt-guided fusion framework. Performance superiority is asserted solely via comparative experiments against external baselines on held-out data. No equations, parameter-fitting steps, or derivations appear that would reduce any reported gain to a quantity defined inside the paper itself. References to experimental psychology are external and non-self-citing; no uniqueness theorem or ansatz is imported from prior author work to force the architecture. The central claim therefore remains falsifiable by replication on the released dataset and does not collapse into self-definition or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that psychology-inspired stimuli produce distinguishable disorder-specific signals and that prior inter-disorder knowledge can be effectively encoded as prompts without introducing circularity or overfitting.

axioms (1)
  • domain assumption: Psychology-inspired multimodal stimuli elicit diverse emotional, cognitive, and behavioral responses that differ across disorders, grounded in findings from experimental psychology.
    Invoked to justify the data collection paradigm and the value of the new stimuli over standard elicitation methods.

pith-pipeline@v0.9.0 · 5515 in / 1249 out tokens · 49395 ms · 2026-05-13T19:02:23.537741+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 4 internal anchors
