pith. machine review for the scientific record.

arxiv: 2604.02798 · v1 · submitted 2026-04-03 · 💻 cs.MM

Recognition: unknown

Differential Mental Disorder Detection with Psychology-Inspired Multimodal Stimuli

Jingjing Wu, Junyu Guo, Qiqi Zhao, Qi Wang, Richang Hong, Shijie Hao, Xiaowei Zhang, Yanrong Guo, Yuqi Chu, Zhibo Lei, Zhiyuan Zhou, Zhongcheng Yu

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 19:02 UTC · model grok-4.3

classification 💻 cs.MM
keywords differential mental disorder detection · multimodal dataset · psychology-inspired stimuli · depression · anxiety · schizophrenia · prompt-guided learning

The pith

Psychology-inspired multimodal stimuli enable more accurate differential detection of overlapping mental disorders.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the challenge of distinguishing mental disorders such as depression, anxiety, and schizophrenia when symptoms overlap in clinical practice. It proposes using stimuli drawn from experimental psychology to provoke emotional, cognitive, and behavioral responses that reveal disorder-specific patterns. From this approach the authors build a large multimodal dataset with labels verified by psychiatrists and introduce a framework that feeds inter-disorder prior knowledge into the model as prompt descriptions. Experiments indicate that the resulting representations outperform standard baselines, suggesting the stimulus design helps isolate the differences that matter for differential diagnosis.

Core claim

Psychology-inspired multimodal stimuli grounded in experimental psychology findings elicit heterogeneous signals across emotional, cognitive, and behavioral dimensions; a paradigm-aware multimodal framework then uses inter-disorder prior knowledge expressed as prompt-guided semantic descriptions to capture task-specific affective and interaction contexts, producing improved representations for differential detection across depression, anxiety, and schizophrenia in a clinically verified dataset.

What carries the argument

The paradigm-aware multimodal framework, which encodes prior knowledge of inter-disorder differences as prompt-guided semantic descriptions in order to model task-specific affective and interaction contexts.
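
How that prompt guidance might look in code is not spelled out above, so the following is a minimal sketch, not the authors' implementation: inter-disorder differences are written as prompt embeddings that act as attention queries over projected audio, video, and text tokens, and the prompt-conditioned representation feeds a four-class classifier. The prompt strings, dimensions, module names, and the cross-attention design are all illustrative assumptions.

```python
# Minimal sketch (assumed design, not the authors' architecture): inter-disorder
# prior knowledge phrased as prompts, used as queries that attend over fused
# audio/video/text tokens before a four-class diagnostic classifier.
import torch
import torch.nn as nn

# Hypothetical prompt descriptions of inter-disorder differences; a real system
# would encode such text with a pretrained language or vision-language encoder.
DISORDER_PROMPTS = [
    "blunted facial affect and slowed speech typical of depression",
    "heightened arousal and avoidance behavior typical of anxiety",
    "disorganized speech and flattened prosody typical of schizophrenia",
]

class PromptGuidedFusion(nn.Module):
    def __init__(self, feat_dim=256, prompt_dim=256, n_heads=4, n_classes=4):
        super().__init__()
        # Learnable embeddings standing in for encoded prompt descriptions.
        self.prompt_emb = nn.Parameter(torch.randn(len(DISORDER_PROMPTS), prompt_dim))
        # Per-modality projections into a shared token space.
        self.proj = nn.ModuleDict(
            {m: nn.Linear(feat_dim, prompt_dim) for m in ("audio", "video", "text")}
        )
        # Prompts act as queries over the concatenated modality tokens.
        self.cross_attn = nn.MultiheadAttention(prompt_dim, n_heads, batch_first=True)
        self.classifier = nn.Linear(prompt_dim * len(DISORDER_PROMPTS), n_classes)

    def forward(self, feats):
        # feats: dict of (batch, seq_len, feat_dim) tensors, one per modality.
        tokens = torch.cat([self.proj[m](x) for m, x in feats.items()], dim=1)
        queries = self.prompt_emb.unsqueeze(0).expand(tokens.size(0), -1, -1)
        fused, _ = self.cross_attn(queries, tokens, tokens)
        return self.classifier(fused.flatten(1))  # logits for the 4-class task

if __name__ == "__main__":
    model = PromptGuidedFusion()
    batch = {m: torch.randn(2, 10, 256) for m in ("audio", "video", "text")}
    print(model(batch).shape)  # torch.Size([2, 4])
```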

If this is right

  • The MMH dataset supplies a publicly usable resource of multimodal recordings with psychiatrist-confirmed labels across three disorders.
  • The prompt-guided incorporation of inter-disorder knowledge improves modeling of heterogeneous signals from varied elicitation tasks.
  • Consistent gains over baselines indicate that stimulus design choices affect the separability of disorder representations.
  • The same paradigm can be applied to additional elicitation tasks or modalities while retaining the prompt structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar stimulus sets could be adapted for clinical intake interviews to reduce misdiagnosis rates in comorbid cases.
  • The framework structure may transfer to other domains that require distinguishing conditions with shared surface features.
  • Longitudinal versions of the stimuli could support monitoring of treatment response rather than one-time diagnosis.

Load-bearing premise

Psychology-inspired multimodal stimuli reliably produce disorder-specific response patterns that remain distinguishable despite symptom overlap, and the prompt-guided prior knowledge successfully encodes those distinctions in the learned representations.

What would settle it

Re-running the experiments with non-psychology-inspired or neutral stimuli and obtaining comparable performance gains would show that the stimulus design itself is not required for the reported improvements.
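
As a concrete illustration of that test, a minimal sketch of the comparison under stated assumptions: the same off-the-shelf classifier is trained on features elicited under psychology-inspired versus neutral stimuli, and cross-validated accuracy is compared. The feature matrices, labels, and scikit-learn pipeline here are synthetic placeholders, not the authors' models or the MMH data.

```python
# Sketch of the settling experiment: identical classifier, two stimulus
# conditions. The arrays below are synthetic placeholders for features
# elicited under psychology-inspired vs. neutral stimuli.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n, d = 300, 64
y = rng.integers(0, 4, size=n)  # 4-class diagnostic labels (placeholder)
X_psych = rng.normal(size=(n, d)) + 0.5 * y[:, None]    # assumed: more separable
X_neutral = rng.normal(size=(n, d)) + 0.1 * y[:, None]  # assumed: less separable

clf = LogisticRegression(max_iter=1000)
for name, X in [("psychology-inspired", X_psych), ("neutral", X_neutral)]:
    acc = cross_val_score(clf, X, y, cv=5, scoring="accuracy")
    print(f"{name:>20s}: {acc.mean():.3f} +/- {acc.std():.3f}")
# Comparable accuracy under neutral stimuli would undercut the stimulus-design claim.
```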

Figures

Figures reproduced from arXiv: 2604.02798 by Jingjing Wu, Junyu Guo, Qiqi Zhao, Qi Wang, Richang Hong, Shijie Hao, Xiaowei Zhang, Yanrong Guo, Yuqi Chu, Zhibo Lei, Zhiyuan Zhou, Zhongcheng Yu.

Figure 1: Comparison between conventional single-disorder
Figure 2: Overview of the psychology-inspired multimodal stimulus paradigm.
Figure 3: The overview of the proposed paradigm-level prompt-guided learning framework, taking the four-class downstream task as
Figure 4: Modality-level ablation study.
Figure 5: Module-level ablation study on the paradigm-aware
Figure 6: t-SNE visualization for different diagnosis tasks.
Original abstract

Differential diagnosis of mental disorders remains a fundamental challenge in real-world clinical practice, where multiple conditions often exhibit overlapping symptoms. However, most existing public datasets are developed under single-disorder settings and rely on limited data elicitation paradigms, restricting their ability to capture disorder-specific patterns. In this work, we investigate differential mental disorder detection through psychology-inspired multimodal stimuli, designed to elicit diverse emotional, cognitive, and behavioral responses grounded in findings from experimental psychology. Based on this paradigm, we collect a large-scale multimodal mental health dataset (MMH) covering depression, anxiety, and schizophrenia, with all diagnostic labels clinically verified by licensed psychiatrists. To effectively model the heterogeneous signals induced by diverse elicitation tasks, we further propose a paradigm-aware multimodal framework that leverages inter-disorder differences prior knowledge as prompt-guided semantic descriptions to capture task-specific affective and interaction contexts for multimodal representation learning in the new differential mental disorder detection task. Extensive experiments show that our framework consistently outperforms existing baselines, underscoring the value of psychology-inspired stimulus design for differential mental disorder detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes psychology-inspired multimodal stimuli grounded in experimental psychology to elicit emotional, cognitive, and behavioral responses for differential detection of depression, anxiety, and schizophrenia. It introduces the MMH dataset with psychiatrist-verified labels and a paradigm-aware multimodal framework that incorporates inter-disorder prior knowledge via prompt-guided semantic descriptions for task-specific representation learning. Extensive experiments are reported to show consistent outperformance over baselines.

Significance. If the central claims hold after validation, the work could advance multimodal mental health AI by addressing symptom overlap through targeted elicitation paradigms and prior-informed prompting, while releasing a new clinically verified dataset that enables research on differential diagnosis rather than single-disorder settings.

major comments (2)
  1. [Abstract and §4 (Experiments)] Abstract and §4 (Experiments): the claim of consistent outperformance attributable to psychology-inspired stimuli and prompt-guided inter-disorder priors is not supported by intermediate evidence such as statistical tests on raw multimodal response distributions, ablations isolating stimulus design from fusion or scale effects, or comparisons against non-psychology baselines; without these, gains could stem from dataset construction or prompt engineering alone.
  2. [§3.1 (Stimuli Design)] §3.1 (Stimuli Design): the assertion that stimuli reliably elicit disorder-specific patterns separable from overlapping symptoms lacks direct validation (e.g., no reported metrics on response separability or psychiatrist verification of elicited differences beyond labels); this is load-bearing for the differential-detection contribution.
minor comments (2)
  1. [§3.2 (Dataset)] Provide participant demographics, exact stimulus protocols, and inter-rater reliability statistics for the MMH dataset to support reproducibility.
  2. [Tables in §4] Include error bars, statistical significance tests, and full hyperparameter details in all result tables.
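
For illustration, a minimal sketch of the reporting asked for in minor comment 2, with placeholder per-seed scores standing in for any real results from the paper:

```python
# Sketch of per-table reporting: mean +/- std over seeds and a paired
# significance test against the strongest baseline (placeholder scores).
import numpy as np
from scipy.stats import ttest_rel

rng = np.random.default_rng(0)
ours = 0.74 + 0.01 * rng.standard_normal(5)      # hypothetical per-seed F1
baseline = 0.72 + 0.01 * rng.standard_normal(5)  # hypothetical per-seed F1

print(f"ours:     {ours.mean():.3f} +/- {ours.std(ddof=1):.3f}")
print(f"baseline: {baseline.mean():.3f} +/- {baseline.std(ddof=1):.3f}")
t, p = ttest_rel(ours, baseline)  # paired across identical seeds/splits
print(f"paired t-test: t = {t:.2f}, p = {p:.4f}")
```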

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to strengthen the evidence for our claims.

Point-by-point responses
  1. Referee: [Abstract and §4 (Experiments)] Abstract and §4 (Experiments): the claim of consistent outperformance attributable to psychology-inspired stimuli and prompt-guided inter-disorder priors is not supported by intermediate evidence such as statistical tests on raw multimodal response distributions, ablations isolating stimulus design from fusion or scale effects, or comparisons against non-psychology baselines; without these, gains could stem from dataset construction or prompt engineering alone.

    Authors: We agree that intermediate evidence is needed to attribute gains specifically to the psychology-inspired stimuli and prompt-guided priors. In the revised manuscript we will add (i) statistical tests (e.g., ANOVA with post-hoc comparisons) on the raw multimodal response distributions across disorders, (ii) systematic ablations that isolate stimulus design from fusion architecture and model scale, and (iii) additional baselines that use generic (non-psychology) elicitation paradigms. These additions will clarify whether performance improvements arise from the proposed paradigm or from dataset construction and prompt engineering alone. revision: yes

  2. Referee: [§3.1 (Stimuli Design)] §3.1 (Stimuli Design): the assertion that stimuli reliably elicit disorder-specific patterns separable from overlapping symptoms lacks direct validation (e.g., no reported metrics on response separability or psychiatrist verification of elicited differences beyond labels); this is load-bearing for the differential-detection contribution.

    Authors: The stimuli were selected from established experimental-psychology paradigms known to target distinct emotional, cognitive, and behavioral dimensions. All diagnostic labels were independently verified by licensed psychiatrists. We acknowledge, however, that explicit quantitative validation of elicited separability is currently absent. In the revision we will report response-separability metrics (e.g., silhouette scores and linear-discriminant separability on multimodal features) and will include any available psychiatrist commentary on the observed differences in elicited responses. If additional clinical verification is required, we will obtain it from the same psychiatrist panel. revision: yes
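
A hedged sketch of the analyses promised in both responses, per-feature one-way ANOVA across diagnostic groups together with silhouette and cross-validated linear-discriminant separability, run here on synthetic stand-in features rather than MMH recordings:

```python
# Sketch of the promised validation: per-feature one-way ANOVA across the
# diagnostic groups, plus separability metrics on the elicited features.
# All data here are synthetic stand-ins, not MMH recordings.
import numpy as np
from scipy.stats import f_oneway
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.metrics import silhouette_score
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n_per_group, d = 80, 32
groups = ["depression", "anxiety", "schizophrenia", "control"]
# Placeholder features with a small group-dependent shift.
X = np.vstack(
    [rng.normal(loc=0.3 * g, size=(n_per_group, d)) for g in range(len(groups))]
)
y = np.repeat(np.arange(len(groups)), n_per_group)

# (i) One-way ANOVA per feature dimension: does it differ across groups?
f_stats, p_vals = f_oneway(*[X[y == g] for g in range(len(groups))])
print(f"features with p < 0.01 (uncorrected): {(p_vals < 0.01).sum()} / {d}")

# (ii) Response separability: silhouette score and linear-discriminant accuracy.
print(f"silhouette score: {silhouette_score(X, y):.3f}")
lda_acc = cross_val_score(LinearDiscriminantAnalysis(), X, y, cv=5)
print(f"LDA accuracy: {lda_acc.mean():.3f} +/- {lda_acc.std():.3f}")
```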

Circularity Check

0 steps flagged

No circularity: empirical claims rest on independent benchmarking

full rationale

The paper advances an empirical contribution: a psychology-grounded stimulus protocol, a new multimodal dataset with psychiatrist-verified labels, and a prompt-guided fusion framework. Performance superiority is asserted solely via comparative experiments against external baselines on held-out data. No equations, parameter-fitting steps, or derivations appear that would reduce any reported gain to a quantity defined inside the paper itself. References to experimental psychology are external and non-self-citing; no uniqueness theorem or ansatz is imported from prior author work to force the architecture. The central claim therefore remains falsifiable by replication on the released dataset and does not collapse into self-definition or fitted-input renaming.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that psychology-inspired stimuli produce distinguishable disorder-specific signals and that prior inter-disorder knowledge can be effectively encoded as prompts without introducing circularity or overfitting.

axioms (1)
  • domain assumption: Psychology-inspired multimodal stimuli elicit diverse emotional, cognitive, and behavioral responses that differ across disorders, grounded in findings from experimental psychology.
    Invoked to justify the data collection paradigm and the value of the new stimuli over standard elicitation methods.

pith-pipeline@v0.9.0 · 5515 in / 1249 out tokens · 49395 ms · 2026-05-13T19:02:23.537741+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages · 4 internal anchors
