FAIR_XAI: Improving Multimodal Foundation Model Fairness via Explainability for Wellbeing Assessment
Pith reviewed 2026-05-08 06:13 UTC · model grok-4.3
The pith
XAI interventions on vision-language models for depression prediction achieve procedural fairness but fail to ensure equitable outcomes across demographics.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying explainability-based interventions to multimodal foundation models for wellbeing assessment produces mixed results: fairness prompting can achieve perfect equal opportunity on certain models and datasets but at a substantial accuracy cost, while other interventions improve procedural consistency without guaranteeing outcome fairness and can sometimes amplify racial bias.
What carries the argument
The XAI intervention framework that combines fairness prompting with explainability techniques applied to VLMs such as Qwen2-VL and Phi-3.5-Vision.
If this is right
- Fairness prompting can remove gender-based equal-opportunity violations in depression prediction for models like Qwen2-VL.
- Explainability interventions raise procedural consistency on lab datasets but do not ensure equitable outcome distributions.
- Future methods must jointly target predictive accuracy, demographic parity, and generalization across controlled and naturalistic settings.
- Racial bias can increase rather than decrease after some explainability steps on certain architectures and data.
Where Pith is reading between the lines
- Bias mitigation for clinical multimodal models may require embedding fairness constraints during training instead of applying them after the fact.
- The gap between transparency and equity suggests that evaluation protocols should include explicit accuracy-fairness trade-off curves rather than separate metrics.
- Extending the framework to additional mental-health tasks could test whether the observed patterns are specific to depression labels.
Load-bearing premise
That the chosen fairness metrics (such as equal opportunity) and the observed demographic biases reflect genuine real-world disparities, rather than dataset artifacts, label noise, or unmeasured interactions.
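Equal opportunity (commonly attributed to Hardt et al., 2016) requires equal true-positive rates across demographic groups. A minimal sketch of how such a gap could be computed, using toy arrays rather than the paper's data:

```python
import numpy as np

def equal_opportunity_gap(y_true, y_pred, group):
    """Max absolute TPR difference between demographic groups.

    Equal opportunity holds when the true-positive rate (recall on the
    positive class) is equal across groups; a gap of 0.0 meets the criterion.
    """
    y_true, y_pred, group = map(np.asarray, (y_true, y_pred, group))
    tprs = []
    for g in np.unique(group):
        mask = (group == g) & (y_true == 1)   # positives belonging to group g
        tprs.append(y_pred[mask].mean())      # TPR for group g
    return max(tprs) - min(tprs)

# Illustrative labels: group 0 catches 2/2 positives, group 1 only 1/2.
y_true = [1, 1, 0, 1, 1, 0]
y_pred = [1, 1, 0, 1, 0, 0]
group  = [0, 0, 0, 1, 1, 1]
print(equal_opportunity_gap(y_true, y_pred, group))  # 0.5
```

The "perfect equal opportunity" reported for Qwen2-VL corresponds to this gap reaching zero, which says nothing by itself about overall accuracy.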
What would settle it
A follow-up experiment on an independent wellbeing dataset in which the same interventions simultaneously raise accuracy, eliminate both gender and racial disparities, and maintain cross-domain performance would show the claimed persistent gap does not hold.
Original abstract
In recent years, the integration of multimodal machine learning in wellbeing assessment has offered transformative potential for monitoring mental health. However, with the rapid advancement of Vision-Language Models (VLMs), their deployment in clinical settings has raised concerns due to their lack of transparency and potential for bias. While previous research has explored the intersection of fairness and Explainable AI (XAI), its application to VLMs for wellbeing assessment and depression prediction remains under-explored. This work investigates VLM performance across laboratory (AFAR-BSFT) and naturalistic (E-DAIC) datasets, focusing on diagnostic reliability and demographic fairness. Performance varied substantially across environments and architectures; Phi3.5-Vision achieved 80.4% accuracy on E-DAIC, while Qwen2-VL struggled at 33.9%. Additionally, both models demonstrated a tendency to over-predict depression on AFAR-BSFT. Although bias existed across both architectures, Qwen2-VL showed higher gender disparities, while Phi-3.5-Vision exhibited more racial bias. Our XAI intervention framework yielded mixed results; fairness prompting achieved perfect equal opportunity for Qwen2-VL at a severe accuracy cost on E-DAIC. On AFAR-BSFT, explainability-based interventions improved procedural consistency but did not guarantee outcome fairness, sometimes amplifying racial bias. These results highlight a persistent gap between procedural transparency and equitable outcomes. We analyse these findings and consolidate concrete recommendations for addressing them, emphasising that future fairness interventions must jointly optimise predictive accuracy, demographic parity, and cross-domain generalisation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper evaluates two vision-language models (Phi-3.5-Vision and Qwen2-VL) on depression prediction using the laboratory AFAR-BSFT and naturalistic E-DAIC datasets. It reports substantial performance variation (e.g., 80.4% accuracy for Phi-3.5-Vision vs. 33.9% for Qwen2-VL on E-DAIC), tendencies to over-predict depression, demographic biases (gender for Qwen2-VL, racial for Phi-3.5-Vision), and mixed outcomes from an XAI intervention framework: fairness prompting yields perfect equal opportunity for Qwen2-VL at high accuracy cost on E-DAIC, while explainability interventions on AFAR-BSFT improve procedural consistency but can amplify racial bias. The central claim is that these results demonstrate a persistent gap between procedural transparency and equitable outcomes, leading to recommendations for jointly optimizing accuracy, demographic parity, and cross-domain generalization.
Significance. If the empirical findings hold after proper validation, the work would usefully document the limitations of current XAI techniques for mitigating bias in multimodal foundation models applied to mental-health assessment. It draws attention to the difficulty of translating procedural improvements into outcome fairness and supplies concrete recommendations that could inform future deployment guidelines in clinical AI.
Major comments (2)
- [Abstract] Abstract and evaluation sections: the reported point estimates (80.4% vs. 33.9% accuracy, perfect equal opportunity after prompting) are presented without statistical significance tests, confidence intervals, per-subgroup sample sizes, or controls for label noise and dataset shift between AFAR-BSFT and E-DAIC. These omissions make it impossible to assess whether the observed bias amplification or accuracy-fairness trade-offs are robust or could be artifacts of the data.
- [Results] Results and discussion: the claim that explainability-based interventions sometimes amplify racial bias rests on the unverified assumption that the chosen fairness metrics (equal opportunity, demographic parity) faithfully capture real disparities rather than confounding factors such as label noise correlated with race or gender. No sensitivity analyses or noise-modeling checks are described.
Minor comments (2)
- [Abstract] The abstract and introduction would benefit from explicit definitions or citations for the fairness metrics (equal opportunity, demographic parity) used throughout.
- [Figures/Tables] Figure and table captions should include the exact demographic subgroup sizes and any preprocessing steps applied to AFAR-BSFT and E-DAIC to improve reproducibility.
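The two metrics the minor comment asks to define have standard formulations. Complementing equal opportunity, demographic parity compares positive-prediction rates across groups while ignoring ground truth entirely. A minimal sketch with toy arrays, not the paper's data:

```python
import numpy as np

def demographic_parity_gap(y_pred, group):
    """Max absolute difference in positive-prediction rate across groups.

    Demographic parity asks P(pred = 1 | group) to be equal; unlike
    equal opportunity, the true labels play no role.
    """
    y_pred, group = np.asarray(y_pred), np.asarray(group)
    rates = [y_pred[group == g].mean() for g in np.unique(group)]
    return max(rates) - min(rates)

# Group 0 is flagged as depressed 3/4 of the time, group 1 only 1/4.
print(demographic_parity_gap([1, 1, 1, 0, 0, 0, 0, 1],
                             [0, 0, 0, 0, 1, 1, 1, 1]))  # 0.5
```

Because the two criteria condition on different events, a model can satisfy one while badly violating the other, which is one reason the paper's mixed intervention results are plausible.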
Simulated Author's Rebuttal
We thank the referee for their detailed and constructive feedback, which underscores the need for greater statistical rigor and validation in our empirical analysis of VLM fairness and XAI interventions. We address each major comment point by point below, indicating the revisions we will make to strengthen the manuscript.
Point-by-point responses
Referee: [Abstract] Abstract and evaluation sections: the reported point estimates (80.4% vs. 33.9% accuracy, perfect equal opportunity after prompting) are presented without statistical significance tests, confidence intervals, per-subgroup sample sizes, or controls for label noise and dataset shift between AFAR-BSFT and E-DAIC. These omissions make it impossible to assess whether the observed bias amplification or accuracy-fairness trade-offs are robust or could be artifacts of the data.
Authors: We agree that the absence of statistical tests, confidence intervals, and subgroup sample sizes limits the ability to evaluate robustness. In the revised manuscript, we will add bootstrap confidence intervals (e.g., 95% CI via 1000 resamples) for all reported accuracy and fairness metrics, include appropriate significance tests such as McNemar's test for accuracy differences between models and chi-squared or Fisher's exact tests for fairness metric disparities, and explicitly tabulate per-subgroup sample sizes (by gender and race) for both AFAR-BSFT and E-DAIC. For label noise and dataset shift, we will expand the limitations section to discuss these factors based on available dataset documentation and, where feasible, perform a basic sensitivity check by stratifying results by recording condition or annotator agreement metadata. These changes will be incorporated into the evaluation sections and abstract summary. revision: yes
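The promised percentile bootstrap can be sketched as follows; the resample count, seed, and toy labels are illustrative assumptions, not the paper's setup:

```python
import numpy as np

def bootstrap_ci(y_true, y_pred, metric, n_boot=1000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a paired metric.

    Resamples (y_true, y_pred) pairs with replacement n_boot times and
    returns the (alpha/2, 1 - alpha/2) percentile interval of the metric.
    """
    rng = np.random.default_rng(seed)
    y_true, y_pred = np.asarray(y_true), np.asarray(y_pred)
    n = len(y_true)
    stats = []
    for _ in range(n_boot):
        idx = rng.integers(0, n, size=n)              # resample with replacement
        stats.append(metric(y_true[idx], y_pred[idx]))
    return tuple(np.percentile(stats, [100 * alpha / 2, 100 * (1 - alpha / 2)]))

accuracy = lambda t, p: float((t == p).mean())

# Toy labels standing in for one model's predictions on one dataset.
y_true = np.tile([1, 0, 1, 1, 0, 1, 0, 1], 10)
y_pred = np.tile([1, 0, 1, 0, 0, 1, 1, 1], 10)
print(accuracy(y_true, y_pred), bootstrap_ci(y_true, y_pred, accuracy))
```

The same resampling indices can be reused to bootstrap the fairness metrics jointly with accuracy, so the trade-off curve carries its own uncertainty band.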
Referee: [Results] Results and discussion: the claim that explainability-based interventions sometimes amplify racial bias rests on the unverified assumption that the chosen fairness metrics (equal opportunity, demographic parity) faithfully capture real disparities rather than confounding factors such as label noise correlated with race or gender. No sensitivity analyses or noise-modeling checks are described.
Authors: We concur that sensitivity analyses are necessary to support claims about bias amplification. In the revision, we will introduce a new subsection under Results that conducts sensitivity analyses: we will simulate varying levels of label noise (e.g., 5-20% flip rates) correlated with race and gender using the available demographic annotations, recompute equal opportunity and demographic parity, and report how the observed amplification effects change. We will also explicitly state in the discussion that these metrics serve as standard proxies and may be influenced by unmeasured confounders, while emphasizing that our core finding—the gap between procedural XAI improvements and outcome fairness—persists across the tested conditions. This addition will qualify our interpretation without altering the central conclusions. revision: yes
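The proposed sensitivity check can be sketched as follows: flip labels at a group-dependent rate and recompute a fairness gap. The flip rates, seed, and perfectly-accurate toy predictor are illustrative assumptions; the point is that group-correlated label noise alone can manufacture an apparent TPR gap:

```python
import numpy as np

def tpr_gap(y_true, y_pred, group):
    """Max absolute true-positive-rate difference across groups."""
    tprs = [y_pred[(group == g) & (y_true == 1)].mean()
            for g in np.unique(group)]
    return max(tprs) - min(tprs)

def noisy_gap(y_true, y_pred, group, rates, seed=0):
    """Recompute the TPR gap after group-correlated label flips.

    `rates` maps each group id to its flip probability, so noise can be
    correlated with a demographic attribute, as in the proposed check.
    """
    rng = np.random.default_rng(seed)
    y = y_true.copy()
    for g, r in rates.items():
        mask = (group == g) & (rng.random(len(y)) < r)
        y[mask] = 1 - y[mask]          # flip labels for a fraction of group g
    return tpr_gap(y, y_pred, group)

rng = np.random.default_rng(1)
group  = rng.integers(0, 2, size=400)
y_true = rng.integers(0, 2, size=400)
y_pred = y_true.copy()                 # perfectly fair predictor on clean labels
for r in (0.05, 0.10, 0.20):
    print(r, noisy_gap(y_true, y_pred, group, {0: r, 1: 0.0}))
```

If the reported bias amplification survives this sweep with the gap ordering intact, the confounding explanation loses force; if the gap tracks the injected noise, the metric is partly measuring annotation quality.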
Circularity Check
No circularity; purely empirical evaluation on external datasets
Full rationale
The paper conducts an empirical study of VLM performance and XAI interventions for fairness in wellbeing assessment, reporting accuracy, bias metrics, and intervention outcomes on the public AFAR-BSFT and E-DAIC datasets. No mathematical derivations, equations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. All claims rest on direct experimental results and standard fairness metrics applied to external data, with no load-bearing step that reduces by construction to the paper's own inputs. This is self-contained empirical work with no derivation chain to inspect for circularity.