pith. sign in

arxiv: 2606.30550 · v1 · pith:AL346ZBEnew · submitted 2026-06-29 · 💻 cs.SD

SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition

Pith reviewed 2026-06-30 04:41 UTC · model grok-4.3

classification 💻 cs.SD
keywords adversarial attacksspeech emotion recognitionsparse perturbationssaliency mapsexplainable AIself-supervised featuresIEMOCAPTESS
0
0 comments X

The pith

SIGMA computes a single saliency-guided mask from post-hoc XAI on self-supervised speech features to restrict sparse adversarial perturbations on emotion recognition models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SIGMA as a method to perform sparse attacks on speech emotion recognition while incorporating explainability. It applies post-hoc XAI techniques to generate saliency maps that define the mask scope, then limits magnitude-bounded updates to that mask. The mask is generated once from self-supervised features and reused across different models and attack variants. A reader would care because it offers an interpretable way to probe vulnerabilities in SER systems deployed in sensitive settings, with competitive success rates under explicit sparsity and magnitude constraints.

Core claim

On self-supervised speech features, post-hoc XAI techniques produce saliency maps that identify the scope of the mask, to which magnitude-bounded updates are then restricted. The mask is computed once and can be reused across models and different sparsity attacks to amortise cost. Under matched budgets and across multiple sparse-attack settings on IEMOCAP and TESS, SIGMA maintains competitive attack success rates while navigating a trade-off between attack efficacy and explanation consistency.

What carries the argument

The saliency-guided sparse mask produced by post-hoc XAI techniques applied to self-supervised speech features, which scopes all subsequent magnitude-bounded perturbations.

If this is right

  • Attack success rates remain competitive with baseline sparse methods under the same budgets on IEMOCAP and TESS.
  • Explanation consistency improves because perturbations are confined to regions highlighted by saliency maps.
  • Computational cost is amortised since one mask can be reused across multiple target models and attack families.
  • The framework applies uniformly to multiple sparse-attack settings and supports analysis of transferability across models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The reusable mask property suggests efficiency advantages when analysing vulnerability across large ensembles of SER models.
  • Incorporating saliency guidance upfront may help identify input regions whose protection could increase overall model robustness.
  • The explicit trade-off documented here provides a concrete axis for comparing future sparse attack methods that also claim interpretability.

Load-bearing premise

Post-hoc XAI techniques on self-supervised speech features produce saliency maps that reliably identify the scope of the mask for effective and consistent sparse perturbations.

What would settle it

An experiment in which explanation consistency metrics show no improvement or decline when the saliency-guided mask is used versus random or non-XAI masks at equivalent attack success rates and sparsity levels.

Figures

Figures reproduced from arXiv: 2606.30550 by Bj\"orn W. Schuller, Qiyang Sun, Yi Chang, Zixing Zhang.

Figure 1
Figure 1. Figure 1: Workflow of SIGMA for SER. Raw waveforms are first encoded by a self-supervised speech encoder into frame-level features. An XAI saliency [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Perturbations on one test utterance (Emo2Vec+BaseModel) under the [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗
read the original abstract

Speech conveys rich emotional information. As Speech Emotion Recognition (SER) is usually deployed in privacy-sensitive and reliability-critical environments, adversarial attacks on SER have attracted increasing attention. Existing sparse attacks control the number of perturbed elements, yet, they often lack explainability guidance and explicit measures of explanation consistency. A unified treatment of sparsity and magnitude constraints is also uncommon. In addition, transferability across attack families and target models remains limited. Hence, we propose a SalIency-Guided sparse Mask Attack (SIGMA). On self-supervised speech features, we use post-hoc explainable artificial intelligence (XAI) techniques to produce saliency maps and identify the scope of the mask, and then restrict magnitude-bounded updates to this mask. The mask is computed once and can be reused across models and different sparsity attacks to amortise cost. We evaluate on the IEMOCAP and TESS datasets. Under matched budgets and across multiple sparse-attack settings, SIGMA maintains competitive attack success rates, navigating a conscious trade-off between attack efficacy and explanation consistency. SIGMA therefore provides an efficient and interpretable framework for analysing the vulnerability and explanation behaviour of SER models under structured perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 2 minor

Summary. The manuscript proposes SIGMA, a saliency-guided sparse mask attack for Speech Emotion Recognition. It applies post-hoc XAI techniques to self-supervised speech features to generate saliency maps that define the scope of a sparse mask, then applies magnitude-bounded perturbations only within this mask. The mask is computed once and reused across models and attack variants. On IEMOCAP and TESS, the method is claimed to maintain competitive attack success rates while navigating a trade-off between efficacy and explanation consistency, providing an efficient, interpretable framework for SER vulnerability analysis.

Significance. If the empirical results hold with proper validation, the approach offers a reusable mask mechanism that amortizes XAI computation and explicitly links sparse attack design to model explanations, which is a constructive contribution to both adversarial robustness in audio and XAI applications in privacy-sensitive SER. The unified handling of sparsity and magnitude budgets is a methodological strength.

major comments (4)
  1. [Abstract] Abstract: the claim that SIGMA 'maintains competitive attack success rates' and navigates a 'conscious trade-off' between efficacy and explanation consistency is unsupported by any quantitative values, baselines, error bars, or metrics for consistency; without these, the central empirical claim cannot be evaluated.
  2. [Method] Method section (description of saliency map usage): the assumption that post-hoc XAI saliency maps on self-supervised features (e.g., wav2vec) reliably identify causally relevant regions for emotion classification is load-bearing for the consistency gain, yet no faithfulness metrics, insertion/deletion scores, or ablation against random masks are reported; if the maps are unfaithful, the method reduces to a standard sparse attack with added overhead.
  3. [Experiments] Experiments section: no sensitivity analysis to XAI technique choice or target model is provided, despite known method-dependence of saliency maps for audio; this directly affects the claim of improved explanation consistency across settings and reuse of the mask.
  4. [Experiments] Experiments section: the manuscript reports no direct comparison of SIGMA-guided masks versus non-XAI sparse attacks under identical sparsity/magnitude budgets, leaving the incremental benefit of saliency guidance on success rates versus consistency unquantified.
minor comments (2)
  1. [Abstract] Abstract: the expansion of SIGMA contains inconsistent capitalization ('SalIency-Guided'); standardize to 'Saliency-Guided'.
  2. [Abstract] Abstract: specify the concrete post-hoc XAI methods employed (e.g., Grad-CAM variant or integrated gradients) to improve reproducibility.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the empirical support and validation of the method.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim that SIGMA 'maintains competitive attack success rates' and navigates a 'conscious trade-off' between efficacy and explanation consistency is unsupported by any quantitative values, baselines, error bars, or metrics for consistency; without these, the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract should contain explicit quantitative support. The experiments section reports attack success rates, baseline comparisons, and consistency metrics under matched sparsity/magnitude budgets on IEMOCAP and TESS. We will revise the abstract to include these specific values, error bars, and a brief definition of the consistency metric. revision: yes

  2. Referee: [Method] Method section (description of saliency map usage): the assumption that post-hoc XAI saliency maps on self-supervised features (e.g., wav2vec) reliably identify causally relevant regions for emotion classification is load-bearing for the consistency gain, yet no faithfulness metrics, insertion/deletion scores, or ablation against random masks are reported; if the maps are unfaithful, the method reduces to a standard sparse attack with added overhead.

    Authors: This concern is valid. The manuscript does not report insertion/deletion scores or random-mask ablations. We will add an ablation comparing SIGMA-guided masks against random masks of identical sparsity and magnitude budgets, and include any post-hoc faithfulness indicators that can be computed on the existing saliency maps to quantify the benefit of the guidance. revision: yes

  3. Referee: [Experiments] Experiments section: no sensitivity analysis to XAI technique choice or target model is provided, despite known method-dependence of saliency maps for audio; this directly affects the claim of improved explanation consistency across settings and reuse of the mask.

    Authors: We acknowledge the known variability of saliency methods. While the current evaluation reuses masks across target models, it does not vary the underlying XAI technique. In revision we will add results using at least one additional XAI method to demonstrate that the mask reuse and consistency benefits are not tied to a single saliency extractor. revision: yes

  4. Referee: [Experiments] Experiments section: the manuscript reports no direct comparison of SIGMA-guided masks versus non-XAI sparse attacks under identical sparsity/magnitude budgets, leaving the incremental benefit of saliency guidance on success rates versus consistency unquantified.

    Authors: We agree that a direct comparison is required to isolate the contribution of saliency guidance. The paper compares SIGMA across attack families but does not benchmark against equivalent non-guided sparse attacks. We will add this head-to-head evaluation under identical budgets, reporting differences in both attack success rate and explanation consistency. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal with no derivation chain

full rationale

The paper proposes SIGMA, a saliency-guided sparse mask attack for SER using post-hoc XAI on self-supervised features. No equations, predictions, fitted parameters, or self-citations are presented that reduce any claim to its inputs by construction. The work consists of a method description followed by empirical evaluation on IEMOCAP and TESS under sparsity/magnitude budgets; the central assumption about XAI map reliability is a methodological choice, not a derived result. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, parameters, or technical derivations; ledger is empty by necessity.

pith-pipeline@v0.9.1-grok · 5746 in / 1055 out tokens · 37785 ms · 2026-06-30T04:41:01.684371+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

57 extracted references · 14 canonical work pages · 1 internal anchor

  1. [1]

    GatedxLSTM: A multimodal affective computing approach for emotion recognition in conversations,

    Y . Li, Q. Sun, S. M. K. Murthy, E. Alturki, and B. W. Schuller, “GatedxLSTM: A multimodal affective computing approach for emotion recognition in conversations,”arXiv preprint:2503.20919, pp. 1–9, 2025

  2. [2]

    Make patient consultation warmer: A clinical application for speech emotion recognition,

    H.-C. Li, T. Pan, M.-H. Lee, and H.-W. Chiu, “Make patient consultation warmer: A clinical application for speech emotion recognition,”Applied Sciences, vol. 11, no. 11, 2021

  3. [3]

    Speech emotion recognition and serious games: An entertaining approach for crowdsourcing annotated samples,

    L. Matsouliadis, E. Siamtanidou, N. Vryzas, and C. Dimoulas, “Speech emotion recognition and serious games: An entertaining approach for crowdsourcing annotated samples,”Information, vol. 16, no. 3, p. 238, 2025

  4. [4]

    The acoustically emotion-aware conversational agent with speech emotion recognition and empathetic responses,

    J. Hu, Y . Huang, X. Hu, and Y . Xu, “The acoustically emotion-aware conversational agent with speech emotion recognition and empathetic responses,”IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 17–30, 2022

  5. [5]

    Towards friendly AI: A comprehensive review and new perspectives on human-AI alignment,

    Q. Sun, Y . Li, E. Alturki, S. M. K. Murthy, and B. W. Schuller, “Towards friendly AI: A comprehensive review and new perspectives on human-AI alignment,”arXiv preprint:2412.15114, pp. 1–15, 2024

  6. [6]

    Adversarial attacks and defenses in deep learning,

    K. Ren, T. Zheng, Z. Qin, and X. Liu, “Adversarial attacks and defenses in deep learning,”Engineering, vol. 6, no. 3, pp. 346–360, 2020

  7. [7]

    Threat of adversarial attacks on deep learning in computer vision: A survey,

    N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,”IEEE Access, vol. 6, pp. 14 410–14 430, 2018. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

  8. [8]

    Adversarial attacks on ASR systems: An overview,

    X. Zhang, H. Tan, X. Huang, D. Zhang, K. Tang, and Z. Gu, “Adversarial attacks on ASR systems: An overview,”arXiv preprint:2208.02250, pp. 1–8, 2022

  9. [9]

    Adversarial attacks and defenses in speaker recognition systems: A survey,

    J. Lan, R. Zhang, Z. Yan, J. Wang, Y . Chen, and R. Hou, “Adversarial attacks and defenses in speaker recognition systems: A survey,”Journal of Systems Architecture, vol. 127, p. 102526, 2022

  10. [10]

    Generating and protecting against adversarial attacks for deep speech-based emotion recognition models,

    Z. Ren, A. Baird, J. Han, Z. Zhang, and B. W. Schuller, “Generating and protecting against adversarial attacks for deep speech-based emotion recognition models,” inProc. ICASSP, 2020, pp. 7184–7188

  11. [11]

    A black-box adversarial attack strategy with adjustable sparsity and generalizability for deep image classifiers,

    A. Ghosh, S. S. Mullick, S. Datta, S. Das, A. K. Das, and R. Mallipeddi, “A black-box adversarial attack strategy with adjustable sparsity and generalizability for deep image classifiers,”Pattern Recognition, vol. 122, p. 108279, 2022

  12. [12]

    Weighted-sampling audio adversarial example attack,

    X. Liu, K. Wan, Y . Ding, X. Zhang, and Q. Zhu, “Weighted-sampling audio adversarial example attack,” inProc. AAAI, 2020, pp. 4908–4915

  13. [13]

    Audio injection adversarial example attack,

    X. Liu, X. Chen, M. Yin, Y . Wang, T. Hu, and K. Ding, “Audio injection adversarial example attack,” inProc. ICML Workshop on Adversarial Machine Learning, 2021

  14. [14]

    STAA-net: A sparse and transferable adversarial attack for speech emotion recognition,

    Y . Chang, Z. Ren, Z. Zhang, X. Jing, K. Qian, X. Shao, B. Hu, T. Schultz, and B. W. Schuller, “STAA-net: A sparse and transferable adversarial attack for speech emotion recognition,”IEEE Transactions on Affective Computing, 2024

  15. [15]

    Boosting adversarial attacks with momentum,

    Y . Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” inProc. CVPR, 2018, pp. 9185– 9193

  16. [16]

    Muting Whisper: A universal acoustic adversarial attack on speech foundation models,

    V . Raina, R. Ma, C. McGhee, K. Knill, and M. Gales, “Muting Whisper: A universal acoustic adversarial attack on speech foundation models,” arXiv preprint:2405.06134, pp. 1–18, 2024

  17. [17]

    Boosting the transferability of audio adversarial examples with acoustic representation optimization,

    W. Jin, J. Su, H. Wang, Y . Ye, and J. Hao, “Boosting the transferability of audio adversarial examples with acoustic representation optimization,” arXiv preprint:2503.19591, pp. 1–16, 2025

  18. [18]

    Focus-shifting attack: An adversarial attack that retains saliency map information and manipulates model explanations,

    Q.-X. Huang, L.-K. Chiang, M.-Y . Chiu, and H.-M. Sun, “Focus-shifting attack: An adversarial attack that retains saliency map information and manipulates model explanations,”IEEE Transactions on Reliability, vol. 73, no. 2, pp. 808–819, 2023

  19. [19]

    Maximal Jacobian-based Saliency Map Attack

    R. Wiyatno and A. Xu, “Maximal Jacobian-based saliency map attack,” arXiv preprint:1808.07945, pp. 1–5, 2018

  20. [20]

    Saliency attack: Towards imper- ceptible black-box adversarial attack,

    Z. Dai, S. Liu, Q. Li, and K. Tang, “Saliency attack: Towards imper- ceptible black-box adversarial attack,”ACM Transactions on Intelligent Systems and Technology., vol. 14, no. 3, pp. 1–20, 2023

  21. [21]

    Hidden Markov model-based speech emotion recognition,

    B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based speech emotion recognition,” inProc. ICASSP, vol. 2, 2003, pp. II–1

  22. [22]

    Audio explanation synthesis with generative foundation models,

    A. Akman, Q. Sun, and B. W. Schuller, “Audio explanation synthesis with generative foundation models,” inProc. ICASSP, 2025, pp. 1–5

  23. [23]

    wav2vec 2.0: A framework for self-supervised learning of speech representations,

    A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. KDDNeurIPS, vol. 33, 2020, pp. 12 449–12 460

  24. [24]

    Speech emotion diarization: Which emotion appears when?

    Y . Wang, M. Ravanelli, A. Nfissi, and A. Yacoubi, “Speech emotion diarization: Which emotion appears when?”arXiv preprint:2306.12991, pp. 1–7, 2023

  25. [25]

    SUPERB: Speech process- ing Universal PERformance benchmark,

    S. w. Yang, P.-H. Chi, Y .-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y .-T. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Linet al., “SUPERB: Speech process- ing Universal PERformance benchmark,”arXiv preprint:2105.01051, pp. 1–6, 2021

  26. [26]

    emotion2vec: Self-supervised pre-training for speech emotion repre- sentation,

    Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion repre- sentation,” inProc. ACL, 2024, pp. 15 747–15 760

  27. [27]

    Ex- HuBERT: Enhancing HuBERT through block extension and fine-tuning on 37 emotion datasets,

    S. Amiriparian, F. Packa ´n, M. Gerczuk, and B. W. Schuller, “Ex- HuBERT: Enhancing HuBERT through block extension and fine-tuning on 37 emotion datasets,”arXiv preprint:2406.10275, pp. 1–5, 2024

  28. [28]

    A fine-tuned wav2vec 2.0/Hu- BERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,

    Y . Wang, A. Boumadane, and A. Heba, “A fine-tuned wav2vec 2.0/Hu- BERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,”arXiv preprint:2111.02735, pp. 1– 7, 2021

  29. [29]

    Emotion recognition from speech using wav2vec 2.0 embeddings,

    L. Pepino, P. Riera, and L. Ferrer, “Emotion recognition from speech using wav2vec 2.0 embeddings,”arXiv preprint:2104.03502, pp. 1–5, 2021

  30. [30]

    Explaining and harnessing adversarial examples,

    I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” inProc. ICLR, 2015

  31. [31]

    Towards deep learning models resistant to adversarial attacks,

    A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” inProc. ICLR, 2018

  32. [32]

    Simple black-box adversarial attacks,

    C. Guo, J. Gardner, Y . You, A. G. Wilson, and K. Weinberger, “Simple black-box adversarial attacks,” inProc. ICML, vol. 97, 2019, pp. 2484– 2493

  33. [33]

    One pixel attack for fooling deep neural networks,

    J. Su, D. V . Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,”IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828–841, 2019

  34. [34]

    Universal adversarial perturbations for speech recogni- tion systems,

    P. Neekhara, S. Hussain, P. Pandey, S. Dubnov, J. J. McAuley, and F. Koushanfar, “Universal adversarial perturbations for speech recogni- tion systems,” inProc. Interspeech, 2019, pp. 481–485

  35. [35]

    Generating transferable adversarial examples for speech classification,

    H. Kim, J. Park, and J. Lee, “Generating transferable adversarial examples for speech classification,”Pattern Recognition, vol. 137, p. 109286, 2023

  36. [36]

    Attack on practical speaker verification system using universal adversarial perturbations,

    W. Zhang, S. Zhao, L. Liu, J. Li, X. Cheng, T. F. Zheng, and X. Hu, “Attack on practical speaker verification system using universal adversarial perturbations,” inProc. ICASSP, 2021, pp. 2575–2579

  37. [37]

    Generative adversarial networks,

    I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020

  38. [38]

    DiffAttack: Imperceptible and transfer- able audio adversarial attack via diffusion model,

    J. Chen, Y . Dai, and F. Huang, “DiffAttack: Imperceptible and transfer- able audio adversarial attack via diffusion model,” inProc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2025, pp. 1–5

  39. [39]

    Enabling fast and universal audio adversarial attack using generative model,

    Y . Xie, Z. Li, C. Shi, J. Liu, Y . Chen, and B. Yuan, “Enabling fast and universal audio adversarial attack using generative model,” inProc. AAAI, 2021, pp. 14 129–14 137

  40. [40]

    EvilHarmony: Stealthy adversarial attacks against black-box speech recognition systems,

    X. Yuan, J. Zhang, F. Guo, K. Chen, X. Wang, S. Zhang, Y . Chen, D. Liu, P. Li, Z. Wang, and R. Zhu, “EvilHarmony: Stealthy adversarial attacks against black-box speech recognition systems,” inProc. IEEE Symp. Secur. Privacy (SP), 2025, pp. 4569–4587

  41. [41]

    Explainable artificial in- telligence for medical applications: A review,

    Q. Sun, A. Akman, and B. W. Schuller, “Explainable artificial in- telligence for medical applications: A review,”ACM Transactions on Computing for Healthcare, vol. 6, no. 2, pp. 1–31, 2025

  42. [42]

    Learning important features through propagating activation differences,

    A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” inProc. ICML, 2017, pp. 3145–3153

  43. [43]

    Explainable ai for healthcare,

    Y . Li, Q. Sun, A. Akman, and B. W. Schuller, “Explainable ai for healthcare,” inHandbook on Smart Health. SAGE Publications 1 Oliver’s Yard, 55 City Road, London, EC1Y 1SP, 2025, pp. 632–652

  44. [44]

    Axiomatic attribution for deep networks,

    M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” inProc. ICML, 2017, pp. 3319–3328

  45. [45]

    Layer-wise relevance propagation: An overview,

    G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. M ¨uller, “Layer-wise relevance propagation: An overview,” inExplainable AI: Interpreting, Explaining and Visualizing Deep Learning, ser. Lecture Notes in Computer Science. Springer, 2019, vol. 11700, pp. 193–209

  46. [46]

    Grad-CAM: Visual explanations from deep networks via gradient-based localization,

    R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” inProc. ICCV, 2017, pp. 618–626

  47. [47]

    “Why should I trust you?

    M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why should I trust you?” explaining the predictions of any classifier,” inProc. KDD, 2016, pp. 1135–1144

  48. [48]

    A unified approach to interpreting model predictions,

    S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” inProc. NeurIPS, vol. 30, 2017

  49. [49]

    Audio explainable artificial intelligence: A review,

    A. Akman and B. W. Schuller, “Audio explainable artificial intelligence: A review,”Intelligent Computing, vol. 2, p. 0074, 2024

  50. [50]

    audiolime: Listenable explanations using source separation,

    V . Haunschmid, E. Manilow, and G. Widmer, “audiolime: Listenable explanations using source separation,”arXiv preprint:2008.00582, pp. 1–5, 2020

  51. [51]

    Unveiling hidden factors: Explainable AI for feature boosting in speech emotion recognition,

    A. Nfissi, W. Bouachir, N. Bouguila, and B. Mishara, “Unveiling hidden factors: Explainable AI for feature boosting in speech emotion recognition,”arXiv preprint:2406.01624, pp. 1–36, 2024

  52. [52]

    Reliable evaluation of at- tribution maps in CNNs: A perturbation-based approach,

    L. Nieradzik, H. Stephani, and J. Keuper, “Reliable evaluation of at- tribution maps in CNNs: A perturbation-based approach,”International Journal of Computer Vision, vol. 133, no. 5, pp. 2392–2409, 2025

  53. [53]

    IEMOCAP: Interactive emotional dyadic motion capture database,

    C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,”Language resources and evaluation, vol. 42, no. 4, pp. 335–359, 2008

  54. [54]

    Self-attention transfer networks for speech emotion recognition,

    Z. Zhao, K. Wang, Z. Bao, Z. Zhang, N. Cummins, S. Sun, H. Wang, J. Tao, and B. W. Schuller, “Self-attention transfer networks for speech emotion recognition,”Virtual Reality & Intelligent Hardware, vol. 3, no. 1, pp. 43–54, 2021

  55. [55]

    Toronto emotional speech set (TESS),

    M. K. Pichora-Fuller and K. Dupuis, “Toronto emotional speech set (TESS),” Scholars Portal Dataverse, 2020. [Online]. Available: https://doi.org/10.5683/SP2/E8H2MF

  56. [56]

    Speech emotion recognition using deep 1D & 2D CNN LSTM networks,

    J. Zhao, X. Mao, and L. Chen, “Speech emotion recognition using deep 1D & 2D CNN LSTM networks,”Biomedical Signal Processing and Control, vol. 47, pp. 312–323, 2019

  57. [57]

    End-to-end speech emotion recognition using deep neural networks,

    P. Tzirakis, J. Zhang, and B. W. Schuller, “End-to-end speech emotion recognition using deep neural networks,” inProc. ICASSP, 2018, pp. 5089–5093