SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition

Bj\"orn W. Schuller; Qiyang Sun; Yi Chang; Zixing Zhang

arxiv: 2606.30550 · v1 · pith:AL346ZBEnew · submitted 2026-06-29 · 💻 cs.SD

SIGMA: Saliency-Guided Sparse Mask Attacks for Speech Emotion Recognition

Qiyang Sun , Yi Chang , Zixing Zhang , Bj\"orn W. Schuller This is my paper

Pith reviewed 2026-06-30 04:41 UTC · model grok-4.3

classification 💻 cs.SD

keywords adversarial attacksspeech emotion recognitionsparse perturbationssaliency mapsexplainable AIself-supervised featuresIEMOCAPTESS

0 comments

The pith

SIGMA computes a single saliency-guided mask from post-hoc XAI on self-supervised speech features to restrict sparse adversarial perturbations on emotion recognition models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SIGMA as a method to perform sparse attacks on speech emotion recognition while incorporating explainability. It applies post-hoc XAI techniques to generate saliency maps that define the mask scope, then limits magnitude-bounded updates to that mask. The mask is generated once from self-supervised features and reused across different models and attack variants. A reader would care because it offers an interpretable way to probe vulnerabilities in SER systems deployed in sensitive settings, with competitive success rates under explicit sparsity and magnitude constraints.

Core claim

On self-supervised speech features, post-hoc XAI techniques produce saliency maps that identify the scope of the mask, to which magnitude-bounded updates are then restricted. The mask is computed once and can be reused across models and different sparsity attacks to amortise cost. Under matched budgets and across multiple sparse-attack settings on IEMOCAP and TESS, SIGMA maintains competitive attack success rates while navigating a trade-off between attack efficacy and explanation consistency.

What carries the argument

The saliency-guided sparse mask produced by post-hoc XAI techniques applied to self-supervised speech features, which scopes all subsequent magnitude-bounded perturbations.

If this is right

Attack success rates remain competitive with baseline sparse methods under the same budgets on IEMOCAP and TESS.
Explanation consistency improves because perturbations are confined to regions highlighted by saliency maps.
Computational cost is amortised since one mask can be reused across multiple target models and attack families.
The framework applies uniformly to multiple sparse-attack settings and supports analysis of transferability across models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The reusable mask property suggests efficiency advantages when analysing vulnerability across large ensembles of SER models.
Incorporating saliency guidance upfront may help identify input regions whose protection could increase overall model robustness.
The explicit trade-off documented here provides a concrete axis for comparing future sparse attack methods that also claim interpretability.

Load-bearing premise

Post-hoc XAI techniques on self-supervised speech features produce saliency maps that reliably identify the scope of the mask for effective and consistent sparse perturbations.

What would settle it

An experiment in which explanation consistency metrics show no improvement or decline when the saliency-guided mask is used versus random or non-XAI masks at equivalent attack success rates and sparsity levels.

Figures

Figures reproduced from arXiv: 2606.30550 by Bj\"orn W. Schuller, Qiyang Sun, Yi Chang, Zixing Zhang.

**Figure 1.** Figure 1: Workflow of SIGMA for SER. Raw waveforms are first encoded by a self-supervised speech encoder into frame-level features. An XAI saliency [PITH_FULL_IMAGE:figures/full_fig_p003_1.png] view at source ↗

**Figure 2.** Figure 2: Perturbations on one test utterance (Emo2Vec+BaseModel) under the [PITH_FULL_IMAGE:figures/full_fig_p010_2.png] view at source ↗

read the original abstract

Speech conveys rich emotional information. As Speech Emotion Recognition (SER) is usually deployed in privacy-sensitive and reliability-critical environments, adversarial attacks on SER have attracted increasing attention. Existing sparse attacks control the number of perturbed elements, yet, they often lack explainability guidance and explicit measures of explanation consistency. A unified treatment of sparsity and magnitude constraints is also uncommon. In addition, transferability across attack families and target models remains limited. Hence, we propose a SalIency-Guided sparse Mask Attack (SIGMA). On self-supervised speech features, we use post-hoc explainable artificial intelligence (XAI) techniques to produce saliency maps and identify the scope of the mask, and then restrict magnitude-bounded updates to this mask. The mask is computed once and can be reused across models and different sparsity attacks to amortise cost. We evaluate on the IEMOCAP and TESS datasets. Under matched budgets and across multiple sparse-attack settings, SIGMA maintains competitive attack success rates, navigating a conscious trade-off between attack efficacy and explanation consistency. SIGMA therefore provides an efficient and interpretable framework for analysing the vulnerability and explanation behaviour of SER models under structured perturbations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

SIGMA ties post-hoc XAI saliency maps to sparse attacks on SER with a reusable mask, but the approach rests on unverified assumptions about map quality and supplies no numbers to back the claims.

read the letter

The main thing to know is that SIGMA uses saliency maps from post-hoc XAI on self-supervised speech features to define a mask, then restricts magnitude-bounded sparse perturbations to that mask. The mask is computed once and reused across models and attack variants.

What is new is the explicit combination of saliency guidance with both sparsity and magnitude constraints in one framework for audio, plus the reuse step to cut repeated computation. This is a step past plain sparse attacks that ignore explainability.

The paper does a clear job stating the motivation around privacy-sensitive SER deployments and the value of structured perturbations that also support consistency checks.

The soft spot is the assumption that the saliency maps reliably mark the right regions for effective attacks. Post-hoc XAI maps on audio features are often unfaithful or sensitive to the chosen method, and the text gives no faithfulness metrics, no ablation against random masks, and no sensitivity tests to different XAI techniques. Without those, the consistency gain is not demonstrated. The abstract also states competitive success rates and a trade-off but includes no quantitative results, baselines, or error bars, so the actual performance cannot be judged.

The evaluation is limited to IEMOCAP and TESS under matched budgets, but again the lack of reported figures keeps the strength of the findings unclear.

This paper is for researchers working on adversarial robustness in speech and audio ML who want to add an interpretability angle. A reader who needs concrete methods to test SER models with sparse, guided perturbations would get some value if the full experiments include proper controls.

It deserves a serious referee to examine the results and the XAI assumptions in detail.

I recommend sending it to peer review rather than desk rejection.

Referee Report

4 major / 2 minor

Summary. The manuscript proposes SIGMA, a saliency-guided sparse mask attack for Speech Emotion Recognition. It applies post-hoc XAI techniques to self-supervised speech features to generate saliency maps that define the scope of a sparse mask, then applies magnitude-bounded perturbations only within this mask. The mask is computed once and reused across models and attack variants. On IEMOCAP and TESS, the method is claimed to maintain competitive attack success rates while navigating a trade-off between efficacy and explanation consistency, providing an efficient, interpretable framework for SER vulnerability analysis.

Significance. If the empirical results hold with proper validation, the approach offers a reusable mask mechanism that amortizes XAI computation and explicitly links sparse attack design to model explanations, which is a constructive contribution to both adversarial robustness in audio and XAI applications in privacy-sensitive SER. The unified handling of sparsity and magnitude budgets is a methodological strength.

major comments (4)

[Abstract] Abstract: the claim that SIGMA 'maintains competitive attack success rates' and navigates a 'conscious trade-off' between efficacy and explanation consistency is unsupported by any quantitative values, baselines, error bars, or metrics for consistency; without these, the central empirical claim cannot be evaluated.
[Method] Method section (description of saliency map usage): the assumption that post-hoc XAI saliency maps on self-supervised features (e.g., wav2vec) reliably identify causally relevant regions for emotion classification is load-bearing for the consistency gain, yet no faithfulness metrics, insertion/deletion scores, or ablation against random masks are reported; if the maps are unfaithful, the method reduces to a standard sparse attack with added overhead.
[Experiments] Experiments section: no sensitivity analysis to XAI technique choice or target model is provided, despite known method-dependence of saliency maps for audio; this directly affects the claim of improved explanation consistency across settings and reuse of the mask.
[Experiments] Experiments section: the manuscript reports no direct comparison of SIGMA-guided masks versus non-XAI sparse attacks under identical sparsity/magnitude budgets, leaving the incremental benefit of saliency guidance on success rates versus consistency unquantified.

minor comments (2)

[Abstract] Abstract: the expansion of SIGMA contains inconsistent capitalization ('SalIency-Guided'); standardize to 'Saliency-Guided'.
[Abstract] Abstract: specify the concrete post-hoc XAI methods employed (e.g., Grad-CAM variant or integrated gradients) to improve reproducibility.

Simulated Author's Rebuttal

4 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that strengthen the empirical support and validation of the method.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that SIGMA 'maintains competitive attack success rates' and navigates a 'conscious trade-off' between efficacy and explanation consistency is unsupported by any quantitative values, baselines, error bars, or metrics for consistency; without these, the central empirical claim cannot be evaluated.

Authors: We agree that the abstract should contain explicit quantitative support. The experiments section reports attack success rates, baseline comparisons, and consistency metrics under matched sparsity/magnitude budgets on IEMOCAP and TESS. We will revise the abstract to include these specific values, error bars, and a brief definition of the consistency metric. revision: yes
Referee: [Method] Method section (description of saliency map usage): the assumption that post-hoc XAI saliency maps on self-supervised features (e.g., wav2vec) reliably identify causally relevant regions for emotion classification is load-bearing for the consistency gain, yet no faithfulness metrics, insertion/deletion scores, or ablation against random masks are reported; if the maps are unfaithful, the method reduces to a standard sparse attack with added overhead.

Authors: This concern is valid. The manuscript does not report insertion/deletion scores or random-mask ablations. We will add an ablation comparing SIGMA-guided masks against random masks of identical sparsity and magnitude budgets, and include any post-hoc faithfulness indicators that can be computed on the existing saliency maps to quantify the benefit of the guidance. revision: yes
Referee: [Experiments] Experiments section: no sensitivity analysis to XAI technique choice or target model is provided, despite known method-dependence of saliency maps for audio; this directly affects the claim of improved explanation consistency across settings and reuse of the mask.

Authors: We acknowledge the known variability of saliency methods. While the current evaluation reuses masks across target models, it does not vary the underlying XAI technique. In revision we will add results using at least one additional XAI method to demonstrate that the mask reuse and consistency benefits are not tied to a single saliency extractor. revision: yes
Referee: [Experiments] Experiments section: the manuscript reports no direct comparison of SIGMA-guided masks versus non-XAI sparse attacks under identical sparsity/magnitude budgets, leaving the incremental benefit of saliency guidance on success rates versus consistency unquantified.

Authors: We agree that a direct comparison is required to isolate the contribution of saliency guidance. The paper compares SIGMA across attack families but does not benchmark against equivalent non-guided sparse attacks. We will add this head-to-head evaluation under identical budgets, reporting differences in both attack success rate and explanation consistency. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method proposal with no derivation chain

full rationale

The paper proposes SIGMA, a saliency-guided sparse mask attack for SER using post-hoc XAI on self-supervised features. No equations, predictions, fitted parameters, or self-citations are presented that reduce any claim to its inputs by construction. The work consists of a method description followed by empirical evaluation on IEMOCAP and TESS under sparsity/magnitude budgets; the central assumption about XAI map reliability is a methodological choice, not a derived result. This matches the default expectation for non-circular papers.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, parameters, or technical derivations; ledger is empty by necessity.

pith-pipeline@v0.9.1-grok · 5746 in / 1055 out tokens · 37785 ms · 2026-06-30T04:41:01.684371+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 14 canonical work pages · 1 internal anchor

[1]

GatedxLSTM: A multimodal affective computing approach for emotion recognition in conversations,

Y . Li, Q. Sun, S. M. K. Murthy, E. Alturki, and B. W. Schuller, “GatedxLSTM: A multimodal affective computing approach for emotion recognition in conversations,”arXiv preprint:2503.20919, pp. 1–9, 2025

work page arXiv 2025
[2]

Make patient consultation warmer: A clinical application for speech emotion recognition,

H.-C. Li, T. Pan, M.-H. Lee, and H.-W. Chiu, “Make patient consultation warmer: A clinical application for speech emotion recognition,”Applied Sciences, vol. 11, no. 11, 2021

2021
[3]

Speech emotion recognition and serious games: An entertaining approach for crowdsourcing annotated samples,

L. Matsouliadis, E. Siamtanidou, N. Vryzas, and C. Dimoulas, “Speech emotion recognition and serious games: An entertaining approach for crowdsourcing annotated samples,”Information, vol. 16, no. 3, p. 238, 2025

2025
[4]

The acoustically emotion-aware conversational agent with speech emotion recognition and empathetic responses,

J. Hu, Y . Huang, X. Hu, and Y . Xu, “The acoustically emotion-aware conversational agent with speech emotion recognition and empathetic responses,”IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 17–30, 2022

2022
[5]

Towards friendly AI: A comprehensive review and new perspectives on human-AI alignment,

Q. Sun, Y . Li, E. Alturki, S. M. K. Murthy, and B. W. Schuller, “Towards friendly AI: A comprehensive review and new perspectives on human-AI alignment,”arXiv preprint:2412.15114, pp. 1–15, 2024

work page arXiv 2024
[6]

Adversarial attacks and defenses in deep learning,

K. Ren, T. Zheng, Z. Qin, and X. Liu, “Adversarial attacks and defenses in deep learning,”Engineering, vol. 6, no. 3, pp. 346–360, 2020

2020
[7]

Threat of adversarial attacks on deep learning in computer vision: A survey,

N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,”IEEE Access, vol. 6, pp. 14 410–14 430, 2018. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

2018
[8]

Adversarial attacks on ASR systems: An overview,

X. Zhang, H. Tan, X. Huang, D. Zhang, K. Tang, and Z. Gu, “Adversarial attacks on ASR systems: An overview,”arXiv preprint:2208.02250, pp. 1–8, 2022

work page arXiv 2022
[9]

Adversarial attacks and defenses in speaker recognition systems: A survey,

J. Lan, R. Zhang, Z. Yan, J. Wang, Y . Chen, and R. Hou, “Adversarial attacks and defenses in speaker recognition systems: A survey,”Journal of Systems Architecture, vol. 127, p. 102526, 2022

2022
[10]

Generating and protecting against adversarial attacks for deep speech-based emotion recognition models,

Z. Ren, A. Baird, J. Han, Z. Zhang, and B. W. Schuller, “Generating and protecting against adversarial attacks for deep speech-based emotion recognition models,” inProc. ICASSP, 2020, pp. 7184–7188

2020
[11]

A black-box adversarial attack strategy with adjustable sparsity and generalizability for deep image classifiers,

A. Ghosh, S. S. Mullick, S. Datta, S. Das, A. K. Das, and R. Mallipeddi, “A black-box adversarial attack strategy with adjustable sparsity and generalizability for deep image classifiers,”Pattern Recognition, vol. 122, p. 108279, 2022

2022
[12]

Weighted-sampling audio adversarial example attack,

X. Liu, K. Wan, Y . Ding, X. Zhang, and Q. Zhu, “Weighted-sampling audio adversarial example attack,” inProc. AAAI, 2020, pp. 4908–4915

2020
[13]

Audio injection adversarial example attack,

X. Liu, X. Chen, M. Yin, Y . Wang, T. Hu, and K. Ding, “Audio injection adversarial example attack,” inProc. ICML Workshop on Adversarial Machine Learning, 2021

2021
[14]

STAA-net: A sparse and transferable adversarial attack for speech emotion recognition,

Y . Chang, Z. Ren, Z. Zhang, X. Jing, K. Qian, X. Shao, B. Hu, T. Schultz, and B. W. Schuller, “STAA-net: A sparse and transferable adversarial attack for speech emotion recognition,”IEEE Transactions on Affective Computing, 2024

2024
[15]

Boosting adversarial attacks with momentum,

Y . Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” inProc. CVPR, 2018, pp. 9185– 9193

2018
[16]

Muting Whisper: A universal acoustic adversarial attack on speech foundation models,

V . Raina, R. Ma, C. McGhee, K. Knill, and M. Gales, “Muting Whisper: A universal acoustic adversarial attack on speech foundation models,” arXiv preprint:2405.06134, pp. 1–18, 2024

work page arXiv 2024
[17]

Boosting the transferability of audio adversarial examples with acoustic representation optimization,

W. Jin, J. Su, H. Wang, Y . Ye, and J. Hao, “Boosting the transferability of audio adversarial examples with acoustic representation optimization,” arXiv preprint:2503.19591, pp. 1–16, 2025

work page arXiv 2025
[18]

Focus-shifting attack: An adversarial attack that retains saliency map information and manipulates model explanations,

Q.-X. Huang, L.-K. Chiang, M.-Y . Chiu, and H.-M. Sun, “Focus-shifting attack: An adversarial attack that retains saliency map information and manipulates model explanations,”IEEE Transactions on Reliability, vol. 73, no. 2, pp. 808–819, 2023

2023
[19]

Maximal Jacobian-based Saliency Map Attack

R. Wiyatno and A. Xu, “Maximal Jacobian-based saliency map attack,” arXiv preprint:1808.07945, pp. 1–5, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018
[20]

Saliency attack: Towards imper- ceptible black-box adversarial attack,

Z. Dai, S. Liu, Q. Li, and K. Tang, “Saliency attack: Towards imper- ceptible black-box adversarial attack,”ACM Transactions on Intelligent Systems and Technology., vol. 14, no. 3, pp. 1–20, 2023

2023
[21]

Hidden Markov model-based speech emotion recognition,

B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based speech emotion recognition,” inProc. ICASSP, vol. 2, 2003, pp. II–1

2003
[22]

Audio explanation synthesis with generative foundation models,

A. Akman, Q. Sun, and B. W. Schuller, “Audio explanation synthesis with generative foundation models,” inProc. ICASSP, 2025, pp. 1–5

2025
[23]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. KDDNeurIPS, vol. 33, 2020, pp. 12 449–12 460

2020
[24]

Speech emotion diarization: Which emotion appears when?

Y . Wang, M. Ravanelli, A. Nfissi, and A. Yacoubi, “Speech emotion diarization: Which emotion appears when?”arXiv preprint:2306.12991, pp. 1–7, 2023

work page arXiv 2023
[25]

SUPERB: Speech process- ing Universal PERformance benchmark,

S. w. Yang, P.-H. Chi, Y .-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y .-T. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Linet al., “SUPERB: Speech process- ing Universal PERformance benchmark,”arXiv preprint:2105.01051, pp. 1–6, 2021

work page arXiv 2021
[26]

emotion2vec: Self-supervised pre-training for speech emotion repre- sentation,

Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion repre- sentation,” inProc. ACL, 2024, pp. 15 747–15 760

2024
[27]

Ex- HuBERT: Enhancing HuBERT through block extension and fine-tuning on 37 emotion datasets,

S. Amiriparian, F. Packa ´n, M. Gerczuk, and B. W. Schuller, “Ex- HuBERT: Enhancing HuBERT through block extension and fine-tuning on 37 emotion datasets,”arXiv preprint:2406.10275, pp. 1–5, 2024

work page arXiv 2024
[28]

A fine-tuned wav2vec 2.0/Hu- BERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,

Y . Wang, A. Boumadane, and A. Heba, “A fine-tuned wav2vec 2.0/Hu- BERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,”arXiv preprint:2111.02735, pp. 1– 7, 2021

work page arXiv 2021
[29]

Emotion recognition from speech using wav2vec 2.0 embeddings,

L. Pepino, P. Riera, and L. Ferrer, “Emotion recognition from speech using wav2vec 2.0 embeddings,”arXiv preprint:2104.03502, pp. 1–5, 2021

work page arXiv 2021
[30]

Explaining and harnessing adversarial examples,

I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” inProc. ICLR, 2015

2015
[31]

Towards deep learning models resistant to adversarial attacks,

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” inProc. ICLR, 2018

2018
[32]

Simple black-box adversarial attacks,

C. Guo, J. Gardner, Y . You, A. G. Wilson, and K. Weinberger, “Simple black-box adversarial attacks,” inProc. ICML, vol. 97, 2019, pp. 2484– 2493

2019
[33]

One pixel attack for fooling deep neural networks,

J. Su, D. V . Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,”IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828–841, 2019

2019
[34]

Universal adversarial perturbations for speech recogni- tion systems,

P. Neekhara, S. Hussain, P. Pandey, S. Dubnov, J. J. McAuley, and F. Koushanfar, “Universal adversarial perturbations for speech recogni- tion systems,” inProc. Interspeech, 2019, pp. 481–485

2019
[35]

Generating transferable adversarial examples for speech classification,

H. Kim, J. Park, and J. Lee, “Generating transferable adversarial examples for speech classification,”Pattern Recognition, vol. 137, p. 109286, 2023

2023
[36]

Attack on practical speaker verification system using universal adversarial perturbations,

W. Zhang, S. Zhao, L. Liu, J. Li, X. Cheng, T. F. Zheng, and X. Hu, “Attack on practical speaker verification system using universal adversarial perturbations,” inProc. ICASSP, 2021, pp. 2575–2579

2021
[37]

Generative adversarial networks,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020

2020
[38]

DiffAttack: Imperceptible and transfer- able audio adversarial attack via diffusion model,

J. Chen, Y . Dai, and F. Huang, “DiffAttack: Imperceptible and transfer- able audio adversarial attack via diffusion model,” inProc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2025, pp. 1–5

2025
[39]

Enabling fast and universal audio adversarial attack using generative model,

Y . Xie, Z. Li, C. Shi, J. Liu, Y . Chen, and B. Yuan, “Enabling fast and universal audio adversarial attack using generative model,” inProc. AAAI, 2021, pp. 14 129–14 137

2021
[40]

EvilHarmony: Stealthy adversarial attacks against black-box speech recognition systems,

X. Yuan, J. Zhang, F. Guo, K. Chen, X. Wang, S. Zhang, Y . Chen, D. Liu, P. Li, Z. Wang, and R. Zhu, “EvilHarmony: Stealthy adversarial attacks against black-box speech recognition systems,” inProc. IEEE Symp. Secur. Privacy (SP), 2025, pp. 4569–4587

2025
[41]

Explainable artificial in- telligence for medical applications: A review,

Q. Sun, A. Akman, and B. W. Schuller, “Explainable artificial in- telligence for medical applications: A review,”ACM Transactions on Computing for Healthcare, vol. 6, no. 2, pp. 1–31, 2025

2025
[42]

Learning important features through propagating activation differences,

A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” inProc. ICML, 2017, pp. 3145–3153

2017
[43]

Explainable ai for healthcare,

Y . Li, Q. Sun, A. Akman, and B. W. Schuller, “Explainable ai for healthcare,” inHandbook on Smart Health. SAGE Publications 1 Oliver’s Yard, 55 City Road, London, EC1Y 1SP, 2025, pp. 632–652

2025
[44]

Axiomatic attribution for deep networks,

M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” inProc. ICML, 2017, pp. 3319–3328

2017
[45]

Layer-wise relevance propagation: An overview,

G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. M ¨uller, “Layer-wise relevance propagation: An overview,” inExplainable AI: Interpreting, Explaining and Visualizing Deep Learning, ser. Lecture Notes in Computer Science. Springer, 2019, vol. 11700, pp. 193–209

2019
[46]

Grad-CAM: Visual explanations from deep networks via gradient-based localization,

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” inProc. ICCV, 2017, pp. 618–626

2017
[47]

“Why should I trust you?

M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why should I trust you?” explaining the predictions of any classifier,” inProc. KDD, 2016, pp. 1135–1144

2016
[48]

A unified approach to interpreting model predictions,

S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” inProc. NeurIPS, vol. 30, 2017

2017
[49]

Audio explainable artificial intelligence: A review,

A. Akman and B. W. Schuller, “Audio explainable artificial intelligence: A review,”Intelligent Computing, vol. 2, p. 0074, 2024

2024
[50]

audiolime: Listenable explanations using source separation,

V . Haunschmid, E. Manilow, and G. Widmer, “audiolime: Listenable explanations using source separation,”arXiv preprint:2008.00582, pp. 1–5, 2020

work page arXiv 2008
[51]

Unveiling hidden factors: Explainable AI for feature boosting in speech emotion recognition,

A. Nfissi, W. Bouachir, N. Bouguila, and B. Mishara, “Unveiling hidden factors: Explainable AI for feature boosting in speech emotion recognition,”arXiv preprint:2406.01624, pp. 1–36, 2024

work page arXiv 2024
[52]

Reliable evaluation of at- tribution maps in CNNs: A perturbation-based approach,

L. Nieradzik, H. Stephani, and J. Keuper, “Reliable evaluation of at- tribution maps in CNNs: A perturbation-based approach,”International Journal of Computer Vision, vol. 133, no. 5, pp. 2392–2409, 2025

2025
[53]

IEMOCAP: Interactive emotional dyadic motion capture database,

C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,”Language resources and evaluation, vol. 42, no. 4, pp. 335–359, 2008

2008
[54]

Self-attention transfer networks for speech emotion recognition,

Z. Zhao, K. Wang, Z. Bao, Z. Zhang, N. Cummins, S. Sun, H. Wang, J. Tao, and B. W. Schuller, “Self-attention transfer networks for speech emotion recognition,”Virtual Reality & Intelligent Hardware, vol. 3, no. 1, pp. 43–54, 2021

2021
[55]

Toronto emotional speech set (TESS),

M. K. Pichora-Fuller and K. Dupuis, “Toronto emotional speech set (TESS),” Scholars Portal Dataverse, 2020. [Online]. Available: https://doi.org/10.5683/SP2/E8H2MF

work page doi:10.5683/sp2/e8h2mf 2020
[56]

Speech emotion recognition using deep 1D & 2D CNN LSTM networks,

J. Zhao, X. Mao, and L. Chen, “Speech emotion recognition using deep 1D & 2D CNN LSTM networks,”Biomedical Signal Processing and Control, vol. 47, pp. 312–323, 2019

2019
[57]

End-to-end speech emotion recognition using deep neural networks,

P. Tzirakis, J. Zhang, and B. W. Schuller, “End-to-end speech emotion recognition using deep neural networks,” inProc. ICASSP, 2018, pp. 5089–5093

2018

[1] [1]

GatedxLSTM: A multimodal affective computing approach for emotion recognition in conversations,

Y . Li, Q. Sun, S. M. K. Murthy, E. Alturki, and B. W. Schuller, “GatedxLSTM: A multimodal affective computing approach for emotion recognition in conversations,”arXiv preprint:2503.20919, pp. 1–9, 2025

work page arXiv 2025

[2] [2]

Make patient consultation warmer: A clinical application for speech emotion recognition,

H.-C. Li, T. Pan, M.-H. Lee, and H.-W. Chiu, “Make patient consultation warmer: A clinical application for speech emotion recognition,”Applied Sciences, vol. 11, no. 11, 2021

2021

[3] [3]

Speech emotion recognition and serious games: An entertaining approach for crowdsourcing annotated samples,

L. Matsouliadis, E. Siamtanidou, N. Vryzas, and C. Dimoulas, “Speech emotion recognition and serious games: An entertaining approach for crowdsourcing annotated samples,”Information, vol. 16, no. 3, p. 238, 2025

2025

[4] [4]

The acoustically emotion-aware conversational agent with speech emotion recognition and empathetic responses,

J. Hu, Y . Huang, X. Hu, and Y . Xu, “The acoustically emotion-aware conversational agent with speech emotion recognition and empathetic responses,”IEEE Transactions on Affective Computing, vol. 14, no. 1, pp. 17–30, 2022

2022

[5] [5]

Towards friendly AI: A comprehensive review and new perspectives on human-AI alignment,

Q. Sun, Y . Li, E. Alturki, S. M. K. Murthy, and B. W. Schuller, “Towards friendly AI: A comprehensive review and new perspectives on human-AI alignment,”arXiv preprint:2412.15114, pp. 1–15, 2024

work page arXiv 2024

[6] [6]

Adversarial attacks and defenses in deep learning,

K. Ren, T. Zheng, Z. Qin, and X. Liu, “Adversarial attacks and defenses in deep learning,”Engineering, vol. 6, no. 3, pp. 346–360, 2020

2020

[7] [7]

Threat of adversarial attacks on deep learning in computer vision: A survey,

N. Akhtar and A. Mian, “Threat of adversarial attacks on deep learning in computer vision: A survey,”IEEE Access, vol. 6, pp. 14 410–14 430, 2018. JOURNAL OF LATEX CLASS FILES, VOL. 14, NO. 8, AUGUST 2021 12

2018

[8] [8]

Adversarial attacks on ASR systems: An overview,

X. Zhang, H. Tan, X. Huang, D. Zhang, K. Tang, and Z. Gu, “Adversarial attacks on ASR systems: An overview,”arXiv preprint:2208.02250, pp. 1–8, 2022

work page arXiv 2022

[9] [9]

Adversarial attacks and defenses in speaker recognition systems: A survey,

J. Lan, R. Zhang, Z. Yan, J. Wang, Y . Chen, and R. Hou, “Adversarial attacks and defenses in speaker recognition systems: A survey,”Journal of Systems Architecture, vol. 127, p. 102526, 2022

2022

[10] [10]

Generating and protecting against adversarial attacks for deep speech-based emotion recognition models,

Z. Ren, A. Baird, J. Han, Z. Zhang, and B. W. Schuller, “Generating and protecting against adversarial attacks for deep speech-based emotion recognition models,” inProc. ICASSP, 2020, pp. 7184–7188

2020

[11] [11]

A black-box adversarial attack strategy with adjustable sparsity and generalizability for deep image classifiers,

A. Ghosh, S. S. Mullick, S. Datta, S. Das, A. K. Das, and R. Mallipeddi, “A black-box adversarial attack strategy with adjustable sparsity and generalizability for deep image classifiers,”Pattern Recognition, vol. 122, p. 108279, 2022

2022

[12] [12]

Weighted-sampling audio adversarial example attack,

X. Liu, K. Wan, Y . Ding, X. Zhang, and Q. Zhu, “Weighted-sampling audio adversarial example attack,” inProc. AAAI, 2020, pp. 4908–4915

2020

[13] [13]

Audio injection adversarial example attack,

X. Liu, X. Chen, M. Yin, Y . Wang, T. Hu, and K. Ding, “Audio injection adversarial example attack,” inProc. ICML Workshop on Adversarial Machine Learning, 2021

2021

[14] [14]

STAA-net: A sparse and transferable adversarial attack for speech emotion recognition,

Y . Chang, Z. Ren, Z. Zhang, X. Jing, K. Qian, X. Shao, B. Hu, T. Schultz, and B. W. Schuller, “STAA-net: A sparse and transferable adversarial attack for speech emotion recognition,”IEEE Transactions on Affective Computing, 2024

2024

[15] [15]

Boosting adversarial attacks with momentum,

Y . Dong, F. Liao, T. Pang, H. Su, J. Zhu, X. Hu, and J. Li, “Boosting adversarial attacks with momentum,” inProc. CVPR, 2018, pp. 9185– 9193

2018

[16] [16]

Muting Whisper: A universal acoustic adversarial attack on speech foundation models,

V . Raina, R. Ma, C. McGhee, K. Knill, and M. Gales, “Muting Whisper: A universal acoustic adversarial attack on speech foundation models,” arXiv preprint:2405.06134, pp. 1–18, 2024

work page arXiv 2024

[17] [17]

Boosting the transferability of audio adversarial examples with acoustic representation optimization,

W. Jin, J. Su, H. Wang, Y . Ye, and J. Hao, “Boosting the transferability of audio adversarial examples with acoustic representation optimization,” arXiv preprint:2503.19591, pp. 1–16, 2025

work page arXiv 2025

[18] [18]

Focus-shifting attack: An adversarial attack that retains saliency map information and manipulates model explanations,

Q.-X. Huang, L.-K. Chiang, M.-Y . Chiu, and H.-M. Sun, “Focus-shifting attack: An adversarial attack that retains saliency map information and manipulates model explanations,”IEEE Transactions on Reliability, vol. 73, no. 2, pp. 808–819, 2023

2023

[19] [19]

Maximal Jacobian-based Saliency Map Attack

R. Wiyatno and A. Xu, “Maximal Jacobian-based saliency map attack,” arXiv preprint:1808.07945, pp. 1–5, 2018

work page internal anchor Pith review Pith/arXiv arXiv 2018

[20] [20]

Saliency attack: Towards imper- ceptible black-box adversarial attack,

Z. Dai, S. Liu, Q. Li, and K. Tang, “Saliency attack: Towards imper- ceptible black-box adversarial attack,”ACM Transactions on Intelligent Systems and Technology., vol. 14, no. 3, pp. 1–20, 2023

2023

[21] [21]

Hidden Markov model-based speech emotion recognition,

B. Schuller, G. Rigoll, and M. Lang, “Hidden Markov model-based speech emotion recognition,” inProc. ICASSP, vol. 2, 2003, pp. II–1

2003

[22] [22]

Audio explanation synthesis with generative foundation models,

A. Akman, Q. Sun, and B. W. Schuller, “Audio explanation synthesis with generative foundation models,” inProc. ICASSP, 2025, pp. 1–5

2025

[23] [23]

wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y . Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. KDDNeurIPS, vol. 33, 2020, pp. 12 449–12 460

2020

[24] [24]

Speech emotion diarization: Which emotion appears when?

Y . Wang, M. Ravanelli, A. Nfissi, and A. Yacoubi, “Speech emotion diarization: Which emotion appears when?”arXiv preprint:2306.12991, pp. 1–7, 2023

work page arXiv 2023

[25] [25]

SUPERB: Speech process- ing Universal PERformance benchmark,

S. w. Yang, P.-H. Chi, Y .-S. Chuang, C.-I. J. Lai, K. Lakhotia, Y .-T. Lin, A. T. Liu, J. Shi, X. Chang, G.-T. Linet al., “SUPERB: Speech process- ing Universal PERformance benchmark,”arXiv preprint:2105.01051, pp. 1–6, 2021

work page arXiv 2021

[26] [26]

emotion2vec: Self-supervised pre-training for speech emotion repre- sentation,

Z. Ma, Z. Zheng, J. Ye, J. Li, Z. Gao, S. Zhang, and X. Chen, “emotion2vec: Self-supervised pre-training for speech emotion repre- sentation,” inProc. ACL, 2024, pp. 15 747–15 760

2024

[27] [27]

Ex- HuBERT: Enhancing HuBERT through block extension and fine-tuning on 37 emotion datasets,

S. Amiriparian, F. Packa ´n, M. Gerczuk, and B. W. Schuller, “Ex- HuBERT: Enhancing HuBERT through block extension and fine-tuning on 37 emotion datasets,”arXiv preprint:2406.10275, pp. 1–5, 2024

work page arXiv 2024

[28] [28]

A fine-tuned wav2vec 2.0/Hu- BERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,

Y . Wang, A. Boumadane, and A. Heba, “A fine-tuned wav2vec 2.0/Hu- BERT benchmark for speech emotion recognition, speaker verification and spoken language understanding,”arXiv preprint:2111.02735, pp. 1– 7, 2021

work page arXiv 2021

[29] [29]

Emotion recognition from speech using wav2vec 2.0 embeddings,

L. Pepino, P. Riera, and L. Ferrer, “Emotion recognition from speech using wav2vec 2.0 embeddings,”arXiv preprint:2104.03502, pp. 1–5, 2021

work page arXiv 2021

[30] [30]

Explaining and harnessing adversarial examples,

I. J. Goodfellow, J. Shlens, and C. Szegedy, “Explaining and harnessing adversarial examples,” inProc. ICLR, 2015

2015

[31] [31]

Towards deep learning models resistant to adversarial attacks,

A. Madry, A. Makelov, L. Schmidt, D. Tsipras, and A. Vladu, “Towards deep learning models resistant to adversarial attacks,” inProc. ICLR, 2018

2018

[32] [32]

Simple black-box adversarial attacks,

C. Guo, J. Gardner, Y . You, A. G. Wilson, and K. Weinberger, “Simple black-box adversarial attacks,” inProc. ICML, vol. 97, 2019, pp. 2484– 2493

2019

[33] [33]

One pixel attack for fooling deep neural networks,

J. Su, D. V . Vargas, and K. Sakurai, “One pixel attack for fooling deep neural networks,”IEEE Transactions on Evolutionary Computation, vol. 23, no. 5, pp. 828–841, 2019

2019

[34] [34]

Universal adversarial perturbations for speech recogni- tion systems,

P. Neekhara, S. Hussain, P. Pandey, S. Dubnov, J. J. McAuley, and F. Koushanfar, “Universal adversarial perturbations for speech recogni- tion systems,” inProc. Interspeech, 2019, pp. 481–485

2019

[35] [35]

Generating transferable adversarial examples for speech classification,

H. Kim, J. Park, and J. Lee, “Generating transferable adversarial examples for speech classification,”Pattern Recognition, vol. 137, p. 109286, 2023

2023

[36] [36]

Attack on practical speaker verification system using universal adversarial perturbations,

W. Zhang, S. Zhao, L. Liu, J. Li, X. Cheng, T. F. Zheng, and X. Hu, “Attack on practical speaker verification system using universal adversarial perturbations,” inProc. ICASSP, 2021, pp. 2575–2579

2021

[37] [37]

Generative adversarial networks,

I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y . Bengio, “Generative adversarial networks,” Communications of the ACM, vol. 63, no. 11, pp. 139–144, 2020

2020

[38] [38]

DiffAttack: Imperceptible and transfer- able audio adversarial attack via diffusion model,

J. Chen, Y . Dai, and F. Huang, “DiffAttack: Imperceptible and transfer- able audio adversarial attack via diffusion model,” inProc. IEEE Int. Conf. Acoust., Speech Signal Process. (ICASSP), 2025, pp. 1–5

2025

[39] [39]

Enabling fast and universal audio adversarial attack using generative model,

Y . Xie, Z. Li, C. Shi, J. Liu, Y . Chen, and B. Yuan, “Enabling fast and universal audio adversarial attack using generative model,” inProc. AAAI, 2021, pp. 14 129–14 137

2021

[40] [40]

EvilHarmony: Stealthy adversarial attacks against black-box speech recognition systems,

X. Yuan, J. Zhang, F. Guo, K. Chen, X. Wang, S. Zhang, Y . Chen, D. Liu, P. Li, Z. Wang, and R. Zhu, “EvilHarmony: Stealthy adversarial attacks against black-box speech recognition systems,” inProc. IEEE Symp. Secur. Privacy (SP), 2025, pp. 4569–4587

2025

[41] [41]

Explainable artificial in- telligence for medical applications: A review,

Q. Sun, A. Akman, and B. W. Schuller, “Explainable artificial in- telligence for medical applications: A review,”ACM Transactions on Computing for Healthcare, vol. 6, no. 2, pp. 1–31, 2025

2025

[42] [42]

Learning important features through propagating activation differences,

A. Shrikumar, P. Greenside, and A. Kundaje, “Learning important features through propagating activation differences,” inProc. ICML, 2017, pp. 3145–3153

2017

[43] [43]

Explainable ai for healthcare,

Y . Li, Q. Sun, A. Akman, and B. W. Schuller, “Explainable ai for healthcare,” inHandbook on Smart Health. SAGE Publications 1 Oliver’s Yard, 55 City Road, London, EC1Y 1SP, 2025, pp. 632–652

2025

[44] [44]

Axiomatic attribution for deep networks,

M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” inProc. ICML, 2017, pp. 3319–3328

2017

[45] [45]

Layer-wise relevance propagation: An overview,

G. Montavon, A. Binder, S. Lapuschkin, W. Samek, and K.-R. M ¨uller, “Layer-wise relevance propagation: An overview,” inExplainable AI: Interpreting, Explaining and Visualizing Deep Learning, ser. Lecture Notes in Computer Science. Springer, 2019, vol. 11700, pp. 193–209

2019

[46] [46]

Grad-CAM: Visual explanations from deep networks via gradient-based localization,

R. R. Selvaraju, M. Cogswell, A. Das, R. Vedantam, D. Parikh, and D. Batra, “Grad-CAM: Visual explanations from deep networks via gradient-based localization,” inProc. ICCV, 2017, pp. 618–626

2017

[47] [47]

“Why should I trust you?

M. T. Ribeiro, S. Singh, and C. Guestrin, ““Why should I trust you?” explaining the predictions of any classifier,” inProc. KDD, 2016, pp. 1135–1144

2016

[48] [48]

A unified approach to interpreting model predictions,

S. M. Lundberg and S.-I. Lee, “A unified approach to interpreting model predictions,” inProc. NeurIPS, vol. 30, 2017

2017

[49] [49]

Audio explainable artificial intelligence: A review,

A. Akman and B. W. Schuller, “Audio explainable artificial intelligence: A review,”Intelligent Computing, vol. 2, p. 0074, 2024

2024

[50] [50]

audiolime: Listenable explanations using source separation,

V . Haunschmid, E. Manilow, and G. Widmer, “audiolime: Listenable explanations using source separation,”arXiv preprint:2008.00582, pp. 1–5, 2020

work page arXiv 2008

[51] [51]

Unveiling hidden factors: Explainable AI for feature boosting in speech emotion recognition,

A. Nfissi, W. Bouachir, N. Bouguila, and B. Mishara, “Unveiling hidden factors: Explainable AI for feature boosting in speech emotion recognition,”arXiv preprint:2406.01624, pp. 1–36, 2024

work page arXiv 2024

[52] [52]

Reliable evaluation of at- tribution maps in CNNs: A perturbation-based approach,

L. Nieradzik, H. Stephani, and J. Keuper, “Reliable evaluation of at- tribution maps in CNNs: A perturbation-based approach,”International Journal of Computer Vision, vol. 133, no. 5, pp. 2392–2409, 2025

2025

[53] [53]

IEMOCAP: Interactive emotional dyadic motion capture database,

C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, “IEMOCAP: Interactive emotional dyadic motion capture database,”Language resources and evaluation, vol. 42, no. 4, pp. 335–359, 2008

2008

[54] [54]

Self-attention transfer networks for speech emotion recognition,

Z. Zhao, K. Wang, Z. Bao, Z. Zhang, N. Cummins, S. Sun, H. Wang, J. Tao, and B. W. Schuller, “Self-attention transfer networks for speech emotion recognition,”Virtual Reality & Intelligent Hardware, vol. 3, no. 1, pp. 43–54, 2021

2021

[55] [55]

Toronto emotional speech set (TESS),

M. K. Pichora-Fuller and K. Dupuis, “Toronto emotional speech set (TESS),” Scholars Portal Dataverse, 2020. [Online]. Available: https://doi.org/10.5683/SP2/E8H2MF

work page doi:10.5683/sp2/e8h2mf 2020

[56] [56]

Speech emotion recognition using deep 1D & 2D CNN LSTM networks,

J. Zhao, X. Mao, and L. Chen, “Speech emotion recognition using deep 1D & 2D CNN LSTM networks,”Biomedical Signal Processing and Control, vol. 47, pp. 312–323, 2019

2019

[57] [57]

End-to-end speech emotion recognition using deep neural networks,

P. Tzirakis, J. Zhang, and B. W. Schuller, “End-to-end speech emotion recognition using deep neural networks,” inProc. ICASSP, 2018, pp. 5089–5093

2018