RedVox: Safety and Fairness Gaps in Speech Models Across Languages

Beatrice Savoldi; Luisa Bentivogli; Matteo Negri; Sara Papi; Wafa Aissa

arxiv: 2606.26968 · v1 · pith:LUF52NINnew · submitted 2026-06-25 · 💻 cs.CL

RedVox: Safety and Fairness Gaps in Speech Models Across Languages

Beatrice Savoldi , Sara Papi , Wafa Aissa , Matteo Negri , Luisa Bentivogli This is my paper

Pith reviewed 2026-06-26 04:34 UTC · model grok-4.3

classification 💻 cs.CL

keywords speech modelssafety evaluationfairnessmultilingual benchmarkaudio inputnon-English languagesrefusal behaviorstereotypical content

0 comments

The pith

Speech models show larger safety and fairness failures in non-English languages and with spoken inputs than with text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates RedVox, a benchmark of real-voice recordings that test speech models on unsafe and unfair stereotypical requests in five languages. It first checks how existing model papers report safety work and finds almost none cover languages other than English. When eight current models are run on the new benchmark, failures appear even without special attack prompts, rise sharply outside English, and rise further when the same request arrives as audio instead of text. The authors also record the extra privacy and consent hurdles that come with gathering spoken data from people. These patterns matter because speech models are moving into everyday use across many countries, where the same safety shortfalls could affect more users.

Core claim

RedVox is a multilingual safety and fairness benchmark for audio and speech built on real voices that covers unsafe and unfair stereotypical requests across English, French, Italian, Spanish, and German. Evaluation of eight state-of-the-art models shows vulnerabilities persist even under non-adversarial conditions, worsen in non-English languages, and are amplified when the request comes from a spoken input. A participant survey further documents the personal and privacy challenges of collecting speech data with human contributors.

What carries the argument

The RedVox benchmark, consisting of real-voice recordings of unsafe and unfair requests in five languages used to measure model refusal rates and bias under spoken versus text conditions.

If this is right

Safety testing for speech models must include non-English languages as a standard requirement rather than an optional add-on.
Spoken input should be treated as a higher-risk category in safety evaluations because it increases failure rates.
Model releases need to document multilingual safety performance, since current practice covers it in only 8 percent of cases.
Data collection methods for speech safety benchmarks must address participant privacy concerns to remain viable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training pipelines that rely mainly on English text may be leaving non-English speech safety unaddressed by default.
The same benchmark could be reused to track whether future model updates close the spoken-input gap over time.
Real-world deployment teams might need separate refusal policies for voice interfaces in languages where gaps are largest.

Load-bearing premise

The specific requests and recording conditions in RedVox match the unsafe and unfair content real users would actually send to these models in the five languages.

What would settle it

Running the same models on a fresh collection of unsafe requests drawn directly from real non-English users and finding no rise in failure rates relative to English or to text inputs would contradict the reported pattern.

Figures

Figures reproduced from arXiv: 2606.26968 by Beatrice Savoldi, Luisa Bentivogli, Matteo Negri, Sara Papi, Wafa Aissa.

**Figure 2.** Figure 2: REDVOX Framework. (Top) Benchmark properties and the two request types about safety and fairness. In Request Type I (Speech), harmful content is vocalized and accompanied by a textual follow-up request; in Request Type II (Audio), harmful content appears in text only, paired with a distracting audio signal. (Bottom) Evaluation workflow assessing model responses on an increasing severity scale. speech model… view at source ↗

**Figure 3.** Figure 3: Composition of REDVOX released data. English represents the largest portion (40%). Vulnerability types (middle ring) are distinguished across unsafe (69%) and stereotypical requests (31%). Each audio type and speech recordings amount to ∼25% each. Chi-squared tests on categorical evaluation labels (see evaluation in the upcoming Section 4.1) further confirm robustness, with Cramér’s V remaining negligibl… view at source ↗

**Figure 4.** Figure 4: Mutlilingual results. Ratios of responses by ■ safe-by-accident, ■ controversial, ■ unsafe. 0 5 10 15 20 Unsafe Controversial Safe by Accident 0 5 10 15 20 Unsafe Controversial Safe by Accident 0 5 10 15 20 Unsafe Controversial Safe by Accident % 0 5 10 15 20 Unsafe Controversial Safe by Accident [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Ratio of responses by fairness (•) and safety (■). Non-English statistics are solid symbols, while English are transparent symbols. GPT-realtime2 Gemini-Flash-Lite3.1 Gemini-Pro3.1 Qwen3-Omni Gemma4 Qwen2-Audio Phi4-Multi Voxtral text noise-a noise-b silence speech 3% 18% 16% 10% 10% 21% 11% 22% 2% 11% 16% 9% 17% 23% 28% 41% 2% 12% 14% 11% 14% 24% 29% 36% 3% 12% 16% 9% 19% 23% 30% 42% 10% 15% 19% 15% 21% 2… view at source ↗

**Figure 6.** Figure 6: Ratio of controversial and unsafe (C+U) responses across textual, audio and speech inputs. only one type of audio leads to an increase—we can conclude that the mere presence of audio input acts as a stressing factor for non-proprietary models, independent of semantic content. 6 Discussion Results on REDVOX confirm that safety and fairness vulnerabilities in speech models are a present concern—not only u… view at source ↗

**Figure 7.** Figure 7: Participant attitudes. (A) Mean comfort (1– 5) with creating and releasing generic or harmful content in text vs. voice. (B) Ratio of participants reporting feeling personally responsible in pronouncing harmful requests and concerned about voice identification. related research).15 Figure 7B sheds light on such rates, highlighting both psychological and ethical as well as privacy concerns. The majority of … view at source ↗

**Figure 8.** Figure 8: Prompt used for LLM-as-a-Judge evaluation with [PITH_FULL_IMAGE:figures/full_fig_p019_8.png] view at source ↗

**Figure 9.** Figure 9: Ratio of controversial and unsafe/unfair responses by audio type. Responses are provided by language [PITH_FULL_IMAGE:figures/full_fig_p020_9.png] view at source ↗

**Figure 10.** Figure 10: Ratio of participants’ responses to: Which vulnerability categories felt most uncomfortable to engage with? [PITH_FULL_IMAGE:figures/full_fig_p021_10.png] view at source ↗

**Figure 11.** Figure 11: Distribution of model responses for native vs. non-native English speakers across eight audiolanguage models (speech inputs only). Statistical testing was performed using one-sided Mann–Whitney U tests on the full ordinal outcome scale (safe related < safe unrelated < controversial < unsafe/unfair), testing the directional hypothesis that non-native speakers receive systematically worse outcomes than nat… view at source ↗

**Figure 12.** Figure 12: Distribution of model responses for men vs. women speakers across eight audio-language models (all langs, speech inputs only). Statistical testing was performed using two-sided Mann–Whitney U tests on the full ordinal outcome scale (safe related < safe unrelated < controversial < unsafe/unfair), testing whether any significant gendered difference exists. Effect size is reported as rank-biserial correlatio… view at source ↗

read the original abstract

Speech-capable models are increasingly deployed in real-world applications across languages. Yet their safety and fairness beyond English settings and under naturalistic conditions remain understudied. We survey safety reporting practices across state-of-the-art speech model releases, finding that only 8% document any multilingual analysis. To address this gap, we introduce RedVox, a multilingual safety and fairness benchmark for audio and speech built on real voices, covering unsafe and unfair stereotypical requests across five languages (English, French, Italian, Spanish, and German). Evaluating eight state-of-the-art models, we find that vulnerabilities persist even under non-adversarial conditions, worsen in non-English languages, and are amplified when the request comes from a spoken input. Finally, by surveying the participants who contributed to RedVox, we document the unique personal and privacy challenges of collecting speech data with human participants, pointing to broader sociotechnical challenges in naturalistic speech safety research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

RedVox adds a benchmark for speech safety across languages but needs better validation for the cross-language comparison.

read the letter

The one thing to know is that this paper introduces RedVox, a benchmark for safety and fairness in multilingual speech models, and finds that problems are worse outside English and with spoken requests.

They survey model releases and show the lack of multilingual analysis is widespread. Then they build this benchmark on real voices for five languages and test eight models. The participant survey at the end is a nice addition on the human side of collecting speech data for this purpose.

The work is solid in identifying the gap and providing an initial set of evaluations under non-adversarial conditions. It is new in creating this specific benchmark.

Where it is softer is in the methods for the benchmark. The claim about worsening in non-English languages requires that the requests are comparable in severity. The abstract says it covers unsafe and unfair stereotypical requests across the languages, but if the full paper does not include details on how they localized the content or validated the harm levels with native speakers, then the result could be influenced by differences in the prompts rather than the models. The stress-test note flags this exactly, and it seems like a real issue based on what's described.

For a reader working on model safety or multilingual systems, this would be worth looking at. It is coherent and engages with the literature on the gap in reporting. It deserves a serious referee to help improve the presentation of the methods.

I would recommend sending it to peer review.

Referee Report

2 major / 0 minor

Summary. The paper surveys safety and fairness reporting practices in state-of-the-art speech model releases (finding only 8% include multilingual analysis), introduces the RedVox benchmark of unsafe and unfair stereotypical requests across five languages (English, French, Italian, Spanish, German) using real voices, evaluates eight models under non-adversarial conditions, and reports that vulnerabilities persist, worsen in non-English languages, and are amplified with spoken inputs; it also discusses participant challenges in speech data collection.

Significance. If the benchmark requests are validated for equivalent severity across languages, the results would document an important and understudied gap in multilingual speech-model safety, directly addressing the low rate of multilingual analysis in model releases and highlighting risks for real-world spoken deployments.

major comments (2)

[Abstract / benchmark construction] Abstract and benchmark description: the central claim that vulnerabilities 'worsen in non-English languages' is load-bearing on the assumption that unsafe/unfair requests maintain matched harm levels and stereotypical severity across the five languages. The manuscript supplies no details on request selection criteria, localization/translation validation, native-speaker harm ratings, back-translation checks, or cultural adaptation; without these, observed language differences may be confounded by prompt difficulty rather than model behavior.
[Evaluation and results] Evaluation section: the abstract and results claim amplification under spoken inputs and persistence under non-adversarial conditions, yet no information is given on model versions, statistical methods, or how spoken vs. text inputs were controlled, preventing assessment of whether the data support the multilingual and modality claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and will revise the manuscript to incorporate additional details where the current version is insufficient.

read point-by-point responses

Referee: [Abstract / benchmark construction] Abstract and benchmark description: the central claim that vulnerabilities 'worsen in non-English languages' is load-bearing on the assumption that unsafe/unfair requests maintain matched harm levels and stereotypical severity across the five languages. The manuscript supplies no details on request selection criteria, localization/translation validation, native-speaker harm ratings, back-translation checks, or cultural adaptation; without these, observed language differences may be confounded by prompt difficulty rather than model behavior.

Authors: We agree that the manuscript lacks sufficient detail on benchmark construction to fully support the multilingual claims. In revision we will add a dedicated subsection describing request selection criteria, the translation and localization process, any back-translation or native-speaker validation steps performed, and steps taken to align stereotypical severity across languages. These additions will clarify the methodology and reduce the possibility of confounding by prompt difficulty. revision: yes
Referee: [Evaluation and results] Evaluation section: the abstract and results claim amplification under spoken inputs and persistence under non-adversarial conditions, yet no information is given on model versions, statistical methods, or how spoken vs. text inputs were controlled, preventing assessment of whether the data support the multilingual and modality claims.

Authors: We acknowledge that the evaluation section omits key methodological details. We will expand it to specify the exact model versions evaluated, the statistical tests and significance thresholds used for cross-language and cross-modality comparisons, and the precise controls applied to ensure spoken and text inputs were matched in content and delivery. These revisions will enable readers to assess the strength of the reported findings. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical benchmark evaluation is self-contained

full rationale

The paper introduces the RedVox benchmark and reports direct model evaluations across languages. No derivations, fitted parameters renamed as predictions, self-definitional quantities, or load-bearing self-citations appear in the provided text. The central claims rest on new data collection and model testing rather than reducing to prior inputs by construction. This matches the default expectation of non-circular empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Only the abstract is available, so no free parameters, axioms, or invented entities can be identified from the full text.

pith-pipeline@v0.9.1-grok · 5699 in / 1006 out tokens · 39960 ms · 2026-06-26T04:34:01.234296+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

157 extracted references · 22 canonical work pages

[1]

2024 , eprint =

Qwen2-Audio Technical Report , author =. 2024 , eprint =

2024
[3]

2025 , eprint =

Kimi-Audio Technical Report , author =. 2025 , eprint =

2025
[4]

2025 , eprint=

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation , author=. 2025 , eprint=

2025
[5]

2025 , eprint=

SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia , author=. 2025 , eprint=

2025
[7]

2025 , eprint=

Step-Audio 2 Technical Report , author=. 2025 , eprint=

2025
[8]

2026 , howpublished=

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music , author=. 2026 , howpublished=

2026
[11]

2026 , eprint=

Eureka-Audio: Triggering Audio Intelligence in Compact Language Models , author=. 2026 , eprint=

2026
[12]

2025 , eprint=

Fun-Audio-Chat Technical Report , author=. 2025 , eprint=

2025
[13]

ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

GLAP: General contrastive audio-text pretraining across domains and languages , author=. ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2026 , organization=

2026
[14]

Raon-Speech Technical Report , author =
[17]

2025 , eprint =

Ming-Omni: A Unified Multimodal Model for Perception and Generation , author =. 2025 , eprint =

2025
[20]

2025 , eprint=

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs , author=. 2025 , eprint=

2025
[23]

2026 , eprint=

LongCat-Next: Lexicalizing Modalities as Discrete Tokens , author=. 2026 , eprint=

2026
[24]

2026 , eprint=

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction , author=. 2026 , eprint=

2026
[25]

International Conference on Learning Representations , volume=

Multilingual jailbreak challenges in large language models , author=. International Conference on Learning Representations , volume=
[27]

2025 , eprint=

PRAC3 (Privacy, Reputation, Accountability, Consent, Credit, Compensation): Long Tailed Risks of Voice Actors in AI Data-Economy , author=. 2025 , eprint=

2025
[28]

Proceedings of the 2021 CHI conference on human factors in computing systems , pages=

The psychological well-being of content moderators: the emotional labor of commercial moderation and avenues for improving support , author=. Proceedings of the 2021 CHI conference on human factors in computing systems , pages=

2021
[30]

Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

Adversarial nibbler: An open red-teaming method for identifying diverse harms in text-to-image generation , author=. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

2024
[31]

Generative

Storchan, Victor and Kumar, Ravin and Chowdhury, Rumman and Goldfarb-Tarrant, Seraphina and Cattell, Sven , institution =. Generative. 2024 , url =

2024
[32]

2026 , eprint=

Do What I Say: A Spoken Prompt Dataset for Instruction-Following , author=. 2026 , eprint=

2026
[33]

0 of the AI Risk and Reliability Benchmark from MLCommons , author=

AILuminate: Introducing v1. 0 of the AI Risk and Reliability Benchmark from MLCommons , author=. arXiv e-prints , pages=
[34]

Advances in Neural Information Processing Systems , volume=

Best-of-n jailbreaking , author=. Advances in Neural Information Processing Systems , volume=
[35]

, author=

The effects of stereotype activation on behavior: a review of possible mechanisms. , author=. Psychological bulletin , volume=. 2001 , publisher=

2001
[36]

Plos one , volume=

Exposure to stereotype-relevant stories shapes children’s implicit gender stereotypes , author=. Plos one , volume=. 2022 , publisher=

2022
[37]

2020 , publisher=

The psychology of prejudice: From attitudes to social action , author=. 2020 , publisher=

2020
[38]

2022 , eprint=

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned , author=. 2022 , eprint=

2022
[39]

2025 , eprint=

Qwen3Guard Technical Report , author=. 2025 , eprint=

2025
[40]

2025 , eprint=

MSTS: A Multimodal Safety Test Suite for Vision-Language Models , author=. 2025 , eprint=

2025
[41]

International Conference on Machine Learning , pages=

Robust Speech Recognition via Large-Scale Weak Supervision , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023
[43]

CoRR , volume =

Irene Solaiman and Christy Dennison , title =. CoRR , volume =. 2021 , url =. 2106.10328 , timestamp =

arXiv 2021
[44]

IEEE Transactions on Emerging Topics in Computational Intelligence , year=

An investigation into value misalignment in LLM-generated texts for cultural heritage , author=. IEEE Transactions on Emerging Topics in Computational Intelligence , year=
[45]

ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Summary on the multilingual conversational speech language model challenge: Datasets, tasks, baselines, and methods , author=. ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2026 , organization=

2026
[46]

2025 , eprint=

Attention to Non-Adopters , author=. 2025 , eprint=

2025
[47]

Digital , volume=

Adapting voice assistant technology for older adults: a comprehensive study on usability, learning patterns, and acceptance , author=. Digital , volume=. 2025 , publisher=

2025
[49]

Transactions of the Association for Computational Linguistics , volume=

Spirit-lm: Interleaved spoken and written language model , author=. Transactions of the Association for Computational Linguistics , volume=. 2025 , publisher=

2025
[50]

Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=

Speech AI for all: Promoting accessibility, fairness, inclusivity, and equity , author=. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=
[52]

Transactions of the Association for Computational Linguistics , volume=

Voicebench: Benchmarking llm-based voice assistants , author=. Transactions of the Association for Computational Linguistics , volume=. 2026 , publisher=

2026
[53]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Sea: Low-resource safety alignment for multimodal large language models via synthetic embeddings , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=
[54]

Proceedings of the 14th International Conference on Learning Representations , year=

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models , author=. Proceedings of the 14th International Conference on Learning Representations , year=
[55]

2025 , eprint=

International AI Safety Report , author=. 2025 , eprint=

2025
[56]

2025 , eprint=

Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models , author=. 2025 , eprint=

2025
[57]

2025 , eprint=

Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models , author=. 2025 , eprint=

2025
[58]

2026 , eprint=

VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech , author=. 2026 , eprint=

2026
[59]

2025 , eprint=

AHELM: A Holistic Evaluation of Audio-Language Models , author=. 2025 , eprint=

2025
[60]

Companion publication of the 2024 conference on computer-supported cooperative work and social computing , pages=

The human factor in ai red teaming: Perspectives from social and collaborative computing , author=. Companion publication of the 2024 conference on computer-supported cooperative work and social computing , pages=

2024
[62]

2026 , eprint=

VoxSafeBench: Not Just What Is Said, but Who, How, and Where , author=. 2026 , eprint=

2026
[63]

2026 , eprint=

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models , author=. 2026 , eprint=

2026
[64]

2026 , booktitle=

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models , author=. 2026 , booktitle=

2026
[66]

International journal of speech technology , volume=

Voice user interfaces in manufacturing logistics: a literature review , author=. International journal of speech technology , volume=. 2023 , publisher=

2023
[67]

Biometrics , pages=

An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers , author=. Biometrics , pages=. 1977 , publisher=

1977
[68]

Educational and psychological measurement , volume=

A coefficient of agreement for nominal scales , author=. Educational and psychological measurement , volume=. 1960 , publisher=

1960
[69]

British Journal of Mathematical and Statistical Psychology , volume=

Computing inter-rater reliability and its variance in the presence of high agreement , author=. British Journal of Mathematical and Statistical Psychology , volume=. 2008 , publisher=

2008
[70]

Nature , volume=

Joint speech and text machine translation for up to 100 languages , author=. Nature , volume=. 2025 , publisher=. doi:10.1038/s41586-025-08655-2 , url=

work page doi:10.1038/s41586-025-08655-2 2025
[71]

2025 , eprint=

Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST , author=. 2025 , eprint=

2025
[72]

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=

Reproducing whisper-style training using an open-source toolkit and publicly available data , author=. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=. 2023 , organization=

2023
[73]

2026 , eprint=

Evaluation of Audio Language Models for Fairness, Safety, and Security , author=. 2026 , eprint=

2026
[74]

GitHub repository , howpublished =

Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier , year =. GitHub repository , howpublished =
[76]

Phi-4-mini technical report:

Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others , journal=. Phi-4-mini technical report:. 2025 , url=

2025
[78]

2025 , eprint=

Qwen3-Omni Technical Report , author=. 2025 , eprint=

2025
[79]

doi:10.57967/hf/8653 , publisher =

Julian Mack and Ekagra Ranjan and Walter Beller-Morales and Bharat Venkitesh and Pierre Richemond , title =. doi:10.57967/hf/8653 , publisher =

work page doi:10.57967/hf/8653
[80]

2025 , eprint=

Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities , author=. 2025 , eprint=

2025
[81]

Kshitij Ambilduke and Ben Peters and Sonal Sannigrahi and Anil Keshwani and Tsz Kin Lam and Bruno Martins and André F. T. Martins and Marcely Zanon Boito , year=. From. 2503.10620 , archiveprefix=

arXiv
[82]

2026 , eprint=

VIBEVOICE-ASR Technical Report , author=. 2026 , eprint=

2026
[83]

2024 , url=

Changli Tang and Wenyi Yu and Guangzhi Sun and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Chao Zhang , booktitle=. 2024 , url=

2024
[85]

2026 , eprint=

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs , author=. 2026 , eprint=

2026
[87]

Inclusion AI. 2025. https://arxiv.org/abs/2506.09344 Ming-omni: A unified multimodal model for perception and generation . Preprint, arXiv:2506.09344

arXiv 2025
[88]

Ranya Aloufi, Srishti Gupta, Soumya Shaw, Battista Biggio, and Lea Schönherr. 2026. https://arxiv.org/abs/2603.13262 Evaluation of audio language models for fairness, safety, and security . Preprint, arXiv:2603.13262

arXiv 2026
[89]

Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, Hoda Heidari, Anson Ho, Sayash Kapoor, Leila Khalatbari, Shayne Longpre, Sam Manning, Vasilios Mavroudis, Mantas Mazeika, Julian Michael, and 77 others. 2025. https://arxiv.org/abs/2501.17805 Intern...

arXiv 2025
[90]

Katharina Block, Antonya Marie Gonzalez, Clement JX Choi, Zoey C Wong, Toni Schmader, and Andrew Scott Baron. 2022. Exposure to stereotype-relevant stories shapes children’s implicit gender stereotypes. Plos one, 17(8):e0271396

2022
[91]

Fan Bu, Zheng Wang, Siyi Wang, and Ziyao Liu. 2025. An investigation into value misalignment in llm-generated texts for cultural heritage. IEEE Transactions on Emerging Topics in Computational Intelligence

2025
[92]

Abu-Ghazaleh, M

Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B. Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit Roy-Chowdhury, and Chengyu Song. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.574 Can textual unlearning solve cross-modality safety alignment? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9830--9844, Miami, Flo...

work page doi:10.18653/v1/2024.findings-emnlp.574 2024
[93]

Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, and Jingren Zhou. 2025. https://arxiv.org/abs/2512.20156 Fun-audio-chat technical report

arXiv 2025
[94]

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. 2026 a . Voicebench: Benchmarking llm-based voice assistants. Transactions of the Association for Computational Linguistics, 14:378--398

2026
[95]

Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, and Adel Bibi. 2026 b . The alignment curse: Cross-modality jailbreak transfer in omni-models. arXiv preprint arXiv:2602.02557

Pith/arXiv arXiv 2026
[96]

Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, and Renjing Xu. 2026. https://arxiv.org/abs/2501.13772 Jailbreak-audiobench: In-depth evaluation and analysis of jailbreak threats for large audio language models . In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

arXiv 2026
[98]

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, and 1 others. 2024 b . https://arxiv.org/abs/2407.10759 Qwen2-audio technical report . arXiv preprint arXiv:2407.10759

Pith/arXiv arXiv 2024
[99]

Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, and 17 others. 2026. https://arxiv.org/abs/2604.27393 Minicpm-o 4.5: Towards real-time full-duplex omni-modal interaction

Pith/arXiv arXiv 2026
[100]

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2024. Multilingual jailbreak challenges in large language models. In International Conference on Learning Representations, volume 2024, pages 24634--24651

2024
[101]

Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, and Jian Luan. 2026. Glap: General contrastive audio-text pretraining across domains and languages. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 14737--14741. IEEE

2026
[102]

Amirbek Djanibekov, Nurdaulet Mukhituly, Kentaro Inui, Hanan Aldarmaki, and Nils Lukas. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.734 SPIRIT : Patching speech language models against jailbreak attacks . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14503--14520, Suzhou, China. Association for Comp...

work page doi:10.18653/v1/2025.emnlp-main.734 2025

Showing first 80 references.

[1] [1]

2024 , eprint =

Qwen2-Audio Technical Report , author =. 2024 , eprint =

2024

[2] [3]

2025 , eprint =

Kimi-Audio Technical Report , author =. 2025 , eprint =

2025

[3] [4]

2025 , eprint=

Ming-UniAudio: Speech LLM for Joint Understanding, Generation and Editing with Unified Representation , author=. 2025 , eprint=

2025

[4] [5]

2025 , eprint=

SeaLLMs-Audio: Large Audio-Language Models for Southeast Asia , author=. 2025 , eprint=

2025

[5] [7]

2025 , eprint=

Step-Audio 2 Technical Report , author=. 2025 , eprint=

2025

[6] [8]

2026 , howpublished=

Audio Flamingo Next: Next-Generation Open Audio-Language Models for Speech, Sound, and Music , author=. 2026 , howpublished=

2026

[7] [11]

2026 , eprint=

Eureka-Audio: Triggering Audio Intelligence in Compact Language Models , author=. 2026 , eprint=

2026

[8] [12]

2025 , eprint=

Fun-Audio-Chat Technical Report , author=. 2025 , eprint=

2025

[9] [13]

ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

GLAP: General contrastive audio-text pretraining across domains and languages , author=. ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2026 , organization=

2026

[10] [14]

Raon-Speech Technical Report , author =

[11] [17]

2025 , eprint =

Ming-Omni: A Unified Multimodal Model for Perception and Generation , author =. 2025 , eprint =

2025

[12] [20]

2025 , eprint=

Phi-4-Mini Technical Report: Compact yet Powerful Multimodal Language Models via Mixture-of-LoRAs , author=. 2025 , eprint=

2025

[13] [23]

2026 , eprint=

LongCat-Next: Lexicalizing Modalities as Discrete Tokens , author=. 2026 , eprint=

2026

[14] [24]

2026 , eprint=

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction , author=. 2026 , eprint=

2026

[15] [25]

International Conference on Learning Representations , volume=

Multilingual jailbreak challenges in large language models , author=. International Conference on Learning Representations , volume=

[16] [27]

2025 , eprint=

PRAC3 (Privacy, Reputation, Accountability, Consent, Credit, Compensation): Long Tailed Risks of Voice Actors in AI Data-Economy , author=. 2025 , eprint=

2025

[17] [28]

Proceedings of the 2021 CHI conference on human factors in computing systems , pages=

The psychological well-being of content moderators: the emotional labor of commercial moderation and avenues for improving support , author=. Proceedings of the 2021 CHI conference on human factors in computing systems , pages=

2021

[18] [30]

Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

Adversarial nibbler: An open red-teaming method for identifying diverse harms in text-to-image generation , author=. Proceedings of the 2024 ACM Conference on Fairness, Accountability, and Transparency , pages=

2024

[19] [31]

Generative

Storchan, Victor and Kumar, Ravin and Chowdhury, Rumman and Goldfarb-Tarrant, Seraphina and Cattell, Sven , institution =. Generative. 2024 , url =

2024

[20] [32]

2026 , eprint=

Do What I Say: A Spoken Prompt Dataset for Instruction-Following , author=. 2026 , eprint=

2026

[21] [33]

0 of the AI Risk and Reliability Benchmark from MLCommons , author=

AILuminate: Introducing v1. 0 of the AI Risk and Reliability Benchmark from MLCommons , author=. arXiv e-prints , pages=

[22] [34]

Advances in Neural Information Processing Systems , volume=

Best-of-n jailbreaking , author=. Advances in Neural Information Processing Systems , volume=

[23] [35]

, author=

The effects of stereotype activation on behavior: a review of possible mechanisms. , author=. Psychological bulletin , volume=. 2001 , publisher=

2001

[24] [36]

Plos one , volume=

Exposure to stereotype-relevant stories shapes children’s implicit gender stereotypes , author=. Plos one , volume=. 2022 , publisher=

2022

[25] [37]

2020 , publisher=

The psychology of prejudice: From attitudes to social action , author=. 2020 , publisher=

2020

[26] [38]

2022 , eprint=

Red Teaming Language Models to Reduce Harms: Methods, Scaling Behaviors, and Lessons Learned , author=. 2022 , eprint=

2022

[27] [39]

2025 , eprint=

Qwen3Guard Technical Report , author=. 2025 , eprint=

2025

[28] [40]

2025 , eprint=

MSTS: A Multimodal Safety Test Suite for Vision-Language Models , author=. 2025 , eprint=

2025

[29] [41]

International Conference on Machine Learning , pages=

Robust Speech Recognition via Large-Scale Weak Supervision , author=. International Conference on Machine Learning , pages=. 2023 , organization=

2023

[30] [43]

CoRR , volume =

Irene Solaiman and Christy Dennison , title =. CoRR , volume =. 2021 , url =. 2106.10328 , timestamp =

arXiv 2021

[31] [44]

IEEE Transactions on Emerging Topics in Computational Intelligence , year=

An investigation into value misalignment in LLM-generated texts for cultural heritage , author=. IEEE Transactions on Emerging Topics in Computational Intelligence , year=

[32] [45]

ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Summary on the multilingual conversational speech language model challenge: Datasets, tasks, baselines, and methods , author=. ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2026 , organization=

2026

[33] [46]

2025 , eprint=

Attention to Non-Adopters , author=. 2025 , eprint=

2025

[34] [47]

Digital , volume=

Adapting voice assistant technology for older adults: a comprehensive study on usability, learning patterns, and acceptance , author=. Digital , volume=. 2025 , publisher=

2025

[35] [49]

Transactions of the Association for Computational Linguistics , volume=

Spirit-lm: Interleaved spoken and written language model , author=. Transactions of the Association for Computational Linguistics , volume=. 2025 , publisher=

2025

[36] [50]

Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=

Speech AI for all: Promoting accessibility, fairness, inclusivity, and equity , author=. Proceedings of the Extended Abstracts of the CHI Conference on Human Factors in Computing Systems , pages=

[37] [52]

Transactions of the Association for Computational Linguistics , volume=

Voicebench: Benchmarking llm-based voice assistants , author=. Transactions of the Association for Computational Linguistics , volume=. 2026 , publisher=

2026

[38] [53]

Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

Sea: Low-resource safety alignment for multimodal large language models via synthetic embeddings , author=. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages=

[39] [54]

Proceedings of the 14th International Conference on Learning Representations , year=

AudioTrust: Benchmarking the Multifaceted Trustworthiness of Audio Large Language Models , author=. Proceedings of the 14th International Conference on Learning Representations , year=

[40] [55]

2025 , eprint=

International AI Safety Report , author=. 2025 , eprint=

2025

[41] [56]

2025 , eprint=

Omni-SafetyBench: A Benchmark for Safety Evaluation of Audio-Visual Large Language Models , author=. 2025 , eprint=

2025

[42] [57]

2025 , eprint=

Audio Jailbreak: An Open Comprehensive Benchmark for Jailbreaking Large Audio-Language Models , author=. 2025 , eprint=

2025

[43] [58]

2026 , eprint=

VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech , author=. 2026 , eprint=

2026

[44] [59]

2025 , eprint=

AHELM: A Holistic Evaluation of Audio-Language Models , author=. 2025 , eprint=

2025

[45] [60]

Companion publication of the 2024 conference on computer-supported cooperative work and social computing , pages=

The human factor in ai red teaming: Perspectives from social and collaborative computing , author=. Companion publication of the 2024 conference on computer-supported cooperative work and social computing , pages=

2024

[46] [62]

2026 , eprint=

VoxSafeBench: Not Just What Is Said, but Who, How, and Where , author=. 2026 , eprint=

2026

[47] [63]

2026 , eprint=

JALMBench: Benchmarking Jailbreak Vulnerabilities in Audio Language Models , author=. 2026 , eprint=

2026

[48] [64]

2026 , booktitle=

Jailbreak-AudioBench: In-Depth Evaluation and Analysis of Jailbreak Threats for Large Audio Language Models , author=. 2026 , booktitle=

2026

[49] [66]

International journal of speech technology , volume=

Voice user interfaces in manufacturing logistics: a literature review , author=. International journal of speech technology , volume=. 2023 , publisher=

2023

[50] [67]

Biometrics , pages=

An application of hierarchical kappa-type statistics in the assessment of majority agreement among multiple observers , author=. Biometrics , pages=. 1977 , publisher=

1977

[51] [68]

Educational and psychological measurement , volume=

A coefficient of agreement for nominal scales , author=. Educational and psychological measurement , volume=. 1960 , publisher=

1960

[52] [69]

British Journal of Mathematical and Statistical Psychology , volume=

Computing inter-rater reliability and its variance in the presence of high agreement , author=. British Journal of Mathematical and Statistical Psychology , volume=. 2008 , publisher=

2008

[53] [70]

Nature , volume=

Joint speech and text machine translation for up to 100 languages , author=. Nature , volume=. 2025 , publisher=. doi:10.1038/s41586-025-08655-2 , url=

work page doi:10.1038/s41586-025-08655-2 2025

[54] [71]

2025 , eprint=

Canary-1B-v2 & Parakeet-TDT-0.6B-v3: Efficient and High-Performance Models for Multilingual ASR and AST , author=. 2025 , eprint=

2025

[55] [72]

2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=

Reproducing whisper-style training using an open-source toolkit and publicly available data , author=. 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU) , pages=. 2023 , organization=

2023

[56] [73]

2026 , eprint=

Evaluation of Audio Language Models for Fairness, Safety, and Security , author=. 2026 , eprint=

2026

[57] [74]

GitHub repository , howpublished =

Silero VAD: pre-trained enterprise-grade Voice Activity Detector (VAD), Number Detector and Language Classifier , year =. GitHub repository , howpublished =

[58] [76]

Phi-4-mini technical report:

Abouelenin, Abdelrahman and Ashfaq, Atabak and Atkinson, Adam and Awadalla, Hany and Bach, Nguyen and Bao, Jianmin and Benhaim, Alon and Cai, Martin and Chaudhary, Vishrav and Chen, Congcong and others , journal=. Phi-4-mini technical report:. 2025 , url=

2025

[59] [78]

2025 , eprint=

Qwen3-Omni Technical Report , author=. 2025 , eprint=

2025

[60] [79]

doi:10.57967/hf/8653 , publisher =

Julian Mack and Ekagra Ranjan and Walter Beller-Morales and Bharat Venkitesh and Pierre Richemond , title =. doi:10.57967/hf/8653 , publisher =

work page doi:10.57967/hf/8653

[61] [80]

2025 , eprint=

Granite-speech: open-source speech-aware LLMs with strong English ASR capabilities , author=. 2025 , eprint=

2025

[62] [81]

Kshitij Ambilduke and Ben Peters and Sonal Sannigrahi and Anil Keshwani and Tsz Kin Lam and Bruno Martins and André F. T. Martins and Marcely Zanon Boito , year=. From. 2503.10620 , archiveprefix=

arXiv

[63] [82]

2026 , eprint=

VIBEVOICE-ASR Technical Report , author=. 2026 , eprint=

2026

[64] [83]

2024 , url=

Changli Tang and Wenyi Yu and Guangzhi Sun and Xianzhao Chen and Tian Tan and Wei Li and Lu Lu and Zejun MA and Chao Zhang , booktitle=. 2024 , url=

2024

[65] [85]

2026 , eprint=

Hearing to Translate: The Effectiveness of Speech Modality Integration into LLMs , author=. 2026 , eprint=

2026

[66] [87]

Inclusion AI. 2025. https://arxiv.org/abs/2506.09344 Ming-omni: A unified multimodal model for perception and generation . Preprint, arXiv:2506.09344

arXiv 2025

[67] [88]

Ranya Aloufi, Srishti Gupta, Soumya Shaw, Battista Biggio, and Lea Schönherr. 2026. https://arxiv.org/abs/2603.13262 Evaluation of audio language models for fairness, safety, and security . Preprint, arXiv:2603.13262

arXiv 2026

[68] [89]

Yoshua Bengio, Sören Mindermann, Daniel Privitera, Tamay Besiroglu, Rishi Bommasani, Stephen Casper, Yejin Choi, Philip Fox, Ben Garfinkel, Danielle Goldfarb, Hoda Heidari, Anson Ho, Sayash Kapoor, Leila Khalatbari, Shayne Longpre, Sam Manning, Vasilios Mavroudis, Mantas Mazeika, Julian Michael, and 77 others. 2025. https://arxiv.org/abs/2501.17805 Intern...

arXiv 2025

[69] [90]

Katharina Block, Antonya Marie Gonzalez, Clement JX Choi, Zoey C Wong, Toni Schmader, and Andrew Scott Baron. 2022. Exposure to stereotype-relevant stories shapes children’s implicit gender stereotypes. Plos one, 17(8):e0271396

2022

[70] [91]

Fan Bu, Zheng Wang, Siyi Wang, and Ziyao Liu. 2025. An investigation into value misalignment in llm-generated texts for cultural heritage. IEEE Transactions on Emerging Topics in Computational Intelligence

2025

[71] [92]

Abu-Ghazaleh, M

Trishna Chakraborty, Erfan Shayegani, Zikui Cai, Nael B. Abu-Ghazaleh, M. Salman Asif, Yue Dong, Amit Roy-Chowdhury, and Chengyu Song. 2024. https://doi.org/10.18653/v1/2024.findings-emnlp.574 Can textual unlearning solve cross-modality safety alignment? In Findings of the Association for Computational Linguistics: EMNLP 2024, pages 9830--9844, Miami, Flo...

work page doi:10.18653/v1/2024.findings-emnlp.574 2024

[72] [93]

Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, and Jingren Zhou. 2025. https://arxiv.org/abs/2512.20156 Fun-audio-chat technical report

arXiv 2025

[73] [94]

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T Tan, and Haizhou Li. 2026 a . Voicebench: Benchmarking llm-based voice assistants. Transactions of the Association for Computational Linguistics, 14:378--398

2026

[74] [95]

Yupeng Chen, Junchi Yu, Aoxi Liu, Philip Torr, and Adel Bibi. 2026 b . The alignment curse: Cross-modality jailbreak transfer in omni-models. arXiv preprint arXiv:2602.02557

Pith/arXiv arXiv 2026

[75] [96]

Hao Cheng, Erjia Xiao, Jing Shao, Yichi Wang, Le Yang, Chao Shen, Philip Torr, Jindong Gu, and Renjing Xu. 2026. https://arxiv.org/abs/2501.13772 Jailbreak-audiobench: In-depth evaluation and analysis of jailbreak threats for large audio language models . In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and Benchmarks Track

arXiv 2026

[76] [98]

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, and 1 others. 2024 b . https://arxiv.org/abs/2407.10759 Qwen2-audio technical report . arXiv preprint arXiv:2407.10759

Pith/arXiv arXiv 2024

[77] [99]

Junbo Cui, Bokai Xu, Chongyi Wang, Tianyu Yu, Weiyue Sun, Yingjing Xu, Tianran Wang, Zhihui He, Wenshuo Ma, Tianchi Cai, Jiancheng Gui, Luoyuan Zhang, Xian Sun, Fuwei Huang, Moye Chen, Zhuo Lin, Hanyu Liu, Qingxin Gui, Qingzhe Han, and 17 others. 2026. https://arxiv.org/abs/2604.27393 Minicpm-o 4.5: Towards real-time full-duplex omni-modal interaction

Pith/arXiv arXiv 2026

[78] [100]

Yue Deng, Wenxuan Zhang, Sinno Jialin Pan, and Lidong Bing. 2024. Multilingual jailbreak challenges in large language models. In International Conference on Learning Representations, volume 2024, pages 24634--24651

2024

[79] [101]

Heinrich Dinkel, Zhiyong Yan, Tianzi Wang, Yongqing Wang, Xingwei Sun, Yadong Niu, Jizhong Liu, Gang Li, Junbo Zhang, and Jian Luan. 2026. Glap: General contrastive audio-text pretraining across domains and languages. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 14737--14741. IEEE

2026

[80] [102]

Amirbek Djanibekov, Nurdaulet Mukhituly, Kentaro Inui, Hanan Aldarmaki, and Nils Lukas. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.734 SPIRIT : Patching speech language models against jailbreak attacks . In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 14503--14520, Suzhou, China. Association for Comp...

work page doi:10.18653/v1/2025.emnlp-main.734 2025