arxiv: 2410.17196 · v3 · pith:XDRDHMN3new · submitted 2024-10-22 · 💻 cs.CL · cs.AI · cs.SD · eess.AS

VoiceBench: Benchmarking LLM-Based Voice Assistants

Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, Haizhou Li

Pith reviewed 2026-05-17 00:44 UTC · model grok-4.3

keywords: VoiceBench, LLM-based voice assistants, benchmark, spoken instructions, speaker variations, environmental factors, content factors, real-world evaluation

The pith

VoiceBench introduces the first benchmark to evaluate LLM-based voice assistants under real-world variations in speakers, environments, and content.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces VoiceBench to address the lack of suitable tests for LLM-based voice assistants beyond clean speech and basic ASR evaluations. It supplies both real and synthetic spoken instructions that embed three key real-world variations: speaker characteristics, environmental factors, and content factors. A sympathetic reader would care because current methods overlook these complexities, which occur routinely in actual use and can expose performance gaps that clean-speech tests miss. By running experiments on this benchmark, the work shows limitations in existing models and points toward targeted improvements in speech interaction systems.

Core claim

We introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.

What carries the argument

VoiceBench, a benchmark that supplies real and synthetic spoken instructions incorporating variations in speaker characteristics, environmental factors, and content factors to test LLM voice assistants beyond clean-speech conditions.

If this is right

Models that succeed on clean speech will still face measurable drops when speaker accents, background noise, or complex instructions are introduced.
Development efforts can now target the specific variation types where current assistants perform worst.
Synthetic data generation within VoiceBench offers a scalable way to expand test coverage without collecting more real recordings.
Insights from the benchmark can directly inform the design of more robust real-time speech interaction pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

VoiceBench could be extended with additional languages or device-specific distortions to test broader deployment conditions.
Systematic comparison of results across multiple base LLMs might reveal which underlying architectures handle acoustic variation more gracefully.
Widespread use of this benchmark could shift evaluation norms away from text-only proxies toward end-to-end spoken interaction testing.

Load-bearing premise

The chosen variations in speaker characteristics, environmental factors, and content factors adequately represent the intricate real-world scenarios that current evaluations neglect.

What would settle it

Run the same set of current LLM voice models on both VoiceBench and standard clean-speech benchmarks; if performance gaps are negligible or if models improved after exposure to VoiceBench variations show no measurable gain on new varied instructions, the benchmark's added value would be cast in doubt.
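
The comparison described above can be sketched in code. This is a minimal illustration, not the paper's evaluation script. It assumes each model has already been scored per item (1 for correct, 0 for incorrect) on the same instructions under a clean condition and a varied condition, and it estimates the accuracy gap with a paired bootstrap interval. The function names, the bootstrap settings, and the sample flags are all hypothetical.

```python
import random


def accuracy(flags):
    """Fraction of items answered correctly (flags are 0 or 1)."""
    return sum(flags) / len(flags)


def paired_gap_ci(clean, varied, n_boot=2000, alpha=0.05, seed=0):
    """Return (gap, lo, hi) for accuracy(clean) - accuracy(varied).

    The interval is a percentile bootstrap over paired items, so the same
    resampled instruction indices are used for both conditions.
    """
    if len(clean) != len(varied):
        raise ValueError("clean and varied must score the same items")
    rng = random.Random(seed)
    n = len(clean)
    gaps = []
    for _ in range(n_boot):
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(clean[i] - varied[i] for i in idx) / n)
    gaps.sort()
    lo = gaps[int((alpha / 2) * n_boot)]
    hi = gaps[int((1 - alpha / 2) * n_boot) - 1]
    return accuracy(clean) - accuracy(varied), lo, hi


# Made-up per-item correctness flags, for illustration only.
clean_flags = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
varied_flags = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]
gap, lo, hi = paired_gap_ci(clean_flags, varied_flags)
print(f"accuracy gap = {gap:.2f}, 95% interval = [{lo:.2f}, {hi:.2f}]")
```

An interval that stays clearly above zero would point to a real drop under the varied condition, while one that straddles zero would be consistent with the "negligible gap" outcome named above.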

Original abstract

Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered progress of LLM-based voice assistants development. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speeches, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics, environmental and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VoiceBench is a practical first benchmark for LLM voice assistants with real and synthetic data under speaker, environment, and content variations, but the axes lack clear grounding in real failure data.

VoiceBench is the first benchmark built specifically to test LLM-based voice assistants with real-world variations in speakers, environments, and content, using both real recordings and synthetic speech. This moves the evaluation beyond the usual clean-speech ASR tests or text-only knowledge checks. The authors generate instructions that vary those three factors and run experiments on current models to show where they struggle.

The paper does well at identifying a clear gap in how we currently assess these systems. Putting together real and synthetic data is a practical move that could help developers see robustness issues earlier.

The main soft spot is the justification for picking exactly those three variation axes. The work assumes they represent the key neglected real-world factors, but it does not appear to draw from error logs or user studies to confirm their prevalence or impact. If prosody, multi-turn context, or domain vocabulary turn out to drive more failures, the benchmark's diagnostic power would be limited. Details on the exact metrics, baselines, and statistical controls would also strengthen the claims. The stress test note raises a fair point here that the selection feels unvalidated.

This is for people working on voice interfaces or multimodal LLMs who need better ways to measure performance outside the lab. A reader focused on practical evaluation frameworks will find the dataset and the reported limitations useful. The paper shows honest engagement with the problem and deserves a serious referee to sort out the design choices and results presentation. I would recommend sending it to peer review rather than a desk reject. Reviewers can push for more evidence on why these variations matter most and for fuller reporting of the experimental setup.

Referee Report

2 major / 2 minor

Summary. The paper introduces VoiceBench, the first benchmark for multi-faceted evaluation of LLM-based voice assistants. It includes both real and synthetic spoken instructions incorporating variations in speaker characteristics, environmental factors, and content factors. The authors report that extensive experiments reveal limitations of current models relative to traditional ASR or clean-speech knowledge evaluations.

Significance. If the benchmark construction holds and the chosen variations prove representative, this work would provide a useful standardized resource for assessing voice assistants under more realistic conditions, offering insights that could guide improvements in handling diverse real-world speech interactions beyond current narrow evaluations.

major comments (2)

§3 (Benchmark construction): The selection of speaker characteristics, environmental factors, and content factors as the three key real-world variations is presented without quantitative mapping from deployed voice-assistant error logs, user studies, or failure-mode analysis to establish their prevalence or impact; this directly underpins the central claim that VoiceBench supplies a meaningfully more diagnostic evaluation than prior ASR/knowledge benchmarks.
§4 (Experiments): The claim that experiments reveal limitations of current models lacks reported details on concrete metrics, chosen baselines, statistical controls, or how the three variations were operationalized in the test sets, making it impossible to evaluate whether the results support the multi-faceted diagnostic contribution.

minor comments (2)

Abstract: Adding one sentence on the total number of instructions, models tested, and headline quantitative findings would give readers an immediate sense of scale.
Notation and terminology: Ensure consistent use of 'LLM-based voice assistants' versus 'voice assistant models' throughout to avoid minor reader confusion.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments. We address each major comment below, indicating revisions where appropriate to strengthen the manuscript's justification and clarity.

Referee: §3 (Benchmark construction): The selection of speaker characteristics, environmental factors, and content factors as the three key real-world variations is presented without quantitative mapping from deployed voice-assistant error logs, user studies, or failure-mode analysis to establish their prevalence or impact; this directly underpins the central claim that VoiceBench supplies a meaningfully more diagnostic evaluation than prior ASR/knowledge benchmarks.

Authors: We agree that a more explicit link to empirical evidence of prevalence would strengthen the central claim. The variations were chosen based on recurring themes in the existing voice assistant and ASR literature regarding real-world robustness challenges, but the current manuscript does not include a dedicated quantitative mapping from new error-log analysis or user studies. In revision, we will expand §3 with citations to relevant prior user studies and failure-mode reports (e.g., on accent robustness, environmental noise impact, and query complexity) to better ground the selection and clarify the diagnostic advantage over prior benchmarks.
Revision: yes

Referee: §4 (Experiments): The claim that experiments reveal limitations of current models lacks reported details on concrete metrics, chosen baselines, statistical controls, or how the three variations were operationalized in the test sets, making it impossible to evaluate whether the results support the multi-faceted diagnostic contribution.

Authors: We acknowledge that greater detail on experimental design would improve evaluability. The manuscript reports task-specific accuracy and robustness metrics across conditions, includes baselines such as standard ASR pipelines and text-only LLM evaluations, and describes variation implementation (e.g., TTS synthesis with controlled prosody and noise injection). To address the concern directly, we will revise §4 to add expanded result tables broken down by each variation, statistical significance tests, and a clearer step-by-step account of how speaker, environmental, and content factors were instantiated in the real and synthetic test sets.
Revision: yes
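
The response above mentions noise injection without describing a procedure. As a point of reference only, one common way to build such a test condition is to scale a noise recording so the mixture reaches a target signal-to-noise ratio (SNR). The sketch below shows that standard calculation on plain Python lists. It is an assumption about how an environmental variation could be instantiated, not the authors' pipeline, and the function names, sample rate, and signals are hypothetical.

```python
import math
import random


def power(signal):
    """Mean squared amplitude of a list of samples."""
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Add noise to speech, scaled so speech power / noise power = 10^(snr_db/10)."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    p_speech, p_noise = power(speech), power(noise)
    if p_noise == 0:
        raise ValueError("noise must not be silent")
    # Solve p_speech / (scale^2 * p_noise) = 10^(snr_db / 10) for scale.
    scale = math.sqrt(p_speech / (p_noise * 10 ** (snr_db / 10)))
    return [s + scale * n for s, n in zip(speech, noise)]


# Synthetic stand-ins: a 440 Hz tone as "speech" and Gaussian noise.
rng = random.Random(0)
speech = [math.sin(2 * math.pi * 440 * t / 16000) for t in range(16000)]
noise = [rng.gauss(0.0, 1.0) for _ in range(16000)]
noisy = mix_at_snr(speech, noise, snr_db=10.0)
```

Sweeping `snr_db` over a few values (for example 20, 10, and 0 dB) would give a graded series of conditions, which is one way the per-variation result tables promised in the response could be organized.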

Circularity Check

0 steps flagged

No circularity: benchmark introduction with no derivation chain

The paper introduces VoiceBench as a new evaluation benchmark for LLM-based voice assistants, asserting that prior work focuses on clean ASR or knowledge tests while neglecting speaker/environment/content variations. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The selection of the three variation axes is presented as an assumption to address real-world gaps, but this is not a self-definitional reduction, fitted-input prediction, or self-citation load-bearing step. The contribution is the benchmark construction and its application, which remains independent of any internal circular logic. No load-bearing claim reduces to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that speaker, environmental, and content factors are the primary real-world variations neglected by existing evaluations; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speeches, neglecting intricate real-world scenarios.
Explicitly stated in the abstract as the motivation for the new benchmark.

Forward citations

Cited by 18 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
cs.SD · 2026-05 · accept · novelty 8.0
EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.

Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
cs.CR · 2026-04 · conditional · novelty 8.0
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.

VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD · 2026-04 · unverdicted · novelty 8.0
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM · 2026-05 · unverdicted · novelty 7.0
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
cs.CL · 2026-05 · unverdicted · novelty 7.0
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.

SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
cs.CL · 2026-04 · unverdicted · novelty 7.0
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment
cs.CL · 2026-04 · unverdicted · novelty 7.0
Curated 50-example subsets of LAM benchmarks, via regression, predict human preferences at 0.98 correlation, outperforming the full benchmark and yielding the open-sourced HUMANS proxy.

From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
cs.AI · 2026-04 · unverdicted · novelty 7.0
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.

Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
cs.CR · 2026-04 · unverdicted · novelty 7.0
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
eess.AS · 2026-04 · unverdicted · novelty 7.0
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...

The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
eess.AS · 2026-03 · unverdicted · novelty 7.0
FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.

Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
cs.SD · 2025-07 · unverdicted · novelty 7.0
Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM · 2026-05 · unverdicted · novelty 6.0
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
cs.CL · 2026-04 · unverdicted · novelty 6.0
MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.

Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
cs.SD · 2026-04 · unverdicted · novelty 6.0
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.

MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
cs.ET · 2026-03 · unverdicted · novelty 6.0
MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.

Kimi-Audio Technical Report
eess.AS · 2025-04 · unverdicted · novelty 5.0
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...

Qwen2.5-Omni Technical Report
cs.CL · 2025-03 · conditional · novelty 5.0
Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...

Reference graph

Works this paper leans on

85 extracted references · 85 canonical work pages · cited by 17 Pith papers · 11 internal anchors

[2]

Advances in Neural Information Processing Systems , volume=

Beavertails: Towards improved safety alignment of llm via a human-preference dataset , author=. Advances in Neural Information Processing Systems , volume=

work page
[3]

The Twelfth International Conference on Learning Representations , year=

Listen, Think, and Understand , author=. The Twelfth International Conference on Learning Representations , year=

work page
[7]

The Twelfth International Conference on Learning Representations , year=

Evaluating Large Language Models at Evaluating Instruction Following , author=. The Twelfth International Conference on Learning Representations , year=

work page
[13]

1994 , school=

Preliminaries to a theory of speech disfluencies , author=. 1994 , school=

work page 1994
[14]

Advances in neural information processing systems , volume=

Visual instruction tuning , author=. Advances in neural information processing systems , volume=

work page
[16]

PloS one , volume=

Speech recognition in natural background noise , author=. PloS one , volume=. 2013 , publisher=

work page 2013
[17]

Proceedings of the 40th International Conference on Machine Learning , pages =

Large Language Models Can Be Easily Distracted by Irrelevant Context , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

work page 2023
[18]

ELT journal , volume=

Spoken grammar: what is it and how can we teach it? , author=. ELT journal , volume=. 1995 , publisher=

work page 1995
[19]

Applied linguistics , volume=

Grammar and the spoken language , author=. Applied linguistics , volume=. 1995 , publisher=

work page 1995
[20]

Journal of verbal learning and verbal behavior , volume=

Stages in sentence production: An analysis of speech error data , author=. Journal of verbal learning and verbal behavior , volume=. 1981 , publisher=

work page 1981
[23]

International conference on machine learning , pages=

Robust speech recognition via large-scale weak supervision , author=. International conference on machine learning , pages=. 2023 , organization=

work page 2023
[26]

The Journal of the Acoustical Society of America , volume=

Acoustic properties of naturally produced clear speech at normal speaking rates , author=. The Journal of the Acoustical Society of America , volume=. 2004 , publisher=

work page 2004
[27]

Cognition , volume=

Perceptual adaptation to non-native speech , author=. Cognition , volume=. 2008 , publisher=

work page 2008
[34]

Hashimoto , title =

Xuechen Li and Tianyi Zhang and Yann Dubois and Rohan Taori and Ishaan Gulrajani and Carlos Guestrin and Percy Liang and Tatsunori B. Hashimoto , title =. GitHub repository , howpublished =. 2023 , month =

work page 2023
[38]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

work page 2024
[41]

Claude 3.5 Sonnet Model Card Addendum , howpublished =

Anthropic. Claude 3.5 Sonnet Model Card Addendum , howpublished =

work page
[42]

ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=

A Chat about Boring Problems: Studying GPT-Based Text Normalization , author=. ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , pages=. 2024 , organization=

work page 2024
[43]

Zhao, Wenliang and Yu, Xumin and Qin, Zengyi , title =

work page
[45]

Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors , year=

Kumatani, Kenichi and McDonough, John and Raj, Bhiksha , journal=. Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors , year=

work page
[46]

and Buzo, A

Gray, R. and Buzo, A. and Gray, A. and Matsuyama, Y. , journal=. Distortion measures for speech processing , year=

work page
[47]

Making Machines Understand Us in Reverberant Rooms: Robustness Against Reverberation for Automatic Speech Recognition , year=

Yoshioka, Takuya and Sehr, Armin and Delcroix, Marc and Kinoshita, Keisuke and Maas, Roland and Nakatani, Tomohiro and Kellermann, Walter , journal=. Making Machines Understand Us in Reverberant Rooms: Robustness Against Reverberation for Automatic Speech Recognition , year=

work page
[48]

Packet Loss Concealment Based on Deep Neural Networks for Digital Speech Transmission , year=

Lee, Bong-Ki and Chang, Joon-Hyuk , journal=. Packet Loss Concealment Based on Deep Neural Networks for Digital Speech Transmission , year=

work page
[49]

IEEE/ACM Transactions on Audio, Speech, and Language Processing , year=

Multichannel Signal Processing With Deep Neural Networks for Automatic Speech Recognition , author=. IEEE/ACM Transactions on Audio, Speech, and Language Processing , year=

work page
[50]

Interspeech , year=

The Conversation: Deep Audio-Visual Speech Enhancement , author=. Interspeech , year=

work page
[51]

and Malah, D

Ephraim, Y. and Malah, D. , journal=. Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator , year=

work page
[52]

Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. 2023. Gpt-4 technical report. arXiv preprint arXiv:2303.08774

work page internal anchor Pith review Pith/arXiv arXiv 2023
[53]

Triantafyllos Afouras, Joon Son Chung, and Andrew Zisserman. 2018. The conversation: Deep audio-visual speech enhancement. Interspeech

work page 2018
[54]

Anthropic . 2024. Claude 3.5 sonnet model card addendum. https://www-cdn.anthropic.com/fed9cc193a14b84131812372d8d5857f8f304c52/Model_Card_Claude_3_Addendum.pdf. Online; accessed October 2024

work page 2024
[55]

Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. 2020. https://aclanthology.org/2020.lrec-1.520 Common voice: A massively-multilingual speech corpus . In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218--4222, Marsei...

work page 2020
[56]

Ann R Bradlow and Tessa Bent. 2008. Perceptual adaptation to non-native speech. Cognition, 106(2):707--729

work page 2008
[57]

Andrew Caines, Christian Bentz, Kate Knill, Marek Rei, and Paula Buttery. 2020. https://doi.org/10.18653/v1/2020.coling-main.195 Grammatical error detection in transcriptions of spoken E nglish . In Proceedings of the 28th International Conference on Computational Linguistics, pages 2144--2162, Barcelona, Spain (Online). International Committee on Computa...

work page doi:10.18653/v1/2020.coling-main.195 2020
[58]

Ronald Carter and Michael Mncarthy. 1995. Grammar and the spoken language. Applied linguistics, 16(2):141--158

work page 1995
[59]

Kai Chen, Yunhao Gou, Runhui Huang, Zhili Liu, Daxin Tan, Jing Xu, Chunwei Wang, Yi Zhu, Yihan Zeng, Kuo Yang, et al. 2024 a . Emova: Empowering language models to see, hear and speak with vivid emotions. arXiv preprint arXiv:2409.18042

work page arXiv 2024
[60]

Yiming Chen, Xianghu Yue, Xiaoxue Gao, Chen Zhang, Luis Fernando D'Haro, Robby T Tan, and Haizhou Li. 2024 b . Beyond single-audio: Advancing multi-audio processing in audio large language models. arXiv preprint arXiv:2409.18680

work page arXiv 2024
[61]

Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. 2024 c . How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. arXiv preprint arXiv:2404.16821

work page internal anchor Pith review Pith/arXiv arXiv 2024
[62]

Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. 2024. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759

work page internal anchor Pith review Pith/arXiv arXiv 2024
[63]

Yunfei Chu, Jin Xu, Xiaohuan Zhou, Qian Yang, Shiliang Zhang, Zhijie Yan, Chang Zhou, and Jingren Zhou. 2023. Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models. arXiv preprint arXiv:2311.07919

work page internal anchor Pith review Pith/arXiv arXiv 2023
[64]

Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki

Jonathan H. Clark, Eunsol Choi, Michael Collins, Dan Garrette, Tom Kwiatkowski, Vitaly Nikolaev, and Jennimaria Palomaki. 2020. https://doi.org/10.1162/tacl_a_00317 T y D i QA : A benchmark for information-seeking question answering in typologically diverse languages . Transactions of the Association for Computational Linguistics, 8:454--470

work page doi:10.1162/tacl_a_00317 2020
[65]

Alexandre D \'e fossez, Laurent Mazar \'e , Manu Orsini, Am \'e lie Royer, Patrick P \'e rez, Herv \'e J \'e gou, Edouard Grave, and Neil Zeghidour. 2024. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037

work page internal anchor Pith review Pith/arXiv arXiv 2024
[66]

Gary S Dell and Peter A Reich. 1981. Stages in sentence production: An analysis of speech error data. Journal of verbal learning and verbal behavior, 20(6):611--629

work page 1981
[67]

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. 2024. Cosyvoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens. arXiv preprint arXiv:2407.05407

work page internal anchor Pith review Pith/arXiv arXiv 2024
[68]

Ephraim and D

Y. Ephraim and D. Malah. 1984. https://doi.org/10.1109/TASSP.1984.1164453 Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator . IEEE Transactions on Acoustics, Speech, and Signal Processing, 32(6):1109--1121

work page doi:10.1109/tassp.1984.1164453 1984
[69]

Fahim Faisal, Sharlina Keshava, Md Mahfuz Ibn Alam, and Antonios Anastasopoulos. 2021. https://doi.org/10.18653/v1/2021.findings-emnlp.281 SD - QA : Spoken dialectal question answering for the real world . In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3296--3315, Punta Cana, Dominican Republic. Association for Computation...

work page doi:10.18653/v1/2021.findings-emnlp.281 2021
[70]

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. 2024. Llama-omni: Seamless speech interaction with large language models. arXiv preprint arXiv:2409.06666

work page arXiv 2024
[71]

Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, et al. 2024. Vita: Towards open-source interactive omni multimodal llm. arXiv preprint arXiv:2408.05211

work page arXiv 2024
[72]

Liu, Leonid Karlinsky, and James R

Yuan Gong, Hongyin Luo, Alexander H. Liu, Leonid Karlinsky, and James R. Glass. 2024. https://openreview.net/forum?id=nBZBPXdJlC Listen, think, and understand . In The Twelfth International Conference on Learning Representations

work page 2024
[73]

R. Gray, A. Buzo, A. Gray, and Y. Matsuyama. 1980. Distortion measures for speech processing. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28(4):367--376

work page 1980
[74]

William Held, Ella Li, Michael Ryan, Weiyan Shi, Yanzhe Zhang, and Diyi Yang. 2024. Distilling an end-to-end voice assistant without instruction training data. arXiv preprint arXiv:2410.02678

work page arXiv 2024
[75]

Chien-yu Huang, Ke-Han Lu, Shih-Heng Wang, Chi-Yuan Hsiao, Chun-Yi Kuan, Haibin Wu, Siddhant Arora, Kai-Wei Chang, Jiatong Shi, Yifan Peng, et al. 2024. Dynamic-superb: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech. In ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (I...

work page 2024
[76]

Paria Jamshid Lou and Mark Johnson. 2020. https://doi.org/10.18653/v1/2020.findings-emnlp.186 End-to-end speech recognition and disfluency removal . In Findings of the Association for Computational Linguistics: EMNLP 2020, pages 2051--2061, Online. Association for Computational Linguistics

work page doi:10.18653/v1/2020.findings-emnlp.186 2020
[77]

Jiaming Ji, Mickel Liu, Josef Dai, Xuehai Pan, Chi Zhang, Ce Bian, Boyuan Chen, Ruiyang Sun, Yizhou Wang, and Yaodong Yang. 2024. Beavertails: Towards improved safety alignment of llm via a human-preference dataset. Advances in Neural Information Processing Systems, 36

work page 2024
[78]

Yassine Kheir, Ahmed Ali, and Shammur Chowdhury. 2023. https://doi.org/10.18653/v1/2023.findings-emnlp.557 Automatic pronunciation assessment - a review . In Findings of the Association for Computational Linguistics: EMNLP 2023, pages 8304--8324, Singapore. Association for Computational Linguistics

[79]

Jean C Krause and Louis D Braida. 2004. Acoustic properties of naturally produced clear speech at normal speaking rates. The Journal of the Acoustical Society of America, 115(1):362--378

[80]

Kenichi Kumatani, John McDonough, and Bhiksha Raj. 2012. Microphone array processing for distant speech recognition: From close-talking microphones to far-field sensors. IEEE Signal Processing Magazine, 29(6):127--140

[81]

Bong-Ki Lee and Joon-Hyuk Chang. 2016. Packet loss concealment based on deep neural networks for digital speech transmission. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24(2):378--387

[82]

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. 2023. Alpacaeval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval

[83]

Yadong Li, Haoze Sun, Mingan Lin, Tianpeng Li, Guosheng Dong, Tao Zhang, Bowen Ding, Wei Song, Zhenglin Cheng, Yuqi Huo, et al. 2024a. Baichuan-omni technical report. arXiv preprint arXiv:2410.08565

[84]

Zongxia Li, Ishani Mondal, Yijun Liang, Huy Nghiem, and Jordan Lee Boyd-Graber. 2024b. Panda (pedantic answer-correctness determination and adjudication): Improving automatic evaluation for question answering and text generation. arXiv preprint arXiv:2402.11161

[85]

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. 2024. Visual instruction tuning. Advances in neural information processing systems, 36

[86]

Yang Liu, Yuanshun Yao, Jean-Francois Ton, Xiaoying Zhang, Ruocheng Guo, Hao Cheng, Yegor Klochkov, Muhammad Faaiz Taufiq, and Hang Li. 2023. Trustworthy llms: a survey and guideline for evaluating large language models' alignment. arXiv preprint arXiv:2308.05374

[87]

Benjamin Marie. 2023. Disfluency generation for more robust dialogue systems. In Findings of the Association for Computational Linguistics: ACL 2023, pages 11479--11488, Toronto, Canada. Association for Computational Linguistics. https://doi.org/10.18653/v1/2023.findings-acl.728

[88]

Michael McCarthy and Ronald Carter. 1995. Spoken grammar: what is it and how can we teach it? ELT journal, 49(3):207--218

[89]

Julien Meyer, Laure Dentel, and Fanny Meunier. 2013. Speech recognition in natural background noise. PloS one, 8(11):e79279

[90]

Aidar Myrzakhan, Sondos Mahmoud Bsharat, and Zhiqiang Shen. 2024. Open-llm-leaderboard: From multi-choice to open-style questions for llms evaluation, benchmark, and arena. arXiv preprint arXiv:2406.07545

[91]

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. 2023. Robust speech recognition via large-scale weak supervision. In International conference on machine learning, pages 28492--28518. PMLR

[92]

Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. 2024. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530

[93]

Tara N. Sainath, Ron J. Weiss, Kevin W. Wilson, Bo Li, Arun Narayanan, Ehsan Variani, Michiel Bacchiani, Izhak Shafran, Andrew W. Senior, Kean K. Chin, Ananya Misra, and Chanwoo Kim. 2017. Multichannel signal processing with deep neural networks for automatic speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25:965--979

[94]

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. 2023. Large language models can be easily distracted by irrelevant context. In Proceedings of the 40th International Conference on Machine Learning, volume 202 of Proceedings of Machine Learning Research, pages 31210--31227. PMLR

[95]

Elizabeth Ellen Shriberg. 1994. Preliminaries to a theory of speech disfluencies. Ph.D. thesis, Citeseer

[96]

Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. 2023. Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289

[97]

Jean E. Fox Tree. 1995. The effects of false starts and repetitions on the processing of subsequent words in spontaneous speech. Journal of Memory and Language, 34(6):709--738. https://doi.org/10.1006/jmla.1995.1032

[98]

Bin Wang, Chengwei Wei, Zhengyuan Liu, Geyu Lin, and Nancy F Chen. 2024a. Resilience of large language models for noisy instructions. arXiv preprint arXiv:2404.09754

[99]

Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy F Chen. 2024b. Audiobench: A universal benchmark for audio large language models. arXiv preprint arXiv:2406.16020

[100]

Yubo Wang, Xueguang Ma, Ge Zhang, Yuansheng Ni, Abhranil Chandra, Shiguang Guo, Weiming Ren, Aaran Arulraj, Xuan He, Ziyan Jiang, et al. 2024c. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark. arXiv preprint arXiv:2406.01574

[101]

Zhifei Xie and Changqiao Wu. 2024a. Mini-omni: Language models can hear, talk while thinking in streaming. arXiv preprint arXiv:2408.16725

[102]

Zhifei Xie and Changqiao Wu. 2024b. Mini-omni2: Towards open-source gpt-4o model with vision, speech and duplex. arXiv preprint arXiv:2410.11190

[103]

Zhangchen Xu, Fengqing Jiang, Luyao Niu, Jinyuan Jia, Bill Yuchen Lin, and Radha Poovendran. 2024. SafeDecoding: Defending against jailbreak attacks via safety-aware decoding. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5587--560... https://doi.org/10.18653/v1/2024.acl-long.303

[104]

Qian Yang, Jin Xu, Wenrui Liu, Yunfei Chu, Ziyue Jiang, Xiaohuan Zhou, Yichong Leng, Yuanjun Lv, Zhou Zhao, Chang Zhou, et al. 2024. Air-bench: Benchmarking large audio-language models via generative comprehension. arXiv preprint arXiv:2402.07729

[105]

Takuya Yoshioka, Armin Sehr, Marc Delcroix, Keisuke Kinoshita, Roland Maas, Tomohiro Nakatani, and Walter Kellermann. 2012. Making machines understand us in reverberant rooms: Robustness against reverberation for automatic speech recognition. IEEE Signal Processing Magazine, 29(6):114--126

[106]

Zhiyuan Zeng, Jiatong Yu, Tianyu Gao, Yu Meng, Tanya Goyal, and Danqi Chen. 2024. Evaluating large language models at evaluating instruction following. In The Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=tr0KidwPLc

