VoiceBench: Benchmarking LLM-Based Voice Assistants
Pith reviewed 2026-05-17 00:44 UTC · model grok-4.3
The pith
VoiceBench introduces the first benchmark to evaluate LLM-based voice assistants under real-world variations in speakers, environments, and content.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
What carries the argument
VoiceBench, a benchmark that supplies real and synthetic spoken instructions incorporating variations in speaker characteristics, environmental factors, and content factors to test LLM voice assistants beyond clean-speech conditions.
If this is right
- Models that succeed on clean speech will still face measurable drops when speaker accents, background noise, or complex instructions are introduced.
- Development efforts can now target the specific variation types where current assistants perform worst.
- Synthetic data generation within VoiceBench offers a scalable way to expand test coverage without collecting more real recordings.
- Insights from the benchmark can directly inform the design of more robust real-time speech interaction pipelines.
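One way to act on the second point above is to break results down by variation type and rank the weakest first. The following is a minimal Python sketch, not VoiceBench's own tooling: the record format, the variation labels, and the scores are hypothetical placeholders chosen for illustration.

```python
from collections import defaultdict


def accuracy_by_variation(results):
    """Return {variation: accuracy} from (variation, is_correct) records."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for variation, is_correct in results:
        totals[variation] += 1
        correct[variation] += int(is_correct)
    return {v: correct[v] / totals[v] for v in totals}


def weakest_first(accuracies):
    """Sort (variation, accuracy) pairs so the lowest accuracy comes first."""
    return sorted(accuracies.items(), key=lambda kv: kv[1])


# Hypothetical records for illustration only; these are not VoiceBench results.
results = [
    ("clean", True), ("clean", True), ("clean", True), ("clean", False),
    ("accent", True), ("accent", False), ("accent", True), ("accent", False),
    ("noise", False), ("noise", False), ("noise", True), ("noise", False),
    ("content", True), ("content", True), ("content", False), ("content", False),
]

accuracies = accuracy_by_variation(results)
ranking = weakest_first(accuracies)
for variation, acc in ranking:
    print(f"{variation}: {acc:.2f}")
```

Keeping a clean-speech condition in the same table gives each variation a baseline to be compared against.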
Where Pith is reading between the lines
- VoiceBench could be extended with additional languages or device-specific distortions to test broader deployment conditions.
- Systematic comparison of results across multiple base LLMs might reveal which underlying architectures handle acoustic variation more gracefully.
- Widespread use of this benchmark could shift evaluation norms away from text-only proxies toward end-to-end spoken interaction testing.
Load-bearing premise
The chosen variations in speaker characteristics, environmental factors, and content factors adequately represent the intricate real-world scenarios that current evaluations neglect.
What would settle it
Run the same set of current LLM voice models on both VoiceBench and standard clean-speech benchmarks. If the performance gaps are negligible, or if models tuned on VoiceBench-style variations show no measurable gain on new varied instructions, the benchmark's added value would be in doubt.
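Whether a gap counts as "negligible" can be made concrete with a paired comparison over the same items. The sketch below is one possible way to do that, not a procedure taken from the paper: it computes the clean-minus-varied accuracy gap and a percentile bootstrap confidence interval. The function names, the resample count, and the per-item scores are assumptions made for illustration.

```python
import random
from statistics import mean


def accuracy_gap(clean, varied):
    """Mean score on clean items minus mean score on the varied versions."""
    return mean(clean) - mean(varied)


def paired_bootstrap_ci(clean, varied, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for the gap, resampling items jointly."""
    if len(clean) != len(varied):
        raise ValueError("clean and varied must score the same items")
    rng = random.Random(seed)
    n = len(clean)
    gaps = []
    for _ in range(n_resamples):
        # Draw item indices with replacement and keep the clean/varied pairing.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(mean(clean[i] for i in idx) - mean(varied[i] for i in idx))
    gaps.sort()
    lower = gaps[int((alpha / 2) * n_resamples)]
    upper = gaps[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical per-item scores (1 = correct, 0 = incorrect), not real results.
clean_scores = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
varied_scores = [1, 0, 1, 0, 1, 0, 1, 1, 0, 0]

gap = accuracy_gap(clean_scores, varied_scores)
low, high = paired_bootstrap_ci(clean_scores, varied_scores)
print(f"gap={gap:.2f}, 95% interval=({low:.2f}, {high:.2f})")
```

An interval that comfortably includes zero would support the "negligible gap" reading. With only ten items, as in this toy input, the interval will be wide.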
Original abstract
Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered progress of LLM-based voice assistants development. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speeches, neglecting the more intricate, real-world scenarios that involve diverse speaker characteristics, environmental and content factors. To address this, we introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations. Extensive experiments reveal the limitations of current LLM-based voice assistant models and offer valuable insights for future research and development in this field.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VoiceBench, the first benchmark for multi-faceted evaluation of LLM-based voice assistants. It includes both real and synthetic spoken instructions incorporating variations in speaker characteristics, environmental factors, and content factors. The authors report that extensive experiments reveal limitations of current models relative to traditional ASR or clean-speech knowledge evaluations.
Significance. If the benchmark construction holds and the chosen variations prove representative, this work would provide a useful standardized resource for assessing voice assistants under more realistic conditions, offering insights that could guide improvements in handling diverse real-world speech interactions beyond current narrow evaluations.
major comments (2)
- §3 (Benchmark construction): The selection of speaker characteristics, environmental factors, and content factors as the three key real-world variations is presented without quantitative mapping from deployed voice-assistant error logs, user studies, or failure-mode analysis to establish their prevalence or impact; this directly underpins the central claim that VoiceBench supplies a meaningfully more diagnostic evaluation than prior ASR/knowledge benchmarks.
- §4 (Experiments): The claim that experiments reveal limitations of current models lacks reported details on concrete metrics, chosen baselines, statistical controls, or how the three variations were operationalized in the test sets, making it impossible to evaluate whether the results support the multi-faceted diagnostic contribution.
minor comments (2)
- Abstract: Adding one sentence on the total number of instructions, models tested, and headline quantitative findings would give readers an immediate sense of scale.
- Notation and terminology: Ensure consistent use of 'LLM-based voice assistants' versus 'voice assistant models' throughout to avoid minor reader confusion.
Simulated Author's Rebuttal
We thank the referee for their constructive comments. We address each major comment below, indicating revisions where appropriate to strengthen the manuscript's justification and clarity.
Point-by-point responses
- Referee: §3 (Benchmark construction): The selection of speaker characteristics, environmental factors, and content factors as the three key real-world variations is presented without quantitative mapping from deployed voice-assistant error logs, user studies, or failure-mode analysis to establish their prevalence or impact; this directly underpins the central claim that VoiceBench supplies a meaningfully more diagnostic evaluation than prior ASR/knowledge benchmarks.
  Authors: We agree that a more explicit link to empirical evidence of prevalence would strengthen the central claim. The variations were chosen based on recurring themes in the existing voice assistant and ASR literature regarding real-world robustness challenges, but the current manuscript does not include a dedicated quantitative mapping from new error-log analysis or user studies. In revision, we will expand §3 with citations to relevant prior user studies and failure-mode reports (e.g., on accent robustness, environmental noise impact, and query complexity) to better ground the selection and clarify the diagnostic advantage over prior benchmarks.
  Revision: yes
- Referee: §4 (Experiments): The claim that experiments reveal limitations of current models lacks reported details on concrete metrics, chosen baselines, statistical controls, or how the three variations were operationalized in the test sets, making it impossible to evaluate whether the results support the multi-faceted diagnostic contribution.
  Authors: We acknowledge that greater detail on experimental design would improve evaluability. The manuscript reports task-specific accuracy and robustness metrics across conditions, includes baselines such as standard ASR pipelines and text-only LLM evaluations, and describes variation implementation (e.g., TTS synthesis with controlled prosody and noise injection). To address the concern directly, we will revise §4 to add expanded result tables broken down by each variation, statistical significance tests, and a clearer step-by-step account of how speaker, environmental, and content factors were instantiated in the real and synthetic test sets.
  Revision: yes
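The rebuttal mentions noise injection as one way the environmental variation was implemented, without giving the procedure. A common approach, shown here only as a hedged sketch and not as the authors' method, is to scale a noise signal so that the mixture reaches a chosen signal-to-noise ratio (SNR). The function names, the sine-wave stand-in for speech, the Gaussian noise, and the 10 dB target are all assumptions for illustration.

```python
import math
import random


def power(signal):
    """Average power of a sequence of samples."""
    if not signal:
        raise ValueError("signal must not be empty")
    return sum(x * x for x in signal) / len(signal)


def mix_at_snr(speech, noise, snr_db):
    """Scale noise to the target SNR (in dB) relative to speech, then add it."""
    if len(speech) != len(noise):
        raise ValueError("speech and noise must have the same length")
    noise_power = power(noise)
    if noise_power == 0:
        raise ValueError("noise must not be silent")
    target_noise_power = power(speech) / (10 ** (snr_db / 10))
    scale = math.sqrt(target_noise_power / noise_power)
    return [s + scale * n for s, n in zip(speech, noise)]


def measured_snr_db(speech, mixed):
    """SNR of a mixture, treating (mixed - speech) as the noise component."""
    residual = [m - s for m, s in zip(mixed, speech)]
    return 10 * math.log10(power(speech) / power(residual))


# Stand-in signals: a 440 Hz tone at 16 kHz and seeded Gaussian noise.
sample_rate = 16000
speech = [0.5 * math.sin(2 * math.pi * 440 * t / sample_rate)
          for t in range(sample_rate)]
rng = random.Random(0)
noise = [rng.gauss(0.0, 1.0) for _ in range(sample_rate)]

mixed = mix_at_snr(speech, noise, snr_db=10.0)
print(f"measured SNR: {measured_snr_db(speech, mixed):.2f} dB")
```

Reporting the target SNR for each noisy test set is one concrete way to document how the environmental factor was operationalized, which is the detail the referee asks for.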
Circularity Check
No circularity: benchmark introduction with no derivation chain
Full rationale
The paper introduces VoiceBench as a new evaluation benchmark for LLM-based voice assistants, asserting that prior work focuses on clean ASR or knowledge tests while neglecting speaker/environment/content variations. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. The selection of the three variation axes is presented as an assumption to address real-world gaps, but this is not a self-definitional reduction, fitted-input prediction, or self-citation load-bearing step. The contribution is the benchmark construction and its application, which remains independent of any internal circular logic. No load-bearing claim reduces to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speech, neglecting intricate real-world scenarios.
Forward citations
Cited by 18 Pith papers
- EVA-Bench: A New End-to-end Framework for Evaluating Voice Agents
  EVA-Bench introduces a simulation-plus-scoring framework for voice agents that reveals no tested system exceeds 0.5 on both accuracy and experience metrics at pass@1.
- Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
  Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
- VoxSafeBench: Not Just What Is Said, but Who, How, and Where
  VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
  MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
  SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- Putting HUMANS first: Efficient LAM Evaluation with Human Preference Alignment
  Curated 50-example subsets of LAM benchmarks, via regression, predict human preferences at 0.98 correlation, outperforming the full benchmark and yielding the open-sourced HUMANS proxy.
- From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
  ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
- Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
  AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
- HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
  HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
- The Silent Thought: Modeling Internal Cognition in Full-Duplex Spoken Dialogue Models via Latent Reasoning
  FLAIR enables spoken dialogue AI to conduct continuous latent reasoning while perceiving speech through recursive latent embeddings and an ELBO-based finetuning objective.
- Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
  Audio Flamingo 3 introduces an open large audio-language model achieving new state-of-the-art results on over 20 audio understanding and reasoning benchmarks using a unified encoder and curriculum training on open data.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- MiniCPM-o 4.5: Towards Real-Time Full-Duplex Omni-Modal Interaction
  MiniCPM-o 4.5 uses the Omni-Flow streaming framework to deliver real-time full-duplex omni-modal interaction with proactive behavior in a 9B model that approaches Gemini 2.5 Flash performance.
- Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
  Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
- MM-tau-p$^2$: Persona-Adaptive Prompting for Robust Multi-Modal Agent Evaluation in Dual-Control Settings
  MM-tau-p² is a new benchmark with 12 metrics that measures how well multi-modal agents adapt to user personas and maintain robustness in dual-control interactions.
- Kimi-Audio Technical Report
  Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million ho...
- Qwen2.5-Omni Technical Report
  Qwen2.5-Omni presents a multimodal model with block-wise encoders, TMRoPE position embeddings, and a Thinker-Talker architecture that enables simultaneous text and streaming speech generation while matching text perfo...