{"paper":{"title":"VoiceBench: Benchmarking LLM-Based Voice Assistants","license":"http://creativecommons.org/licenses/by/4.0/","headline":"VoiceBench introduces the first benchmark to evaluate LLM-based voice assistants under real-world variations in speakers, environments, and content.","cross_cats":["cs.AI","cs.SD","eess.AS"],"primary_cat":"cs.CL","authors_text":"Chen Zhang, Haizhou Li, Robby T. Tan, Xianghu Yue, Xiaoxue Gao, Yiming Chen","submitted_at":"2024-10-22T17:15:20Z","abstract_excerpt":"Building on the success of large language models (LLMs), recent advancements such as GPT-4o have enabled real-time speech interactions through LLM-based voice assistants, offering a significantly improved user experience compared to traditional text-based interactions. However, the absence of benchmarks designed to evaluate these speech interaction capabilities has hindered progress of LLM-based voice assistants development. Current evaluations focus primarily on automatic speech recognition (ASR) or general knowledge evaluation with clean speeches, neglecting the more intricate, real-world sc"},"claims":{"count":4,"items":[{"kind":"strongest_claim","text":"We introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations.","source":"verdict.strongest_claim","status":"machine_extracted","claim_id":"C1","attestation":"unclaimed"},{"kind":"weakest_assumption","text":"That the selected variations in speaker characteristics, environmental factors, and content factors adequately represent the intricate real-world scenarios that current evaluations neglect.","source":"verdict.weakest_assumption","status":"machine_extracted","claim_id":"C2","attestation":"unclaimed"},{"kind":"one_line_summary","text":"VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.","source":"verdict.one_line_summary","status":"machine_extracted","claim_id":"C3","attestation":"unclaimed"},{"kind":"headline","text":"VoiceBench introduces the first benchmark to evaluate LLM-based voice assistants under real-world variations in speakers, environments, and content.","source":"verdict.pith_extraction.headline","status":"machine_extracted","claim_id":"C4","attestation":"unclaimed"}],"snapshot_sha256":"da6901dd0f53cc640dfd4246ea233c47861fa79648cff946c80bf53476223f48"},"source":{"id":"2410.17196","kind":"arxiv","version":3},"verdict":{"id":"b4739c94-417f-4410-b0da-3ccddff6302c","model_set":{"reader":"grok-4.3"},"created_at":"2026-05-17T00:44:53.168044Z","strongest_claim":"We introduce VoiceBench, the first benchmark designed to provide a multi-faceted evaluation of LLM-based voice assistants. VoiceBench also includes both real and synthetic spoken instructions that incorporate the above three key real-world variations.","one_line_summary":"VoiceBench is the first benchmark for multi-faceted evaluation of LLM voice assistants using real and synthetic spoken instructions with speaker, environmental, and content variations.","pipeline_version":"pith-pipeline@v0.9.0","weakest_assumption":"That the selected variations in speaker characteristics, environmental factors, and content factors adequately represent the intricate real-world scenarios that current evaluations neglect.","pith_extraction_headline":"VoiceBench introduces the first benchmark to evaluate LLM-based voice assistants under real-world variations in speakers, environments, and content."},"references":{"count":85,"sample":[{"doi":"","year":null,"title":"Advances in Neural Information Processing Systems , volume=","work_id":"35076cd8-2723-4c99-b604-65676abad5b8","ref_index":2,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The Twelfth International Conference on Learning Representations , year=","work_id":"324853f3-6bd4-422b-b68c-2ed4fc1e0394","ref_index":3,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"The Twelfth International Conference on Learning Representations , year=","work_id":"13d914fb-8f91-4a50-9c77-40b04fba8d96","ref_index":7,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":1994,"title":"Preliminaries to a theory of speech disfluencies , author=. 1994 , school=","work_id":"9fb6096f-3706-4752-a22c-c88c6778e55e","ref_index":13,"cited_arxiv_id":"","is_internal_anchor":false},{"doi":"","year":null,"title":"Advances in neural information processing systems , volume=","work_id":"f2a4f68e-48af-40c6-90d4-76b155136aa5","ref_index":14,"cited_arxiv_id":"","is_internal_anchor":false}],"resolved_work":85,"snapshot_sha256":"d21bca70ec65b4bc4c5adcb504e06df5c8abbede9c390bf6f474d5ca6c2b74bb","internal_anchors":10},"formal_canon":{"evidence_count":2,"snapshot_sha256":"5efe2de4480e92fb74930c7f438726b937bd90861ca54708e829342465816f4e"},"author_claims":{"count":0,"strong_count":0,"snapshot_sha256":"258153158e38e3291e3d48162225fcdb2d5a3ed65a07baac614ab91432fd4f57"},"builder_version":"pith-number-builder-2026-05-17-v1"}