MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
3 Pith papers cite this work.
years: 2026 (3)
representative citing papers
Unified Audio Schema adds structured paralinguistic and event labels to audio training data, raising fine-grained perception scores by 10.9% on MMSU while keeping reasoning intact.
Full-Duplex-Bench-v3 provides a dataset of real human audio with five disfluency types and chained API tasks to benchmark six voice agent systems, revealing that GPT-Realtime leads in accuracy while cascaded pipelines suffer the highest latency.
citing papers explorer
- MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes
MIST is a new synthetic speech-based tool-calling dataset for IoT devices that exposes performance gaps between open- and closed-weight multimodal LLMs.
- Beyond Transcription: Unified Audio Schema for Perception-Aware AudioLLMs
Unified Audio Schema adds structured paralinguistic and event labels to audio training data, raising fine-grained perception scores by 10.9% on MMSU while keeping reasoning intact.
- Full-Duplex-Bench-v3: Benchmarking Tool Use for Full-Duplex Voice Agents Under Real-World Disfluency
Full-Duplex-Bench-v3 provides a dataset of real human audio with five disfluency types and chained API tasks to benchmark six voice agent systems, revealing that GPT-Realtime leads in accuracy while cascaded pipelines suffer the highest latency.