arXiv: 2605.06897 · v1 · submitted 2026-05-07 · 💻 cs.CL · cs.AI · cs.HC · cs.MM · cs.SD · eess.AS


MIST: Multimodal Interactive Speech-based Tool-calling Conversational Assistants for Smart Homes

Alexandros Papangelis, Maximillian Chen, Michael Peng, Xuanming Zhang, Yohan Jo, Zhou Yu

Pith reviewed 2026-05-11 01:10 UTC · model grok-4.3


keywords MIST · multimodal LLMs · tool calling · smart homes · IoT devices · voice assistants · mixed-initiative · spatiotemporal constraints


The pith

MIST dataset exposes gaps between open- and closed-weight multimodal LLMs on voice-driven IoT tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MIST, a synthetic multi-turn voice-driven code generation dataset for IoT devices that requires models to handle speech inputs alongside spatiotemporal constraints, dynamic state tracking, and mixed-initiative conversations. Evaluation results show a clear performance difference between open-weight and closed-weight multimodal LLMs, with even leading closed models leaving notable room for improvement. A reader would care because practical smart home voice assistants must manage real physical-world factors like device locations and timing rather than isolated commands. The authors release both the dataset and an extensible generation framework to encourage further work on these challenges.

Core claim

MIST is presented as a synthetic multi-turn, voice-driven code generation task over IoT devices that incorporates spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. On this benchmark, open-weight multimodal LLMs lag significantly behind closed-weight ones, while even frontier closed-weight models retain substantial headroom.

What carries the argument

MIST, the Multimodal Interactive Speech-based Tool-calling Dataset, functions as the central benchmark by simulating voice-based tool calling that must reason over changing device states and physical constraints in smart homes.
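
To make the idea of tool calls that depend on changing device state concrete, here is a minimal Python sketch. The `Home` class, the call format, and the room and device names are hypothetical illustrations chosen for this example; they are not MIST's actual schema or API.

```python
from dataclasses import dataclass, field


@dataclass
class Home:
    # Hypothetical layout: (room, device) -> dict of attribute values.
    state: dict = field(default_factory=dict)

    def apply(self, call: dict) -> None:
        """Apply one tool call of the form {"room", "device", "attr", "value"}."""
        key = (call["room"], call["device"])
        if key not in self.state:
            # Spatial constraint: the device must exist in the named room.
            raise KeyError(f"no {call['device']} in {call['room']}")
        self.state[key][call["attr"]] = call["value"]


home = Home({
    ("kitchen", "light"): {"power": "off", "brightness": 0},
    ("bedroom", "thermostat"): {"target_c": 20},
})

# Turn 1: "turn on the kitchen light"
home.apply({"room": "kitchen", "device": "light", "attr": "power", "value": "on"})
# Turn 2: "make it a bit brighter" ("it" must resolve to the light from turn 1)
home.apply({"room": "kitchen", "device": "light", "attr": "brightness", "value": 60})
```

The second call is only correct if the assistant has tracked which device the first turn touched, which is the kind of cross-turn dependency the benchmark is described as testing.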

If this is right

Multimodal LLMs require better integration of speech with reasoning about physical device states and locations.
Mixed-initiative dialogue handling becomes essential for voice assistants to manage ongoing smart home interactions.
Open-weight models need specific advances to narrow the observed performance gap with closed models.
The provided data generation framework supports creation of additional datasets for related physical interaction scenarios.
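
The paper's Figure 2 caption describes the framework as sampling personas, IoT devices, and rooms into a home configuration, then sampling valid actions and tool calls conditioned on it. A rough Python sketch of that two-stage idea follows. Every name, device, and value range here is an assumption made for illustration, and the released framework may be organized quite differently.

```python
import random

# Hypothetical inventories; the real framework's options are not shown in this page.
ROOMS = ["kitchen", "bedroom", "living room"]
DEVICES = {
    "light": {"power": ["on", "off"]},
    "thermostat": {"target_c": [18, 20, 22]},
}
PERSONAS = ["night-shift nurse", "retired teacher"]


def sample_home(rng):
    """Stage 1: draw a persona and place one device in each of two rooms."""
    rooms = rng.sample(ROOMS, k=2)
    return {
        "persona": rng.choice(PERSONAS),
        "devices": [(room, rng.choice(sorted(DEVICES))) for room in rooms],
    }


def sample_conversation(home, rng, turns=3):
    """Stage 2: draw tool calls that are valid for this particular home."""
    calls = []
    for _ in range(turns):
        room, device = rng.choice(home["devices"])
        attr = rng.choice(sorted(DEVICES[device]))
        value = rng.choice(DEVICES[device][attr])
        calls.append({"room": room, "device": device, "attr": attr, "value": value})
    return calls


rng = random.Random(0)  # fixed seed so the sample is reproducible
home = sample_home(rng)
conversation = sample_conversation(home, rng)
```

Conditioning the second stage on the sampled home is what keeps every generated call executable: a call can only target a device that was actually placed in a room.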

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training on MIST-style data could improve voice assistants' ability to track device states across multiple turns in real homes.
Future benchmarks might add visual sensors from smart home cameras to test richer multimodal reasoning.
Deployment tests on physical IoT setups could identify gaps between synthetic benchmark performance and actual user experience.

Load-bearing premise

The synthetic multi-turn voice-driven code generation tasks over IoT devices accurately reflect real-world smart home challenges such as spatiotemporal constraints, dynamic state tracking, and mixed-initiative patterns.

What would settle it

An experiment in which models that score highly on MIST are tested in live user sessions with actual IoT hardware and show no corresponding improvement in handling state changes or user interruptions, or the reverse where low-scoring models succeed in practice.

Figures

Figures reproduced from arXiv: 2605.06897 by Alexandros Papangelis, Maximillian Chen, Michael Peng, Xuanming Zhang, Yohan Jo, Zhou Yu.

**Figure 1.** Example conversation from MIST. Users issue voice commands with natural disfluencies and varied accents. The assistant must generate structured API calls while managing ambiguity, corrections, redundancy, and stateful device tracking across turns.

**Figure 2.** Overview of the data generation framework to construct MIST. We first sample from a diverse set of possible user personas, IoT devices, and rooms to form home configurations, then repeatedly sample valid conversational actions and tool calls conditioned on these configurations to form goal-oriented conversations.

**Figure 3.** Error analysis characterizing the types of errors by proportion for each MLLM. The most common tool execution error for frontier models is selecting the 'Wrong Value', whereas open-weight models tend to trigger a tool call at the wrong time or target the wrong device. From the accompanying text: open-weight models achieve moderate Execution Match scores (ranging from 48.76% to 60.94%), yet all…
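
An error breakdown of the kind Figure 3 reports can be sketched as a per-turn comparison between the expected and the predicted tool call. The labels below loosely follow the categories named in the caption, but the call format, the `missed_call` label, and the decision order are assumptions for this example. The paper's actual taxonomy and scoring may differ.

```python
from collections import Counter


def classify(gold, pred):
    """Label one turn; gold/pred are call dicts, or None when no call is made."""
    if gold == pred:
        return "correct"
    if gold is None:
        return "wrong_time"    # a tool was called when no call was expected
    if pred is None:
        return "missed_call"   # hypothetical label, not named in the caption
    if (gold["room"], gold["device"]) != (pred["room"], pred["device"]):
        return "wrong_device"
    return "wrong_value"       # right device, different attribute or value


def error_proportions(pairs):
    """Share of each error type among the erroneous turns only."""
    counts = Counter(classify(g, p) for g, p in pairs)
    counts.pop("correct", None)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()} if total else {}


# Toy turns invented for this example, not taken from MIST.
gold = {"room": "kitchen", "device": "light", "attr": "power", "value": "on"}
pairs = [
    (gold, dict(gold)),
    (None, gold),
    (gold, {**gold, "device": "fan"}),
    (gold, {**gold, "value": "off"}),
    (gold, {**gold, "value": "dim"}),
]
proportions = error_proportions(pairs)
```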

Original abstract

The rise of Internet of Things (IoT) devices in the physical world necessitates voice-based interfaces capable of handling complex user experiences. While modern Large Language Models (LLMs) already demonstrate strong tool-usage capabilities, modeling real-world IoT devices presents a difficult, understudied challenge which combines modeling spatiotemporal constraints with speech inputs, dynamic state tracking, and mixed-initiative interaction patterns. We introduce MIST (the Multimodal Interactive Speech-based Tool-calling Dataset), a synthetic multi-turn, voice-driven code generation task that operates over IoT devices. We find that there is a significant gap between open- and closed-weight multimodal LLMs on MIST, and that even frontier closed-weight LLMs have substantial headroom. We release MIST and an extensible data generation framework to build related datasets in order to facilitate research on mixed-initiative voice assistants which reason about physical world constraints.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MIST releases a new synthetic benchmark for speech-based IoT tool calling that highlights model gaps, but the procedural data generation leaves the reported open/closed performance difference hard to interpret without further checks.


MIST is a new synthetic dataset for multi-turn, voice-driven tool calling over IoT devices that adds speech input plus spatiotemporal and state constraints. The paper's concrete step is releasing both the dataset and an extensible generation framework, which gives others a starting point to build or vary similar tests in this area. That addresses a real understudied corner of tool use where prior benchmarks have stayed mostly in software or single-turn settings.

The reported finding of a gap between open- and closed-weight multimodal models, plus headroom on frontier closed models, is presented as evidence that the task is non-trivial. The framing of mixed-initiative dialogue and physical constraints is reasonable and matches practical needs as IoT grows. The citation pattern stays within standard tool-calling and multimodal LLM references without obvious omissions.

The soft spots sit in the evaluation and data construction. The abstract supplies no model list, metrics, or significance tests, and the full paper would need to show how the synthetic scripts were validated against actual device behavior or user speech. The stress-test concern holds weight here: fixed schemas and templated scripts could let closed models exploit pretraining patterns rather than demonstrate better handling of constraints or state tracking. No ablations for added noise like ASR errors or underspecified goals are described, so the gap might not survive more realistic conditions.

This work is mainly for researchers building or benchmarking voice interfaces for physical devices who want a fresh testbed. The dataset itself is the part worth engaging with. I would send it to peer review because the artifact is new and the problem area matters, though referees would likely ask for clearer validation of the generation process and stronger controls on the empirical claims.

Referee Report

3 major / 2 minor

Summary. The paper introduces MIST, a synthetic multi-turn dataset for voice-driven tool-calling and code generation over IoT device schemas in smart-home settings. The task requires models to handle spatiotemporal constraints, dynamic state tracking, and mixed-initiative dialogue while producing executable code. The central empirical result is a reported performance gap between open- and closed-weight multimodal LLMs together with substantial remaining headroom even for frontier closed models. The authors release the dataset and an extensible procedural generation framework.

Significance. A well-validated benchmark that isolates the combination of speech input, physical-world constraints, and multi-turn tool use would be a useful addition to the evaluation landscape for conversational agents. The release of both the data and the generation code is a clear positive. However, the significance of the headline gap finding is currently limited by the absence of any reported metrics, model list, statistical tests, or controls for synthetic artifacts, so the result cannot yet be treated as a reliable signal about model capabilities.
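
One common way to attach uncertainty to a gap between two models scored on the same items is a paired bootstrap. The sketch below is a generic illustration in plain Python, not the authors' evaluation protocol. The outcome vectors are made-up placeholders and do not represent any result from the paper.

```python
import random


def paired_bootstrap(a, b, n_resamples=2000, seed=0):
    """Return the observed accuracy gap (a - b) and a 95% percentile interval.

    a and b hold per-item 0/1 outcomes for two models on the same items.
    """
    assert len(a) == len(b)
    rng = random.Random(seed)
    n = len(a)
    observed = (sum(a) - sum(b)) / n
    gaps = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        gaps.append(sum(a[i] - b[i] for i in idx) / n)
    gaps.sort()
    lo = gaps[int(0.025 * n_resamples)]
    hi = gaps[int(0.975 * n_resamples) - 1]
    return observed, (lo, hi)


# Placeholder outcomes for two hypothetical models on 100 shared items.
model_a = [1] * 80 + [0] * 20
model_b = [1] * 55 + [0] * 45
observed, (lo, hi) = paired_bootstrap(model_a, model_b)
```

If the interval excludes zero, the gap is unlikely to be an artifact of which items happened to be sampled. Resampling whole conversations rather than individual turns would be the more conservative choice for multi-turn data, since turns within a conversation are not independent.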

major comments (3)

[Abstract and §4 (Experiments)] The claim of a 'significant gap' between open- and closed-weight multimodal LLMs and 'substantial headroom' for frontier models is asserted without any accompanying metrics, model identifiers, evaluation protocol, or statistical significance tests, rendering the central empirical contribution unassessable from the manuscript.
[§3 (Dataset Construction)] The procedural generation from fixed IoT schemas and templated multi-turn scripts is described at a high level, but no ablation or sensitivity analysis is provided to test whether the observed open/closed gap persists under varied generation rules or when realistic noise (ASR errors, underspecified goals) is injected; this directly bears on whether the gap reflects genuine reasoning differences or synthetic artifacts.
[§4 (Experiments)] No information is given on how the synthetic dialogues were validated against real device behavior or user interaction patterns, leaving the weakest assumption, that the task faithfully captures spatiotemporal constraints and mixed-initiative dynamics, unsupported.
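
The noise-injection check raised in the major comments could start from something as simple as the text-level perturbation below. This is a crude, hypothetical stand-in: it inserts filler words into a transcript at a chosen rate, whereas real ASR errors arise from the audio itself, so a faithful study would perturb the speech or use actual recognizer output. The function name and filler list are assumptions for this example.

```python
import random

FILLERS = ["uh", "um", "er"]  # hypothetical filler inventory


def add_disfluencies(text, rate, rng):
    """Insert a random filler before each word with probability `rate`."""
    out = []
    for word in text.split():
        if rng.random() < rate:
            out.append(rng.choice(FILLERS))
        out.append(word)
    return " ".join(out)


rng = random.Random(0)
command = "turn on the kitchen light"
clean = add_disfluencies(command, 0.0, rng)   # rate 0 leaves the text unchanged
noisy = add_disfluencies(command, 0.3, rng)   # light perturbation
heavy = add_disfluencies(command, 1.0, rng)   # a filler before every word
```

Sweeping `rate` and re-scoring each model at every level would show whether the reported gap holds, widens, or closes as the input degrades.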

minor comments (2)

[Abstract] The abstract and introduction would benefit from a brief explicit statement of the exact metrics used (e.g., exact-match code accuracy, state-tracking F1) and the set of models evaluated.
[Figures and Tables] Figure captions and table headers should clarify whether results are averaged over multiple seeds or single runs.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment below and describe the revisions we plan to make.


Referee: [Abstract and §4 (Experiments)] The claim of a 'significant gap' between open- and closed-weight multimodal LLMs and 'substantial headroom' for frontier models is asserted without any accompanying metrics, model identifiers, evaluation protocol, or statistical significance tests, rendering the central empirical contribution unassessable from the manuscript.

Authors: We agree that the abstract and experimental section would benefit from greater explicitness. In the revised manuscript we will update the abstract to report key quantitative metrics (e.g., exact success rates for representative open- and closed-weight models) and will expand §4 to list all model identifiers, describe the complete evaluation protocol, present the precise performance numbers, and include statistical significance tests supporting the reported gap and headroom.

Revision: yes

Referee: [§3 (Dataset Construction)] The procedural generation from fixed IoT schemas and templated multi-turn scripts is described at a high level, but no ablation or sensitivity analysis is provided to test whether the observed open/closed gap persists under varied generation rules or when realistic noise (ASR errors, underspecified goals) is injected; this directly bears on whether the gap reflects genuine reasoning differences or synthetic artifacts.

Authors: We acknowledge that sensitivity analyses would help confirm robustness. Because the full generation code is released, such experiments are straightforward for the community. In the revision we will expand §3 with a more detailed account of the generation rules and add a discussion of potential artifacts together with a limited sensitivity check on core parameters (e.g., script length and constraint density). We maintain that the gap arises from genuine differences in reasoning over spatiotemporal and state-tracking constraints rather than artifacts, given the deterministic, schema-grounded nature of the data.

Revision: partial

Referee: [§4 (Experiments)] No information is given on how the synthetic dialogues were validated against real device behavior or user interaction patterns, leaving the weakest assumption, that the task faithfully captures spatiotemporal constraints and mixed-initiative dynamics, unsupported.

Authors: The dialogues are generated directly from realistic IoT device schemas and multi-turn scripts that explicitly encode spatiotemporal constraints and mixed-initiative turns. In the revised §4 we will add a paragraph describing our internal validation procedure, which consisted of manual inspection of a representative sample of dialogues to verify schema compliance and presence of the target dynamics. We note that large-scale real-user or physical-device studies were outside the scope of this work but are enabled by the released framework; we will clarify this limitation while emphasizing the controlled, reproducible nature of the current benchmark.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical dataset creation and benchmarking with no derivations or fitted predictions


The paper introduces the MIST synthetic dataset for multimodal IoT tool-calling and reports empirical benchmarks on existing open- and closed-weight LLMs. No mathematical derivations, parameter fitting, or predictions are claimed; the core results are direct performance measurements on the new task. No self-citations are used to justify uniqueness theorems or ansatzes, and the generation process is described as procedural from fixed schemas without reducing any output to prior fitted quantities by construction. The reader's assessment of score 1.0 aligns with this being a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical derivations or new theoretical constructs are introduced; the paper is an empirical dataset and benchmarking contribution.

