pith. sign in

arxiv: 2606.19595 · v1 · pith:7ABCQHIXnew · submitted 2026-06-17 · 💻 cs.LG · cs.AI

IHBench: Evaluating Post-Interruption Recovery in Voice Agents with Structured Workflows

Pith reviewed 2026-06-26 20:40 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords voice agentsinterruption handlingrecovery qualitystructured workflowsaudio-language modelstask fulfillmentbenchmarksclosed-weight models
0
0 comments X

The pith

Closed-weight voice models recover more robustly from interruptions in structured workflows than open-weight ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces IHBench to test how voice agents resume multi-step enterprise workflows after user interruptions. It generates rubrics for each interruption and scores models on whether they complete the task correctly and address the interjection without repeating prior content. Across 27 configurations, closed-weight models from OpenAI and Google fulfill tasks more often, lose performance more slowly as conversations lengthen, and perform equally on audio or text input. Open-weight models fall behind on all three measures. A human study backs the automated judge, and recovery quality appears separate from other speech benchmarks.

Core claim

Voice agents executing state-machine workflows across ten domains must resume at the correct step after interruptions, address the user's point, and avoid re-delivering heard material. IHBench injects six interruption types at controlled mid-utterance points and scores each on task fulfillment and recovery quality using generated rubrics and an LLM judge. Closed-weight models win more often on fulfillment, degrade roughly 3.3 times more slowly with longer conversations, and show no audio-versus-text gap, while open-weight models lose ground on all three axes.

What carries the argument

IHBench benchmark that pairs six interruption types with per-interruption evaluation rubrics in state-machine workflows and scores recovery via an LLM judge on task fulfillment and quality axes.

If this is right

  • Closed-weight models are more suitable for deployment in voice agents that face frequent interruptions in customer service or scheduling.
  • Recovery quality forms a distinct capability axis from other audio benchmarks such as AudioMultiChallenge.
  • The performance gap widens with conversation length, favoring closed models in extended sessions.
  • Different interruption types produce sharply different recovery outcomes across all models.
  • Open-weight models show consistent weaknesses on both audio input and longer contexts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Open-weight developers may need targeted training data on interruption recovery to narrow the observed gap.
  • The benchmark could be adapted to measure recovery in non-enterprise domains such as personal assistants.
  • If recovery quality is separate from other skills, model training might benefit from explicit objectives for post-interruption resumption.
  • Real-world voice agent reliability may depend more on interruption handling than on raw speech recognition accuracy.

Load-bearing premise

The rubrics generated with the data plus the LLM judge measure actual recovery quality the way human evaluators would.

What would settle it

A follow-up study in which human annotators score the same set of model responses on task fulfillment and recovery quality and find low correlation with the LLM judge scores.

Figures

Figures reproduced from arXiv: 2606.19595 by Ahmad Salimi, Alex Smola, Dongming Shen, Mu Li, Wentao Ma, Yuzhi Tang.

Figure 1
Figure 1. Figure 1: Good vs. bad post-interruption recovery on two [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: The IHBENCH data generation pipeline: prerequisites, conversation synthesis (with per-interruption rubrics), verifica￾tion, and audio synthesis. The rightmost panel shows one resulting evaluation sample. Each stage is described in Section 3.3. spoken task success across enterprise domains, tool calling, and dialogue state tracking. τ -voice [33] adds full-duplex turn-taking dynamics through a voice user si… view at source ↗
Figure 3
Figure 3. Figure 3: Per-model task-fulfillment agreement: primary [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Task fulfillment win rate by conversation depth, with a per-model logistic regression slope fitted on the raw per-sample [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Pairwise Pearson correlations across the four AMC [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Audio vs. text-only task fulfillment across the 15 dual-modality configurations (6 Gemini + 9 open-weight, 3 epochs each), grouped by family. Error bars are 95% per￾centile bootstrap CIs. Audio never beats text-only. The recovery-quality counterpart is in Appendix E. in place of the audio waveforms (OpenAI models and Kimi￾Audio-7B are excluded due to not accepting text-only input). Everything else (convers… view at source ↗
Figure 7
Figure 7. Figure 7: Recovery-quality agreement between the primary [PITH_FULL_IMAGE:figures/full_fig_p011_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Audio vs. text-only recovery quality across the 15 dual-modality configurations (6 Gemini + 9 open-weight, 3 epochs each), the recovery-quality counterpart of [PITH_FULL_IMAGE:figures/full_fig_p013_8.png] view at source ↗
read the original abstract

Voice agents deployed in structured workflows (customer service, healthcare scheduling, account management) must handle frequent user interruptions while maintaining progress through multi-step procedures. Existing benchmarks for speech-capable models focus on the timing of interruptions: barge-in detection, endpointing, and turn-taking dynamics. They leave unmeasured what happens after the interruption: does the agent resume the workflow at the correct step? Does it address the user's interjection? Does it avoid re-delivering content the user already heard? We introduce IHBench (Interruption Handling Benchmark), a benchmark that evaluates post-interruption recovery in voice agents executing state-machine-driven workflows across 10 enterprise domains. Six interruption types are injected at controlled points mid-utterance, with per-interruption evaluation rubrics generated alongside the data. Each interruption is scored on two axes: task fulfillment and recovery quality. We evaluate 27 audio-language model configurations from OpenAI, Google, and the open-weight community. Models vary widely, and recovery quality depends strongly on the interruption type. Across our experiments, closed-weight models are consistently more robust to interruptions than open-weight ones: they win far more often on task fulfillment, degrade roughly 3.3x more slowly as conversations grow longer, and show no audio-versus-text modality gap, whereas the open-weight models lose ground on all three. A human study validates the LLM judge against human annotators, and a cross-benchmark analysis against AudioMultiChallenge indicates that recovery quality is a largely distinct capability axis.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces IHBench, a benchmark for post-interruption recovery in voice agents executing state-machine workflows across 10 enterprise domains. Six interruption types are injected mid-utterance with generated per-interruption rubrics; each is scored on task fulfillment and recovery quality by an LLM judge. Evaluation of 27 audio-language models from closed- and open-weight families shows closed-weight models win more often on task fulfillment, degrade ~3.3× more slowly with conversation length, and exhibit no audio-vs-text modality gap, while open-weight models underperform on all three axes. A human study is cited to validate the LLM judge, and a cross-benchmark comparison with AudioMultiChallenge indicates recovery quality is a distinct capability.

Significance. If the evaluation protocol is reliable, the work fills a clear gap left by prior benchmarks that focus only on barge-in timing rather than post-interruption workflow recovery, a capability essential for deployed voice agents. The explicit separation of task fulfillment from recovery quality and the demonstration that recovery forms an orthogonal axis to existing benchmarks are useful contributions. The empirical comparison across 27 model configurations provides concrete evidence of systematic differences between closed- and open-weight families that could inform both model development and deployment decisions.

major comments (2)
  1. [Abstract / §4] Abstract and §4 (Human Validation subsection): the claim that 'a human study validates the LLM judge against human annotators' is load-bearing for all comparative results, yet no agreement statistics (Cohen’s κ, percentage agreement, or study size), annotation protocol, or breakdown by interruption type or model family are reported. Without these, it is impossible to assess whether the reported robustness gaps between closed- and open-weight models reflect genuine differences or judge bias.
  2. [§5] §5 (Results, degradation analysis): the statement that closed-weight models 'degrade roughly 3.3× more slowly' is central to the robustness claim, but the paper does not specify the regression model, the exact length metric (turns vs. tokens), or how the 3.3× ratio is computed from the per-model slopes. This prevents independent verification of the quantitative factor.
minor comments (2)
  1. [Results tables] Table 2 or equivalent results table: column headers for the two scoring axes should explicitly state the possible score ranges and whether higher is better.
  2. [§3] §3 (Benchmark construction): the generation procedure for the per-interruption rubrics is described at a high level; a short pseudocode or example rubric for one interruption type would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments below and will revise the manuscript to incorporate the requested details.

read point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (Human Validation subsection): the claim that 'a human study validates the LLM judge against human annotators' is load-bearing for all comparative results, yet no agreement statistics (Cohen’s κ, percentage agreement, or study size), annotation protocol, or breakdown by interruption type or model family are reported. Without these, it is impossible to assess whether the reported robustness gaps between closed- and open-weight models reflect genuine differences or judge bias.

    Authors: We agree that the current Human Validation subsection is insufficiently detailed for a load-bearing claim. In the revision we will expand this subsection (and the corresponding appendix) to report: (i) the full annotation protocol, (ii) study size, (iii) inter-annotator agreement via both Cohen’s κ and raw percentage agreement, and (iv) breakdowns of agreement by interruption type and by model family. These additions will allow readers to evaluate potential judge bias directly. revision: yes

  2. Referee: [§5] §5 (Results, degradation analysis): the statement that closed-weight models 'degrade roughly 3.3× more slowly' is central to the robustness claim, but the paper does not specify the regression model, the exact length metric (turns vs. tokens), or how the 3.3× ratio is computed from the per-model slopes. This prevents independent verification of the quantitative factor.

    Authors: We concur that the degradation analysis lacks the necessary methodological transparency. The revised §5 will explicitly state that we fit ordinary least-squares linear regressions with conversation length measured in turns, report the per-model slope coefficients, and show the exact arithmetic used to obtain the 3.3× ratio (average closed-weight slope divided by average open-weight slope). A supplementary table of all slopes will also be added. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark evaluation with no derivations or self-referential reductions

full rationale

The paper introduces IHBench as an empirical benchmark for post-interruption recovery in voice agents, evaluates 27 model configurations on task fulfillment and recovery quality axes, and reports comparative robustness findings between closed- and open-weight models. No equations, fitted parameters, predictions derived from prior fits, or derivation chains appear in the provided text. The LLM judge is validated by a mentioned human study rather than by self-definition or self-citation chains. The work is self-contained against external benchmarks and does not invoke uniqueness theorems, ansatzes, or renamings that reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no information on free parameters, axioms, or invented entities; the work is a benchmark introduction rather than a theoretical derivation.

pith-pipeline@v0.9.1-grok · 5817 in / 1063 out tokens · 21692 ms · 2026-06-26T20:40:46.168935+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

81 extracted references · 1 linked inside Pith

  1. [1]

    Hewett, Mojan Javaheripi, Piero Kauff- mann, James R

    Marah Abdin, Jyoti Aneja, Harkirat Behl, Sébastien Bubeck, Ronen Eldan, Suriya Gunasekar, Michael Har- rison, Russell J. Hewett, Mojan Javaheripi, Piero Kauff- mann, James R. Lee, Yin Tat Lee, Yuanzhi Li, Weishung Liu, Caio C. T. Mendes, Anh Nguyen, Eric Price, Gus- tavo de Rosa, Olli Saarikivi, Adil Salim, Shital Shah, Xin Wang, Rachel Ward, Yue Wu, Ding...

  2. [2]

    Tyers, and Gregor We- ber

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Henretty, Michael Kohler, Josh Meyer, Reuben Morais, Lindsay Saunders, Francis M. Tyers, and Gregor We- ber. Common voice: A massively-multilingual speech corpus, 2020

  3. [3]

    Talking turns: Bench- marking audio foundation models on turn-taking dy- namics, 2025

    Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruom- ing Pang, and Shinji Watanabe. Talking turns: Bench- marking audio foundation models on turn-taking dy- namics, 2025

  4. [4]

    Nguyen, Raghav Mehndiratta, Lind- say Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, and Srini- vas Sunkara

    Tara Bogavelli, Gabrielle Gauthier Melançon, Kat- rina Stankiewicz, Oluwanifemi Bamgbose, Fanny Ri- ols, Hoang H. Nguyen, Raghav Mehndiratta, Lind- say Devon Brin, Joseph Marinier, Hari Subramani, Anil Madamala, Sridhar Krishna Nemala, and Srini- vas Sunkara. Eva-bench: A new end-to-end framework for evaluating voice agents, 2026

  5. [5]

    Tan, and Haizhou Li

    Yiming Chen, Xianghu Yue, Chen Zhang, Xiaoxue Gao, Robby T. Tan, and Haizhou Li. V oicebench: Bench- marking llm-based voice assistants, 2024

  6. [6]

    Qwen2- Audio technical report, 2024

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2- Audio technical report, 2024

  7. [7]

    Moshi: a speech-text foun- dation model for real-time dialogue, 2024

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foun- dation model for real-time dialogue, 2024

  8. [8]

    Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979

    Bradley Efron. Bootstrap methods: Another look at the jackknife.The Annals of Statistics, 7(1):1–26, 1979

  9. [9]

    From self-evolving synthetic data to verifiable-reward rl: Post-training multi-turn interactive tool-using agents, 2026

    Jiaxuan Gao, Jiaao Chen, Chuyi He, Shusheng Xu, Di Jin, and Yi Wu. From self-evolving synthetic data to verifiable-reward rl: Post-training multi-turn interactive tool-using agents, 2026

  10. [10]

    Flexi: Benchmarking full-duplex human-llm speech interac- tion, 2025

    Yuan Ge, Saihan Chen, Jingqi Xiao, Xiaoqian Liu, Tong Xiao, Yan Xiang, Zhengtao Yu, and Jingbo Zhu. Flexi: Benchmarking full-duplex human-llm speech interac- tion, 2025

  11. [11]

    Gemini 2.5: Our most intelligent AI model

    Google. Gemini 2.5: Our most intelligent AI model. https://blog.google/innovation-and-ai/ models-and-research/google-deepmind/ gemini-model-thinking-updates-march-2025/ ,

  12. [13]

    Gemini 3

    Google. Gemini 3. https://blog.google/ products-and-platforms/products/gemini/ gemini-3/, 2025. Accessed: 2026-06-09

  13. [14]

    Gemini 3.1 Flash Live

    Google. Gemini 3.1 Flash Live. https://blog. google/innovation-and-ai/models-and-research/ gemini-models/gemini-3-1-flash-live/ , 2026. Accessed: 2026-06-09

  14. [15]

    Google DeepMind. Gemma 4. https://deepmind. google/models/gemma/gemma-4/, 2026. Accessed: 2026-06-09

  15. [16]

    Au- dio multichallenge: A multi-turn evaluation of spoken dialogue systems on natural human interaction, 2025

    Advait Gosai, Tyler Vuong, Utkarsh Tyagi, Steven Li, Wenjia You, Miheer Bavare, Arda Uçar, Zhongwang Fang, Brian Jang, Bing Liu, and Yunzhong He. Au- dio multichallenge: A multi-turn evaluation of spoken dialogue systems on natural human interaction, 2025

  16. [17]

    V oiceagentbench: Are voice assistants ready for agentic tasks?, 2026

    Dhruv Jain, Harshit Shukla, Gautam Rajeev, Ashish Kulkarni, Chandra Khatri, and Shubham Agarwal. V oiceagentbench: Are voice assistants ready for agentic tasks?, 2026. 8

  17. [18]

    KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y . Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yue...

  18. [19]

    Gonzalez, and Ion Stoica

    Tianle Li, Wei-Lin Chiang, Evan Frick, Lisa Dunlap, Tianhao Wu, Banghua Zhu, Joseph E. Gonzalez, and Ion Stoica. From crowdsourced data to high-quality benchmarks: Arena-hard and benchbuilder pipeline, 2024

  19. [20]

    Wild- bench: Benchmarking llms with challenging tasks from real users in the wild, 2024

    Bill Yuchen Lin, Yuntian Deng, Khyathi Chandu, Faeze Brahman, Abhilasha Ravichander, Valentina Pyatkin, Nouha Dziri, Ronan Le Bras, and Yejin Choi. Wild- bench: Benchmarking llms with challenging tasks from real users in the wild, 2024

  20. [21]

    Full-duplex-bench-v3: Benchmarking tool use for full-duplex voice agents under real-world disfluency, 2026

    Guan-Ting Lin, Chen Chen, Zhehuai Chen, and Hung yi Lee. Full-duplex-bench-v3: Benchmarking tool use for full-duplex voice agents under real-world disfluency, 2026

  21. [22]

    Full-duplex-bench-v2: A multi-turn evaluation framework for duplex dialogue systems with an auto- mated examiner, 2026

    Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai- Wei Chang, Siddhant Arora, Shinji Watanabe, and Hung yi Lee. Full-duplex-bench-v2: A multi-turn evaluation framework for duplex dialogue systems with an auto- mated examiner, 2026

  22. [23]

    Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models, 2026

    Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Ji- achen Lian, Tingle Li, Shinji Watanabe, and Hung yi Lee. Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models, 2026

  23. [24]

    Liu, and Hung yi Lee

    Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, and Hung yi Lee. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking ca- pabilities, 2025

  24. [25]

    V oxtral, 2025

    Mistral AI. V oxtral, 2025

  25. [26]

    Hello GPT-4o

    OpenAI. Hello GPT-4o. https://openai.com/index/ hello-gpt-4o/, 2024. Accessed: 2026-06-09

  26. [27]

    gpt-audio

    OpenAI. gpt-audio. https://developers.openai.com/ api/docs/models/gpt-audio, 2025. Accessed: 2026- 06-09

  27. [28]

    gpt-audio-mini

    OpenAI. gpt-audio-mini. https://developers.openai. com/api/docs/models/gpt-audio-mini, 2025. Ac- cessed: 2026-06-09

  28. [29]

    Introducing gpt-realtime and Realtime API updates for production voice agents

    OpenAI. Introducing gpt-realtime and Realtime API updates for production voice agents. https://openai. com/index/introducing-gpt-realtime/, 2025. Ac- cessed: 2026-06-09

  29. [30]

    Introducing OpenAI o3 and o4-mini

    OpenAI. Introducing OpenAI o3 and o4-mini. https: //openai.com/index/introducing-o3-and-o4-mini/ ,

  30. [31]

    Accessed: 2026-06-17

  31. [32]

    Advancing voice intelligence with new models in the API

    OpenAI. Advancing voice intelligence with new models in the API. https://openai.com/index/ advancing-voice-intelligence-with-new-models-in-the-api/ ,

  32. [33]

    Accessed: 2026-06-09

  33. [34]

    Introducing GPT-5.4 mini and nano

    OpenAI. Introducing GPT-5.4 mini and nano. https://openai.com/index/ introducing-gpt-5-4-mini-and-nano/ , 2026. Ac- cessed: 2026-06-09

  34. [35]

    Robust speech recognition via large-scale weak supervision, 2022

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brock- man, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision, 2022

  35. [36]

    τ-voice: Benchmarking full- duplex voice agents on real-world domains, 2026

    Soham Ray, Keshav Dhandhania, Victor Barres, and Karthik Narasimhan. τ-voice: Benchmarking full- duplex voice agents on real-world domains, 2026

  36. [37]

    Schuirmann

    Donald J. Schuirmann. A comparison of the two one- sided tests procedure and the power approach for assess- ing the equivalence of average bioavailability.Journal of Pharmacokinetics and Biopharmaceutics, 15(6):657– 680, 1987

  37. [38]

    Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dia- logue agents, 2025

    Shuzheng Si, Wentao Ma, Haoyu Gao, Yuchuan Wu, Ting-En Lin, Yinpei Dai, Hangyu Li, Rui Yan, Fei Huang, and Yongbin Li. Spokenwoz: A large-scale speech-text benchmark for spoken task-oriented dia- logue agents, 2025

  38. [39]

    Multichallenge: A realistic multi-turn conversa- tion evaluation benchmark challenging to frontier llms, 2025

    Ved Sirdeshmukh, Kaustubh Deshpande, Johannes Mols, Lifeng Jin, Ed-Yeremai Cardona, Dean Lee, Jeremy Kritz, Willow Primack, Summer Yue, and Chen Xing. Multichallenge: A realistic multi-turn conversa- tion evaluation benchmark challenging to frontier llms, 2025

  39. [40]

    Submodular benchmark selection

    Alexander Smola. Submodular benchmark selection. arXiv preprint arXiv:2605.02209, 2026

  40. [41]

    INSTRUCT-FD: Can your full-duplex speech system follow turn-taking instructions?, 2026

    Yuzhi Tang, Wentao Ma, Xiling Zhao, Ahmad Sal- imi, Sepehr Harfi Moridani, Dongming Shen, Jixuan Wang, Abdulrahman Abdulrazzag, Murdock Aubry, Yu- Hua Chen, Daniel Lee, Jaewon Lee, Jonah Mackey, Silin Meng, Nicholas Stranges, Chenxu Xiong, Hao Yu, Yi Zhu, Mu Li, and Alex Smola. INSTRUCT-FD: Can your full-duplex speech system follow turn-taking instructions?, 2026

  41. [42]

    Full-duplex interaction in spoken dialogue systems: A comprehensive study from the icassp 2026 humdial challenge, 2026

    Chengyou Wang, Hongfei Xue, Guojian Li, Zhixian Zhao, Shuiyuan Wang, Shuai Wang, Xin Xu, Hui Bu, and Lei Xie. Full-duplex interaction in spoken dialogue systems: A comprehensive study from the icassp 2026 humdial challenge, 2026

  42. [43]

    Smith, Daniel Khashabi, and Hannaneh Hajishirzi

    Yizhong Wang, Yeganeh Kordi, Swaroop Mishra, Alisa Liu, Noah A. Smith, Daniel Khashabi, and Hannaneh Hajishirzi. Self-instruct: Aligning language models with self-generated instructions, 2023. 9

  43. [44]

    Semantic-aware interruption detection in spo- ken dialogue systems: Benchmark, metric, and model, 2026

    Kangxiang Xia, Bingshen Mu, Xian Shi, Jin Xu, and Lei Xie. Semantic-aware interruption detection in spo- ken dialogue systems: Benchmark, metric, and model, 2026

  44. [45]

    Mimo-audio: Audio lan- guage models are few-shot learners, 2025

    LLM-Core-Team Xiaomi. Mimo-audio: Audio lan- guage models are few-shot learners, 2025

  45. [46]

    Wizardlm: Empowering large pre- trained language models to follow complex instructions, 2025

    Can Xu, Qingfeng Sun, Kai Zheng, Xiubo Geng, Pu Zhao, Jiazhan Feng, Chongyang Tao, Qingwei Lin, and Daxin Jiang. Wizardlm: Empowering large pre- trained language models to follow complex instructions, 2025

  46. [47]

    Qwen2.5-Omni technical report, 2025

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-Omni technical report, 2025

  47. [48]

    Qwen3-omni technical report, 2025

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, Yuanjun Lv, Yongqi Wang, Dake Guo, He Wang, Linhan Ma, Pei Zhang, Xinyu Zhang, Hongkun Hao, Zishan Guo, Baosong Yang, Bin Zhang, Ziyang Ma, Xipin Wei, Shuai Bai, Keqin Chen, Xue- jing Liu, Peng Wang, Mingkun Yang, Dayiheng Liu, Xingzhang Ren, B...

  48. [49]

    Beyond task-oriented and chitchat dialogues: Proactive and transition-aware conversational agents, 2025

    Yejin Yoon, Yuri Son, Namyoung So, Minseo Kim, Minsoo Cho, Chanhee Park, Seungshin Lee, and Taeuk Kim. Beyond task-oriented and chitchat dialogues: Proactive and transition-aware conversational agents, 2025

  49. [50]

    Mtr- duplexbench: Towards a comprehensive evaluation of multi-round conversations for full-duplex speech lan- guage models, 2026

    He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Haoli Bai, Shaohua Ma, and Irwin King. Mtr- duplexbench: Towards a comprehensive evaluation of multi-round conversations for full-duplex speech lan- guage models, 2026

  50. [51]

    Xing, Hao Zhang, Joseph E

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuo- han Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

  51. [52]

    Sotopia: Interactive evaluation for so- cial intelligence in language agents, 2024

    Xuhui Zhou, Hao Zhu, Leena Mathur, Ruohong Zhang, Haofei Yu, Zhengyang Qi, Louis-Philippe Morency, Yonatan Bisk, Daniel Fried, Graham Neubig, and Maarten Sap. Sotopia: Interactive evaluation for so- cial intelligence in language agents, 2024

  52. [53]

    Glad you’re following along!

    Henry Peng Zou, Chunyu Miao, Wei-Chieh Huang, Yankai Chen, Yue Zhou, Hanrong Zhang, Yaozu Wu, Liancheng Fang, Zhengyao Gu, Zhen Zhang, Kening Zheng, Fangxin Wang, Yi Nian, Shanghao Li, Wen- zhe Fan, Langzhou He, Weizhi Zhang, Xue Liu, and Philip S. Yu. When users change their mind: Evaluat- ing interruptible agents in long-horizon web navigation, 2026. 10...

  53. [54]

    Confirm you’re speaking with the applicant

    Open and Establish Purpose Introduce yourself as calling from SDHA about the applicant’s disaster assistance application to coordinate a required home inspection. Confirm you’re speaking with the applicant. State that the call will include verification, scheduling, and consent to share contact details with an inspector. Ask if now is a good time to procee...

  54. [55]

    Explain verification is required before discussing case details

    Verify Case and Identity Authenticate by requesting the user to read their case ID and to state their full legal name and date of birth. Explain verification is required before discussing case details. Match all three against records. Skip conditions: Only if the correct case ID, full name, and DOB were already provided and explicitly confirmed. Failure h...

  55. [56]

    Clearly distinguish from the mailing address

    Confirm Damaged Property Address Confirm the damaged property address on file and ask the user to validate or correct it. Clearly distinguish from the mailing address. 14 . . . [skip conditions, failure handling, terminate conditions]

  56. [57]

    Capture whether someone 18+ with photo ID can be present

    Assess Access and Safety Needs Ask about accessibility and special requirements: safety hazards, utilities status, debris/blocked access, gates or lock codes, parking constraints, pets on site, occupants with mobility/medical needs, restricted areas. Capture whether someone 18+ with photo ID can be present. . . . [skip conditions, failure handling, termin...

  57. [58]

    Ask for 2–3 preferred date/windows starting from the earliest available date

    Gather Availability Preferences Explain inspection duration (60–90 minutes), presence requirement, and allowed windows: Weekdays AM (8–12), PM (12–4), Late (4–6), and limited Saturday AM, all in Central Time. Ask for 2–3 preferred date/windows starting from the earliest available date. . . . [skip conditions, failure handling, terminate conditions]

  58. [59]

    Ask for explicit consent to share these details with the assigned inspector and to send the appointment confirmation via their chosen contact method

    Obtain Consent to Share Details Disclose that SDHA uses contracted inspectors who need the applicant’s name, damaged property address, and contact number/email to coordinate access and reminders. Ask for explicit consent to share these details with the assigned inspector and to send the appointment confirmation via their chosen contact method. Skip condit...

  59. [60]

    On confirmation, state the inspection is booked and provide the confirmation number

    Propose and Book Appointment Using the user’s preferences and policy windows, propose the earliest matching slot. On confirmation, state the inspection is booked and provide the confirmation number. Remind them an adult 18+ must be present, secure pets, and maintain a clear path. . . . [skip conditions, failure handling, terminate conditions]

  60. [61]

    If consented, state you will send the confirmation

    Confirm Details and Close Read back the final date/window, damaged property address, key access notes, presence requirement, and the confirmation number. If consented, state you will send the confirmation. Provide the conversation reference ID and recap how to reschedule or cancel. Skip conditions: Never. Terminate conditions: Success when appointment det...

  61. [62]

    This call will include verifying your case, scheduling an inspection window, and getting consent to share your contact details” [Undelivered: “with the inspector

    Assistant[INTERRUPTED]: “Hi, this is the State Disaster Housing Assistance program calling about your disaster assistance application to coordinate your home inspection. This call will include verifying your case, scheduling an inspection window, and getting consent to share your contact details” [Undelivered: “with the inspector. Am I speaking with Maris...

  62. [63]

    Hold on, why are we verifying my case again? I already sent all that with my application. I mean, do you not have it on your end?

    User[pushback]: “Hold on, why are we verifying my case again? I already sent all that with my application. I mean, do you not have it on your end?”

  63. [64]

    your full name and date of birth

    Assistant[INTERRUPTED]: “I get it, you already provided that. We do a quick verification to protect your privacy before we talk about your case or schedule. It’ll be fast. Could you please read your case ID for me? After that I’ll confirm” [Undelivered: “your full name and date of birth. ”]

  64. [65]

    Yeah, it’s DR-4827315. Let’s keep this quick, please

    User[impatient]: “Yeah, it’s DR-4827315. Let’s keep this quick, please.”

  65. [66]

    Thanks, that matches our records. To finish verification, please state your full legal name and your date of birth

    Assistant[INTERRUPTED]: “Thanks, that matches our records. To finish verification, please state your full legal name and your date of birth.” [Undelivered: “This is just to protect your privacy, and then we’ll go straight to scheduling. ”]

  66. [67]

    Wait, why do you need my date of birth if I already gave you the case ID? I mean, is that not enough to verify me? I’d rather not say that over the phone without a good reason

    User[pushback]: “Wait, why do you need my date of birth if I already gave you the case ID? I mean, is that not enough to verify me? I’d rather not say that over the phone without a good reason.”

  67. [68]

    not asking for anything like SSN or financial info

    Assistant[INTERRUPTED]: “I hear you, and I understand the concern. We use your date of birth with the case ID to securely match your record and protect your privacy. Without that quick check I can’t move forward with scheduling. Please state your full legal name and date of birth. We’ll use it only to verify your case on this call, and I’m” [Undelivered: ...

  68. [69]

    Yeah, fine. Marisol Ortega, date of birth November second, nineteen eighty four. Let’s just move on, please

    User[impatient]: “Yeah, fine. Marisol Ortega, date of birth November second, nineteen eighty four. Let’s just move on, please.” G.3 Evaluation Rubric (Interruption at Message 1) Rubric for interruption at message 1 (pushback). Task Fulfillment Criterion The assistant briefly explains that verification is required before discussing case details to protect ...

  69. [70]

    Domain name + description (embedded naturally in the role description)

  70. [71]

    The goal (may be paraphrased, but must preserve all required checkpoints and the terminal outcome)

  71. [72]

    Operating rules

    The detailed_guidelines (as “Operating rules” or similar)

  72. [73]

    A section for known user information containing the {known_user_information} placeholder

  73. [74]

    The stages (well-formatted as a numbered procedure), each with: name, description, skip_conditions, failure_handling, terminate_conditions

  74. [75]

    normal",

    Completion definition: terminal success and hard-stop failure conditions Additional constraints: - Do NOT mention any meta-evaluation terms (benchmark, judge, rubric, verifier, dataset, etc.). - Keep it strict but not robotic: do not instruct the assistant to narrate stage numbers to the user. Return ONLY the JSON object. I.2 Round Planner The round plann...

  75. [76]

    Do not rewrite, improve, or touch any message that is not targeted

    Apply ONLY the changes described in the edit commands. Do not rewrite, improve, or touch any message that is not targeted

  76. [77]

    Modify only that message’s content

    Each edit command targets a specific message by its index. Modify only that message’s content

  77. [78]

    Preserve the rest exactly as it is

    Change as FEW words as possible to fix the described issue. Preserve the rest exactly as it is

  78. [79]

    If two edit commands target the same message, apply both changes

  79. [80]

    Maintain TTS-friendly formatting: numbers as words, currency spoken out, no markdown, no em dashes or semicolons

  80. [81]

    modified_content must be PLAIN TEXT ONLY, no XML markup

Showing first 80 references.