pith. machine review for the scientific record.

arxiv: 2605.10199 · v1 · submitted 2026-05-11 · 💻 cs.CL · eess.AS

Recognition: 2 theorem links


How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:04 UTC · model grok-4.3

classification 💻 cs.CL eess.AS
keywords full-duplex spoken dialogue · user-stream routing · channel fusion · cross-attention routing · spoken question answering · context corruption · interruptions · LLM dialogue systems

The pith

The way an LLM routes incoming user speech while generating its own response creates a clear tradeoff between semantic accuracy and robustness to interruptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines how large language models can support full-duplex spoken dialogue, in which the system must keep listening to the user while producing its spoken reply. It tests two routing approaches under the same training setup: one that merges the user audio stream directly into the model's input sequence, and another that keeps the user stream separate and reaches it only through added cross-attention layers. Experiments on spoken question-answering and interaction tasks show the direct-merge approach produces stronger understanding and better answers overall. The same approach, however, lets overlapping user speech corrupt the model's ongoing generation when interruptions occur, resulting in incoherent continuations. The separated approach sacrifices some answer quality but maintains cleaner generation context during overlaps. Readers care because this choice directly affects whether voice-based AI can handle natural, overlapping conversation without breaking down.

Core claim

User-stream routing is a central design axis for full-duplex spoken dialogue. Channel fusion, which injects the user stream directly into the LLM input, delivers stronger semantic grounding and higher performance on spoken question answering. Under conditions of semantic overlap such as user interruptions, however, it is vulnerable to context corruption: if the model does not stop in time, the incoming user stream interferes with generation and produces semantically incoherent continuations. Cross-attention routing, which maintains the user stream as external memory accessed through adapters, yields lower question-answering scores yet better preserves the LLM's generation context and proves more robust to this failure mode.

What carries the argument

User-stream routing strategy, implemented either as channel fusion that merges incoming user audio directly into the main token sequence or as cross-attention adapters that treat the user stream as separate external memory.
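
To make the two strategies concrete, here is a minimal PyTorch sketch; the dimensions, module names, and the tanh gate are illustrative assumptions for this page, not the paper's implementation.

    import torch
    import torch.nn as nn

    D, HEADS = 256, 4

    class ChannelFusion(nn.Module):
        """Merge user-stream frames directly into the main token sequence."""
        def __init__(self):
            super().__init__()
            self.block = nn.TransformerEncoderLayer(D, HEADS, batch_first=True)

        def forward(self, assistant_tokens, user_frames):
            # One fused sequence: self-attention sees both streams, so user
            # audio is fully integrated into (and can interfere with) the
            # context the model must continue.
            fused = torch.cat([user_frames, assistant_tokens], dim=1)
            return self.block(fused)

    class CrossAttentionRouting(nn.Module):
        """Keep the user stream as external memory behind an adapter."""
        def __init__(self):
            super().__init__()
            self.block = nn.TransformerEncoderLayer(D, HEADS, batch_first=True)
            self.xattn = nn.MultiheadAttention(D, HEADS, batch_first=True)
            self.gate = nn.Parameter(torch.zeros(1))  # adapter starts closed

        def forward(self, assistant_tokens, user_frames):
            # The main sequence stays assistant-only; user audio is consulted
            # through cross-attention, insulating the generation context.
            h = self.block(assistant_tokens)
            u, _ = self.xattn(h, user_frames, user_frames)
            return h + torch.tanh(self.gate) * u

    assistant = torch.randn(1, 10, D)  # tokens generated so far
    user = torch.randn(1, 25, D)       # incoming user audio frames
    print(ChannelFusion()(assistant, user).shape)          # (1, 35, 256)
    print(CrossAttentionRouting()(assistant, user).shape)  # (1, 10, 256)

The output shapes make the structural difference visible: fusion grows the very sequence the LLM must continue, while the adapter leaves that sequence untouched.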

If this is right

  • Full-duplex systems should prefer channel fusion when question-answering accuracy is the dominant requirement.
  • Cross-attention routing should be selected when preserving coherent generation during user overlaps is more important.
  • Neither routing method alone resolves all challenges of simultaneous listening and speaking.
  • Benchmarks for spoken dialogue must include explicit semantic-overlap cases to evaluate real robustness.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Hybrid routing that switches between the two methods based on detected overlap could combine their respective strengths (see the sketch after this list).
  • The observed tradeoff may appear in other generation settings where external inputs arrive asynchronously, such as live translation or collaborative writing tools.
  • Deployment choices may hinge on measuring typical overlap frequency for the intended user population.
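
A minimal sketch of the hybrid idea from the first bullet above, offered purely as a hypothetical: overlap_detected stands in for any voice-activity or barge-in detector, and the two routing callables are assumed to exist (for instance, the fusion and cross-attention sketches earlier on this page).

    # Hypothetical hybrid router; nothing in the paper implements this.
    def route_step(assistant_tokens, user_frames, overlap_detected,
                   fusion_path, xattn_path):
        if overlap_detected:
            # During semantic overlap, insulate the generation context.
            return xattn_path(assistant_tokens, user_frames)
        # Otherwise take the higher-accuracy fused path.
        return fusion_path(assistant_tokens, user_frames)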

Load-bearing premise

That the shared training pipeline isolates the routing choice fairly and that the chosen benchmarks capture typical real-world interruption patterns without selection bias.

What would settle it

A controlled test on a high-interruption benchmark in which channel fusion produces coherent continuations at rates matching or exceeding cross-attention while retaining its question-answering advantage.
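
One way to score that decisive comparison, sketched under assumed data structures; none of this comes from the paper's evaluation code.

    from dataclasses import dataclass

    @dataclass
    class Trial:
        coherent: bool    # continuation judged coherent despite the overlap
        qa_correct: bool  # answer correct on the paired QA probe

    def rates(trials):
        n = len(trials)
        return (sum(t.coherent for t in trials) / n,
                sum(t.qa_correct for t in trials) / n)

    def settles_it(cf_trials, xa_trials):
        cf_coh, cf_qa = rates(cf_trials)
        xa_coh, xa_qa = rates(xa_trials)
        # Channel fusion would dissolve the tradeoff if it matched
        # cross-attention on coherence while keeping its QA edge.
        return cf_coh >= xa_coh and cf_qa > xa_qa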

Figures

Figures reproduced from arXiv: 2605.10199 by Hui Lu, Huimeng Wang, Shiyin Kang, Shuhai Peng, Xixin Wu, Xueyuan Chen, Zhiyong Wu.

Figure 1: Model architecture. Segment the input waveform into fixed-size chunks, extract mel-spectrogram features for each chunk independently, and feed the resulting spectrogram chunks to the encoder incrementally. During inference, the encoder maintains continuity across chunks using KV caching. The original Whisper encoder consists of convolutional layers followed by bidirectional self-attention blocks. To preserv… view at source ↗

Figure 2: Input and output data formats for different tasks. view at source ↗

Figure 3: Note that we may tune this a bit for different QA datasets. view at source ↗

Figure 3: Prompt for rewriting written QA into the spoken one. view at source ↗

Figure 4: Prompt for extending an interruption turn into a base conversation. view at source ↗

Figure 5: Failure case under missed interruption: CF-Duplex produces a semantically incoherent… view at source ↗

Figure 6: Failure case under missed interruption: CF-Duplex produces a semantically incoherent… view at source ↗

Figure 7: Failure case under missed interruption: CF-Duplex produces a semantically incoherent continuation, while XA-Duplex remains coherent (sample 3). view at source ↗
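
The incremental, KV-cached chunk encoding described in Figure 1's caption can be sketched as follows; the chunk size, feature dimensions, and cache layout are placeholders rather than the paper's values, and only the control flow (per-chunk encoding over a growing key/value memory) follows the caption.

    import torch
    import torch.nn as nn

    class StreamingEncoder(nn.Module):
        def __init__(self, d=256, heads=4):
            super().__init__()
            self.attn = nn.MultiheadAttention(d, heads, batch_first=True)
            self.kv = None  # cached frames from earlier chunks

        def step(self, chunk_feats):
            # Extend the cache so this chunk can attend over all past chunks
            # without re-encoding them.
            self.kv = chunk_feats if self.kv is None else torch.cat(
                [self.kv, chunk_feats], dim=1)
            out, _ = self.attn(chunk_feats, self.kv, self.kv)
            return out

    enc = StreamingEncoder()
    for _ in range(3):                  # three incoming chunks
        feats = torch.randn(1, 8, 256)  # stand-in for per-chunk mel features
        out = enc.step(feats)           # encodes with access to past chunks
    print(out.shape)                    # (1, 8, 256)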
read the original abstract

Full-duplex spoken dialogue requires a model to keep listening while generating its own spoken response. This is challenging for large language models (LLMs), which are designed to extend a single coherent sequence and do not naturally support user input arriving during generation. We argue that how the user stream is routed into the LLM is therefore a key architectural question for full-duplex modeling. To study this question, we extend a text-only LLM into a unified full-duplex spoken dialogue system and compare two routing strategies under a shared training pipeline: (i) channel fusion, which injects the user stream directly into the LLM input, and (ii) cross-attention routing, which keeps the user stream as external memory accessed through cross-attention adapters. Experiments on spoken question answering and full-duplex interaction benchmarks reveal a clear tradeoff. Channel fusion yields stronger semantic grounding and consistently better question-answering performance. However, under semantically overlapping conditions such as user interruptions, it is more vulnerable to context corruption: if the model fails to stop in time, the overlapping user stream can interfere with ongoing generation and lead to semantically incoherent continuations. Cross-attention routing underperforms on question answering, but better preserves the LLM generation context and is more robust to this failure mode. These results establish user-stream routing as a central design axis in full-duplex spoken dialogue and offer practical guidance on the tradeoff between semantic integration and context robustness. We provide a demo page for qualitative inspection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper extends a text-only LLM into a full-duplex spoken dialogue system and compares two user-stream routing strategies under a shared training pipeline: channel fusion (direct injection of the user stream into the LLM input sequence) versus cross-attention routing (user stream maintained as external memory accessed via adapters). Experiments on spoken question answering and full-duplex interaction benchmarks are reported to show a tradeoff: channel fusion yields stronger semantic grounding and better QA performance, but is vulnerable to context corruption when user input overlaps during generation (e.g., interruptions), producing incoherent continuations; cross-attention routing underperforms on QA but better preserves generation context and is more robust to such failures. The work positions user-stream routing as a central design axis and provides a demo page for qualitative inspection.

Significance. If the reported tradeoff is shown to be robust, the paper makes a useful contribution by empirically identifying a key architectural choice for full-duplex spoken dialogue systems and offering practical guidance on balancing semantic integration against context robustness. The authors deserve credit for including a demo page that supports qualitative evaluation of model behavior in interactive settings, which aids interpretability beyond the quantitative benchmarks.

major comments (2)
  1. [Abstract] The central attribution of the QA advantage and interruption vulnerability to the routing strategy (abstract) depends on the shared training pipeline producing comparably optimized models for both approaches. The abstract does not indicate whether separate hyperparameter sweeps, learning-rate schedules, or capacity-matched ablations were performed; channel fusion changes the primary input sequence and gradient paths while cross-attention adds adapter parameters, so identical hyperparameters could produce performance gaps that reflect optimization dynamics rather than the architectural distinction. This is load-bearing for the tradeoff claim.
  2. [Experiments] The experiments section claims the benchmarks reveal a 'clear tradeoff' with consistent differences, yet the abstract provides no information on the number of runs, error bars, statistical significance tests, or data exclusion rules. Without these details it is impossible to verify whether the reported performance gaps between channel fusion and cross-attention routing are robust or could be affected by post-hoc choices.
minor comments (2)
  1. The abstract states that a demo page is provided, but the main text or a footnote should include the direct URL or access instructions so readers can locate it without additional searching.
  2. Notation for the two routing strategies could be introduced with explicit symbols (e.g., CF for channel fusion) the first time they appear, to improve readability when the strategies are contrasted in later sections.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment of the paper's contribution, the identification of user-stream routing as a key design axis, and the value placed on the demo page for qualitative evaluation. We address the major comments point by point below, providing clarifications on our experimental design and committing to revisions that strengthen the reporting of results and methodology.

read point-by-point responses
  1. Referee: [Abstract] The central attribution of the QA advantage and interruption vulnerability to the routing strategy (abstract) depends on the shared training pipeline producing comparably optimized models for both approaches. The abstract does not indicate whether separate hyperparameter sweeps, learning-rate schedules, or capacity-matched ablations were performed; channel fusion changes the primary input sequence and gradient paths while cross-attention adds adapter parameters, so identical hyperparameters could produce performance gaps that reflect optimization dynamics rather than the architectural distinction. This is load-bearing for the tradeoff claim.

    Authors: The shared training pipeline was intentionally designed with identical hyperparameters, learning-rate schedules, batch sizes, and optimization procedures for both routing strategies. This controlled setup isolates the effect of the routing mechanism itself rather than allowing independent optimization that would confound the architectural comparison. Separate hyperparameter sweeps were not performed, as they would undermine the goal of evaluating the tradeoff under a unified training regime. We will revise the abstract to explicitly note the identical training configuration and expand the methods and experiments sections with full details on hyperparameters, adapter capacity, and any related ablations. revision: yes

  2. Referee: [Experiments] The experiments section claims the benchmarks reveal a 'clear tradeoff' with consistent differences, yet the abstract provides no information on the number of runs, error bars, statistical significance tests, or data exclusion rules. Without these details it is impossible to verify whether the reported performance gaps between channel fusion and cross-attention routing are robust or could be affected by post-hoc choices.

    Authors: We agree that statistical rigor is necessary to substantiate the consistency of the observed tradeoffs. The reported results were obtained from multiple independent runs with different random seeds. In the revised manuscript we will add error bars to all quantitative figures and tables, specify the exact number of runs, include statistical significance tests for key comparisons, and document any data exclusion or preprocessing rules applied to the benchmarks. revision: yes
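
For context, one standard way to deliver the committed significance testing is a paired bootstrap over per-seed scores; the choice of test and the numbers below are illustrative, since the rebuttal specifies neither.

    import random

    def paired_bootstrap(cf_scores, xa_scores, n_resamples=10_000, seed=0):
        """Fraction of resamples in which channel fusion beats cross-attention."""
        rng = random.Random(seed)
        pairs = list(zip(cf_scores, xa_scores))
        wins = 0
        for _ in range(n_resamples):
            sample = [rng.choice(pairs) for _ in pairs]
            mean_diff = sum(cf - xa for cf, xa in sample) / len(sample)
            wins += mean_diff > 0
        return wins / n_resamples

    # e.g. QA accuracy from five training seeds per routing strategy (made up):
    print(paired_bootstrap([71.2, 70.8, 71.9, 70.5, 71.4],
                           [68.9, 69.3, 68.5, 69.0, 68.7]))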

Circularity Check

0 steps flagged

No circularity: empirical benchmark comparison of routing strategies

full rationale

The paper reports experimental results comparing channel fusion and cross-attention routing under a shared training pipeline on spoken QA and full-duplex interaction benchmarks. No mathematical derivations, equations, or self-referential definitions are present that would reduce the tradeoff claims to fitted parameters or prior self-citations by construction. The central findings (QA advantage for fusion, robustness advantage for cross-attention) are presented as direct observations from the benchmarks rather than derived quantities. This is a standard empirical architecture study with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the two routing methods can be compared fairly under a shared training pipeline and that benchmark results generalize to real interruptions. No new entities are postulated and no free parameters are described in the abstract.

axioms (1)
  • domain assumption Text-only LLMs can be extended to spoken dialogue via adapters or input fusion while preserving core generation behavior
    Invoked when the authors extend a text-only LLM into a unified full-duplex system.
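
The adapter route this axiom points to can be made concrete with a LoRA-style wrapper (after reference [18] in the graph below): a frozen base layer plus a zero-initialized low-rank update reproduces the base model exactly at initialization, which is one precise sense of preserving core generation behavior. A sketch, not the paper's setup.

    import torch
    import torch.nn as nn

    class LoRALinear(nn.Module):
        def __init__(self, base: nn.Linear, rank: int = 8):
            super().__init__()
            self.base = base.requires_grad_(False)  # freeze the text-only LLM
            self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
            self.B = nn.Parameter(torch.zeros(base.out_features, rank))

        def forward(self, x):
            # B is zero-initialized, so the wrapper starts as the identity
            # on top of the frozen base layer.
            return self.base(x) + x @ self.A.T @ self.B.T

    layer = LoRALinear(nn.Linear(256, 256))
    x = torch.randn(1, 256)
    assert torch.allclose(layer(x), layer.base(x))  # base behavior preserved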

pith-pipeline@v0.9.0 · 5590 in / 1490 out tokens · 68418 ms · 2026-05-12T03:04:12.495477+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · 7 internal anchors

  1. [1]

Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, Roman Ring, Eliza Rutherford, Serkan Cabi, Tengda Han, Zhitao Gong, Sina Samangooei, Marianne Monteiro, Jacob L. Menick, Sebastian Borgeaud, Andy Brock, Aida Nematzadeh, Sahand Sharifzadeh, Mikolaj Binko...

  2. [2]

    Common voice: A massively-multilingual speech corpus

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common voice: A massively-multilingual speech corpus. In Nicoletta Calzolari, Frédéric Béchet, Philippe Blache, Khalid Choukri, Christopher Cieri, Thierry Declerck, Sara Goggi, Hitoshi Isahara, Bente Ma...

  3. [3]

    Language models are few-shot learners

Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel Ziegler, Jeffrey Wu, Clemens Winter, Chris Hesse, Mark Chen, Eric Sigler, Mateusz Litwin, Scott G...

  4. [4]

Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio

Guoguo Chen, Shuzhou Chai, Guan-Bo Wang, Jiayu Du, Wei-Qiang Zhang, Chao Weng, Dan Su, Daniel Povey, Jan Trmal, Junbo Zhang, Mingjie Jin, Sanjeev Khudanpur, Shinji Watanabe, Shuaijiang Zhao, Wei Zou, Xiangang Li, Xuchen Yao, Yongqing Wang, Zhao You, and Zhiyong Yan. Gigaspeech: An evolving, multi-domain ASR corpus with 10,000 hours of transcribed audio. ...

  5. [5]

MinMo: A multimodal large language model for seamless voice interaction

    Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, Yabin Li, Xiang Lv, Jiaqing Liu, Haoneng Luo, Bin Ma, Chongjia Ni, Xian Shi, Jialong Tang, Hui Wang, Hao Wang, Wen Wang, Yuxuan Wang, Yunlan Xu, Fan Yu, Zhijie Yan, Yexin Yang, Baosong Yang, Xian Yang, Guanrou Yang, Tianyu Zhao, Qingl...

  6. [6]

    Qwen2-Audio Technical Report

    Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen2-audio technical report. CoRR, abs/2407.10759, 2024

  7. [7]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue.CoRR, abs/2410.00037, 2024

  8. [8]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 3029–3051, Singapore, Dec...

  9. [9]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, Fan Yu, Huadai Liu, Zhengyan Sheng, Yue Gu, Chong Deng, Wen Wang, Shiliang Zhang, Zhijie Yan, and Jingren Zhou. Cosyvoice 2: Scalable streaming speech synthesis with large language models.CoRR, abs/2412.10117, 2024

  10. [10]

LLaMA-Omni: Seamless speech interaction with large language models

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. LLaMA-Omni: Seamless speech interaction with large language models. In The Thirteenth International Conference on Learning Representations, ICLR 2025, Singapore, April 24-28, 2025. OpenReview.net, 2025

  11. [11]

LLaMA-Omni 2: LLM-based real-time spoken chatbot with autoregressive streaming speech synthesis

Qingkai Fang, Yan Zhou, Shoutao Guo, Shaolei Zhang, and Yang Feng. LLaMA-Omni 2: LLM-based real-time spoken chatbot with autoregressive streaming speech synthesis. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors, Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long ...

  12. [12]

VITA: Towards open-source interactive omni multimodal LLM

    Chaoyou Fu, Haojia Lin, Zuwei Long, Yunhang Shen, Meng Zhao, Yifan Zhang, Xiong Wang, Di Yin, Long Ma, Xiawu Zheng, Ran He, Rongrong Ji, Yunsheng Wu, Caifeng Shan, and Xing Sun. VITA: towards open-source interactive omni multimodal LLM.CoRR, abs/2408.05211, 2024

  13. [13]

VITA-1.5: Towards GPT-4o level real-time vision and speech interaction

    Chaoyou Fu, Haojia Lin, Xiong Wang, Yifan Zhang, Yunhang Shen, Xiaoyu Liu, Haoyu Cao, Zuwei Long, Heting Gao, Ke Li, Long Ma, Xiawu Zheng, Rongrong Ji, Xing Sun, Caifeng Shan, and Ran He. VITA-1.5: towards gpt-4o level real-time vision and speech interaction. CoRR, abs/2501.01957, 2025

  14. [14]

The people’s speech: A large-scale diverse English speech recognition dataset for commercial usage

Daniel Galvez, Greg Diamos, Juan Torres, Keith Achorn, Juan Cerón, Anjali Gopi, David Kanter, Max Lam, Mark Mazumder, and Vijay Janapa Reddi. The people’s speech: A large-scale diverse English speech recognition dataset for commercial usage. In J. Vanschoren and S. Yeung, editors, Proceedings of the Neural Information Processing Systems Track on Datasets ...

  15. [15]

    Audio flamingo 3: Advancing audio intelligence with fully open large audio language models

Sreyan Ghosh, Arushi Goel, Jaehyeon Kim, Sonal Kumar, Zhifeng Kong, Sang-gil Lee, Chao-Han Huck Yang, Ramani Duraiswami, Dinesh Manocha, Rafael Valle, and Bryan Catanzaro. Audio flamingo 3: Advancing audio intelligence with fully open large audio language models. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

  16. [16]

Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities

Sreyan Ghosh, Zhifeng Kong, Sonal Kumar, S Sakshi, Jaehyeon Kim, Wei Ping, Rafael Valle, Dinesh Manocha, and Bryan Catanzaro. Audio flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities. In Forty-second International Conference on Machine Learning, 2025

  17. [17]

    Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation

    Haorui He, Zengqiang Shang, Chaoren Wang, Xuyuan Li, Yicheng Gu, Hua Hua, Liwei Liu, Chen Yang, Jiaqi Li, Peiyang Shi, Yuancheng Wang, Kai Chen, Pengyuan Zhang, and Zhizheng Wu. Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation. CoRR, abs/2501.15907, 2025

  18. [18]

LoRA: Low-rank adaptation of large language models

Edward J. Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Lu Wang, and Weizhu Chen. LoRA: Low-rank adaptation of large language models. In The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022. OpenReview.net, 2022

  19. [19]

    Efficient and direct duplex modeling for speech-to-speech language model

    Ke Hu, Ehsan Hosseini-Asl, Chen Chen, Edresson Casanova, Subhankar Ghosh, Piotr Zelasko, Zhehuai Chen, Jason Li, Jagadeesh Balam, and Boris Ginsburg. Efficient and direct duplex modeling for speech-to-speech language model. In Odette Scharenborg, Catharine Oertel, and Khiet Truong, editors,26th Annual Conference of the International Speech Communication A...

  20. [20]

TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension

Mandar Joshi, Eunsol Choi, Daniel S. Weld, and Luke Zettlemoyer. TriviaQA: A large scale distantly supervised challenge dataset for reading comprehension. In Regina Barzilay and Min-Yen Kan, editors, Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics, ACL 2017, Vancouver, Canada, July 30 - August 4, Volume 1: Long Pape...

  21. [21]

Kimi-Audio technical report

KimiTeam, Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, Zhengtao Wang, Chu Wei, Yifei Xin, Xinran Xu, Jianwei Yu, Yutao Zhang, Xinyu Zhou, Y. Charles, Jun Chen, Yanru Chen, Yulun Du, Weiran He, Zhenxing Hu, Guokun Lai, Qingcheng Li, Yangyang Liu, Weidong Sun, Jianzhou Wang, Yuzhi Wang, Yue...

  22. [22]

    Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities

    Zhifeng Kong, Arushi Goel, Rohan Badlani, Wei Ping, Rafael Valle, and Bryan Catanzaro. Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities. In Ruslan Salakhutdinov, Zico Kolter, Katherine A. Heller, Adrian Weller, Nuria Oliver, Jonathan Scarlett, and Felix Berkenkamp, editors,Forty-first International Conference on M...

  23. [23]

Natural questions: A benchmark for question answering research

Tom Kwiatkowski, Jennimaria Palomaki, Olivia Redfield, Michael Collins, Ankur Parikh, Chris Alberti, Danielle Epstein, Illia Polosukhin, Jacob Devlin, Kenton Lee, Kristina Toutanova, Llion Jones, Matthew Kelcey, Ming-Wei Chang, Andrew M. Dai, Jakob Uszkoreit, Quoc Le, and Slav Petrov. Natural questions: A benchmark for question answering research. Transact...

  24. [24]

Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems

    Guojian Li, Chengyou Wang, Hongfei Xue, Shuiyuan Wang, Dehui Gao, Zihan Zhang, Yuke Lin, Wenjie Li, Longshuai Xiao, Zhonghua Fu, and Lei Xie. Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems.CoRR, abs/2509.23938, 2025

  25. [25]

    Baichuan-audio: A unified framework for end-to-end speech interaction

    Tianpeng Li, Jun Liu, Tao Zhang, Yuanbo Fang, Da Pan, Mingrui Wang, Zheng Liang, Zehuan Li, Mingan Lin, Guosheng Dong, Jianhua Xu, Haoze Sun, Zenan Zhou, and Weipeng Chen. Baichuan-audio: A unified framework for end-to-end speech interaction.CoRR, abs/2502.17239, 2025

  26. [26]

AlpacaEval: An automatic evaluator of instruction-following models

Xuechen Li, Tianyi Zhang, Yann Dubois, Rohan Taori, Ishaan Gulrajani, Carlos Guestrin, Percy Liang, and Tatsunori B. Hashimoto. AlpacaEval: An automatic evaluator of instruction-following models. https://github.com/tatsu-lab/alpaca_eval, 5 2023

  27. [27]

Full-Duplex-Bench v1.5: Evaluating overlap handling for full-duplex speech models

Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, and Hung-yi Lee. Full-Duplex-Bench v1.5: Evaluating overlap handling for full-duplex speech models. arXiv preprint arXiv:2507.23159, 2025

  28. [28]

Full-Duplex-Bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities

Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H. Liu, and Hung-Yi Lee. Full-Duplex-Bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. In IEEE Automatic Speech Recognition and Understanding Workshop, ASRU 2025, Honolulu, HI, USA, December 6-10, 2025, pages 1–8. IEEE, 2025

  29. [29]

StreamChat: Chatting with streaming video

Jihao Liu, Zhiding Yu, Shiyi Lan, Shihao Wang, Rongyao Fang, Jan Kautz, Hongsheng Li, and José M. Álvarez. StreamChat: Chatting with streaming video. CoRR, abs/2412.08646, 2024

  30. [30]

    Decoupled weight decay regularization

    Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In7th International Conference on Learning Representations, ICLR 2019, New Orleans, LA, USA, May 6-9, 2019. OpenReview.net, 2019

  31. [31]

    Language model can listen while speaking

    Ziyang Ma, Yakun Song, Chenpeng Du, Jian Cong, Zhuo Chen, Yuping Wang, Yuxuan Wang, and Xie Chen. Language model can listen while speaking. InProceedings of the Thirty- Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and Fifteenth Symposium on Educational Advances in Arti...

  32. [32]

Real-time textless dialogue generation

    Long Mai and Julie Carson-Berndsen. Real-time textless dialogue generation.CoRR, abs/2501.04877, 2025

  33. [33]

    MS MARCO: A human generated machine reading comprehension dataset

    Tri Nguyen, Mir Rosenberg, Xia Song, Jianfeng Gao, Saurabh Tiwary, Rangan Majumder, and Li Deng. MS MARCO: A human generated machine reading comprehension dataset. In Tarek Richard Besold, Antoine Bordes, Artur S. d’Avila Garcez, and Greg Wayne, editors, Proceedings of the Workshop on Cognitive Computation: Integrating neural and symbolic approaches 2016 ...

  34. [34]

Generative spoken dialogue language modeling

Tu Anh Nguyen, Eugene Kharitonov, Jade Copet, Yossi Adi, Wei-Ning Hsu, Ali Elkahky, Paden Tomasello, Robin Algayres, Benoît Sagot, Abdelrahman Mohamed, and Emmanuel Dupoux. Generative spoken dialogue language modeling. Trans. Assoc. Comput. Linguistics, 11:250–266, 2023

  35. [35]

    GPT-4o System Card

    OpenAI. Gpt-4o system card.CoRR, abs/2410.21276, 2024

  36. [36]

    Librispeech: An ASR corpus based on public domain audio books

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In2015 IEEE International Conference on Acoustics, Speech and Signal Processing, ICASSP 2015, South Brisbane, Queensland, Australia, April 19-24, 2015, pages 5206–5210. IEEE, 2015

  37. [37]

    MLS: A large-scale multilingual dataset for speech research

    Vineel Pratap, Qiantong Xu, Anuroop Sriram, Gabriel Synnaeve, and Ronan Collobert. MLS: A large-scale multilingual dataset for speech research. In Helen Meng, Bo Xu, and Thomas Fang Zheng, editors,21st Annual Conference of the International Speech Communication Association, Interspeech 2020, Virtual Event, Shanghai, China, October 25-29, 2020, pages 2757–...

  38. [38]

    Robust speech recognition via large-scale weak supervision

    Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett, editors,International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii...

  39. [39]

    SQuAD: 100,000+ questions for machine comprehension of text

    Pranav Rajpurkar, Jian Zhang, Konstantin Lopyrev, and Percy Liang. SQuAD: 100,000+ questions for machine comprehension of text. In Jian Su, Kevin Duh, and Xavier Carreras, editors,Proceedings of the 2016 Conference on Empirical Methods in Natural Language Processing, pages 2383–2392, Austin, Texas, November 2016. Association for Computational Linguistics

  40. [40]

Overlapping talk and the organization of turn-taking for conversation

Emanuel A. Schegloff. Overlapping talk and the organization of turn-taking for conversation. Language in Society, 29(1):1–63, 2000

  41. [41]

RoFormer: Enhanced transformer with rotary position embedding

Jianlin Su, Murtadha H. M. Ahmed, Yu Lu, Shengfeng Pan, Wen Bo, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing, 568:127063, 2024

  42. [42]

    Drvoice: Parallel speech-text voice conversation model via dual-resolution speech representations

    Chao-Hong Tan, Qian Chen, Wen Wang, Chong Deng, Qinglin Zhang, Luyao Cheng, Hai Yu, Xin Zhang, Xiang Lyu, Tianyu Zhao, Chong Zhang, Yukun Ma, Yafeng Chen, Hui Wang, Jiaqing Liu, Xiangang Li, and Jieping Ye. Drvoice: Parallel speech-text voice conversation model via dual-resolution speech representations. InThe Fourteenth International Conference on Learni...

  43. [43]

Step-audio 2 technical report

    StepFun Audio Team. Step-audio 2 technical report.CoRR, abs/2507.16632, 2025

  44. [44]

Fun-audio-chat technical report

    Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao- Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, Qinglin Zhang, Qiquan Zhang, and Jingren Zhou. Fun-audio-chat technical report.CoRR, abs/2512.20156, 2025

  45. [45]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Isabelle Guyon, Ulrike von Luxburg, Samy Bengio, Hanna M. Wallach, Rob Fergus, S. V. N. Vishwanathan, and Roman Garnett, editors, Advances in Neural Information Processing Systems 30: Annual Conference...

  46. [46]

    Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents

    Bandhav Veluri, Benjamin N Peloquin, Bokai Yu, Hongyu Gong, and Shyamnath Gollakota. Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 21390–21402, Miami, Florida, USA, Nov...

  47. [47]

    VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

    Changhan Wang, Morgane Riviere, Ann Lee, Anne Wu, Chaitanya Talnikar, Daniel Haziza, Mary Williamson, Juan Pino, and Emmanuel Dupoux. VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation. In Chengqing Zong, Fei Xia, Wenjie Li, and Roberto Navigli, editors,Proceedings of the 59th Annua...

  48. [48]

    NTPP: generative speech language modeling for dual-channel spoken dialogue via next-token-pair prediction

Qichao Wang, Ziqiao Meng, Wenqian Cui, Yifei Zhang, Pengcheng Wu, Bingzhe Wu, Irwin King, Liang Chen, and Peilin Zhao. NTPP: generative speech language modeling for dual-channel spoken dialogue via next-token-pair prediction. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, ed...

  49. [49]

Spark-TTS: An efficient LLM-based text-to-speech model with single-stream decoupled speech tokens

    Xinsheng Wang, Mingqi Jiang, Ziyang Ma, Ziyu Zhang, Songxiang Liu, Linqin Li, Zheng Liang, Qixi Zheng, Rui Wang, Xiaoqin Feng, Weizhen Bian, Zhen Ye, Sitong Cheng, Ruibin Yuan, Zhixian Zhao, Xinfa Zhu, Jiahao Pan, Liumeng Xue, Pengcheng Zhu, Yunlin Chen, Zhifei Li, Xie Chen, Lei Xie, Yike Guo, and Wei Xue. Spark-tts: An efficient llm-based text-to-speech ...

  50. [50]

    Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen LLM

    Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen LLM. In Aarti Singh, Maryam Fazel, Daniel Hsu, Simon Lacoste-Julien, Felix Berkenkamp, Tegan Maharaj, Kiri Wagstaff, and Jerry Zhu, editors,Forty-second International Conferenc...

  51. [51]

Mini-Omni2: Towards open-source GPT-4o with vision, speech and duplex capabilities

Zhifei Xie and Changqiao Wu. Mini-Omni2: Towards open-source GPT-4o with vision, speech and duplex capabilities. CoRR, abs/2410.11190, 2024

  52. [52]

    Qwen2.5-Omni Technical Report

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, Bin Zhang, Xiong Wang, Yunfei Chu, and Junyang Lin. Qwen2.5-omni technical report.CoRR, abs/2503.20215, 2025

  53. [53]

HotpotQA: A dataset for diverse, explainable multi-hop question answering

Zhilin Yang, Peng Qi, Saizheng Zhang, Yoshua Bengio, William Cohen, Ruslan Salakhutdinov, and Christopher D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun’ichi Tsujii, editors, Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processi...

  54. [54]

FLM-Audio: Natural monologues improves native full-duplex chatbots via dual training

    Yiqun Yao, Xiang Li, Xin Jiang, Xuezhi Fang, Naitong Yu, Wenjia Ma, Aixin Sun, and Yequan Wang. Flm-audio: Natural monologues improves native full-duplex chatbots via dual training. CoRR, abs/2509.02521, 2025

  55. [55]

SALMONN-omni: A standalone speech LLM without codec injection for full-duplex conversation

    Wenyi Yu, Siyin Wang, Xiaoyu Yang, Xianzhao Chen, Xiaohai Tian, Jun Zhang, Guangzhi Sun, Lu Lu, Yuxuan Wang, and Chao Zhang. Salmonn-omni: A standalone speech LLM without codec injection for full-duplex conversation.CoRR, abs/2505.17060, 2025

  56. [56]

GLM-4-Voice: Towards intelligent and human-like end-to-end spoken chatbot

    Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.CoRR, abs/2412.02612, 2024

  57. [57]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities

    Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities. In Houda Bouamor, Juan Pino, and Kalika Bali, editors,Findings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773, Singapore, December 2023. Asso...

  58. [58]

    STEER: semantic turn extension-expansion recognition for voice assistants

Leon Liyang Zhang, Jiarui Lu, Joel Ruben Antony Moniz, Aditya Kulkarni, Dhivya Piraviperumal, Tien Dung Tran, Nick Tzou, and Hong Yu. STEER: semantic turn extension-expansion recognition for voice assistants. In Mingxuan Wang and Imed Zitouni, editors, Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing: EMNLP 2023 - In...

  59. [59]

    OmniFlatten: An end-to-end GPT model for seamless voice conversation

    Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chao-Hong Tan, Zhihao Du, and ShiLiang Zhang. OmniFlatten: An end-to-end GPT model for seamless voice conversation. In Wanxiang Che, Joyce Nabende, Ekaterina Shutova, and Mohammad Taher Pilehvar, editors,Proceedings of the 63rd Annual Meeting of the Association f...

  60. [60]

    Beyond the turn-based game: Enabling real-time conversations with duplex models

    Xinrong Zhang, Yingfa Chen, Shengding Hu, Xu Han, Zihang Xu, Yuanwei Xu, Weilin Zhao, Maosong Sun, and Zhiyuan Liu. Beyond the turn-based game: Enabling real-time conversations with duplex models. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen, editors,Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 1154...

  61. [61]

IndexTTS2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech

Siyi Zhou, Yiquan Zhou, Yi He, Xun Zhou, Jinchao Wang, Wei Deng, and Jingchen Shu. IndexTTS2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech. In Sven Koenig, Chad Jenkins, and Matthew E. Taylor, editors, Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative App...