A Survey of Full-Duplex Spoken Dialogue Systems: Architectural Hierarchy, Interaction Ontology, and Decision State Machine

Changhao Pan; Chen Ye; Chenyuhao Wen; Guanjun Jiang; Haoxiao Wang; Jianming Luo; Jian Wu; Jingyu Lu; Shengpeng Ji; Tianle Liang

arxiv: 2606.19453 · v1 · pith:KF4MOE2Fnew · submitted 2026-06-17 · 📡 eess.AS


Jingyu Lu, Yuhan Wang, Jianming Luo, Yifu Chen, Tianle Liang, Shengpeng Ji, Ziyue Jiang, Xiaoda Yang, Yu Zhang, Xize Cheng, Chenyuhao Wen, Changhao Pan, Haoxiao Wang, Chen Ye, Jian Wu, Xiaoxi Jiang, Guanjun Jiang, Zhou Zhao


Pith reviewed 2026-06-26 19:00 UTC · model grok-4.3

classification 📡 eess.AS

keywords full-duplex · spoken dialogue systems · architectural hierarchy · interaction ontology · decision state machine · realization gap · training data coverage


The pith

Full-duplex spoken dialogue systems remain constrained by training interaction patterns despite architectural potential for duplex states.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The term full-duplex has been applied inconsistently across spoken dialogue systems, creating ambiguity about actual capabilities. This survey introduces an L0-L3 Architectural Hierarchy locating where duplex decisions are made, a T×I×R Interaction Ontology specifying temporal relations, user intents, and required responses, and a Decision State Machine with states IDLE, LISTEN, SPEAK, WAIT, and DUAL to track moment-by-moment behavior. An audit of published systems and benchmarks documents a realization gap: many architectures can operate in full-duplex states in principle, yet observed behavior stays limited by the interaction patterns present in their training and evaluation data. Sympathetic readers would care because the frameworks give builders concrete distinctions to compare systems, while the gap points to concrete barriers in advancing the field.

Core claim

The paper establishes that much of the ambiguity around full-duplex claims stems from taxonomical shortcomings in current terminology. It introduces three complementary frameworks—an L0-L3 Architectural Hierarchy, a T×I×R Interaction Ontology, and a five-state Decision State Machine—to specify decision location, supported interaction types, and state transitions. The audit across published systems shows that although many architectures can in principle operate in full-duplex states, their observed behavior remains constrained by the interaction patterns represented in training and evaluation. It identifies limited public training-data coverage relative to industrial corpora and the unrealize

What carries the argument

The L0-L3 Architectural Hierarchy, T×I×R Interaction Ontology, and Decision State Machine (IDLE/LISTEN/SPEAK/WAIT/DUAL) that locate duplex decisions, classify interactions by temporal relation, intent, and response, and describe state transitions.
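As a minimal sketch of how the five-state Decision State Machine might be encoded, the snippet below uses the state names given in the paper. The transition table is an assumption made purely for illustration: it is a hypothetical subset, not the eleven transitions the paper defines, and the helper names `step` and `run` are likewise invented here.

```python
from enum import Enum


class State(Enum):
    IDLE = "IDLE"
    LISTEN = "LISTEN"
    SPEAK = "SPEAK"
    WAIT = "WAIT"
    DUAL = "DUAL"


# Hypothetical transitions for illustration only; NOT the paper's eleven.
ALLOWED = {
    (State.IDLE, State.LISTEN),
    (State.LISTEN, State.WAIT),
    (State.LISTEN, State.SPEAK),
    (State.WAIT, State.SPEAK),
    (State.SPEAK, State.DUAL),
    (State.DUAL, State.LISTEN),
    (State.DUAL, State.SPEAK),
    (State.SPEAK, State.IDLE),
}


def step(current: State, target: State) -> State:
    """Move to target if the (assumed) table permits it, else raise."""
    if (current, target) not in ALLOWED:
        raise ValueError(f"transition {current.name} -> {target.name} not in table")
    return target


def run(trace):
    """Replay a sequence of target states, starting from IDLE."""
    state = State.IDLE
    for target in trace:
        state = step(state, target)
    return state
```

Under these assumed transitions, `run([State.LISTEN, State.SPEAK, State.DUAL, State.LISTEN])` ends in `State.LISTEN`, while a direct jump from IDLE to DUAL is rejected. Keeping the table as explicit data makes it easy to swap in the paper's actual transition set.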

If this is right

Builders can locate duplex decisions using the L0-L3 hierarchy and classify interactions via the T×I×R ontology.
Architectures capable of full-duplex states in principle are still limited by training and evaluation patterns.
Public training data coverage must expand to match industrial corpora to close the realization gap.
Achieving L3 representation-level modeling is required to move beyond current constraints.
Benchmarks need broader coverage of interaction types to evaluate true full-duplex capability.
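One way to make "coverage of interaction types" concrete is to treat the T×I×R ontology as a grid of cells and count how many cells a dataset or benchmark actually exercises. The sketch below assumes placeholder axis values; the category names in `T`, `I`, and `R` are hypothetical and may not match the paper's actual categories, and `coverage` is an illustrative helper rather than the survey's audit procedure.

```python
from itertools import product

# Hypothetical axis values; the paper's actual T, I, R categories may differ.
T = ("non_overlap", "overlap")
I = ("backchannel", "interrupt", "continue")
R = ("keep_speaking", "yield", "respond")


def coverage(samples):
    """Return (fraction of T x I x R cells seen, sorted list of unseen cells)."""
    cells = set(product(T, I, R))
    # Ignore samples that fall outside the assumed grid.
    seen = {s for s in samples if s in cells}
    missing = sorted(cells - seen)
    return len(seen) / len(cells), missing


samples = [
    ("non_overlap", "continue", "respond"),
    ("overlap", "interrupt", "yield"),
    ("overlap", "backchannel", "keep_speaking"),
    ("overlap", "interrupt", "yield"),  # duplicate, counted once
]
fraction, missing = coverage(samples)
```

With these assumed axes the grid has 18 cells, so the four samples (three distinct) cover 3 of 18. The list of missing cells is the more useful output, since it names the interaction types a corpus never shows a model.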

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying the state machine to industrial systems could reveal whether undisclosed data already supports wider patterns.
Standardizing evaluation around the T×I×R ontology might allow direct comparison between academic and commercial dialogue systems.
If public datasets grow to include more L3-level examples, the documented gap between architectural potential and observed behavior could narrow measurably.
The frameworks might extend to non-spoken dialogue domains where similar state-transition and intent modeling questions arise.
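The realization gap discussed above can be given a crude operational reading: the set of states an architecture could reach in principle, minus the states that actually appear in observed behavior. The sketch below is one such reading under stated assumptions; `realization_gap` and the example traces are hypothetical and do not reproduce the paper's audit method or any system's real logs.

```python
def realization_gap(reachable, traces):
    """States an architecture could enter but that never occur in observed traces."""
    observed = {state for trace in traces for state in trace}
    return set(reachable) - observed


# Assumed: the architecture can reach all five states named in the paper.
reachable = {"IDLE", "LISTEN", "SPEAK", "WAIT", "DUAL"}

# Invented traces standing in for logged or evaluated conversations.
traces = [
    ["IDLE", "LISTEN", "SPEAK"],
    ["LISTEN", "WAIT", "SPEAK"],
]

gap = realization_gap(reachable, traces)
```

Here `gap` is `{"DUAL"}`: the state is architecturally available but never exercised. A measure like this only says that a state is absent from the data examined, not why it is absent.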

Load-bearing premise

The newly introduced L0-L3 hierarchy, T×I×R ontology, and five-state decision machine capture the distinctions that matter most for builders and the audit across published systems accurately reflects the current state of the field without selection bias.

What would settle it

Finding multiple published systems that demonstrate L3 representation-level modeling or support interaction patterns absent from current public training data would challenge the claimed realization gap.

Figures

Figures reproduced from arXiv: 2606.19453 by Changhao Pan, Chen Ye, Chenyuhao Wen, Guanjun Jiang, Haoxiao Wang, Jianming Luo, Jian Wu, Jingyu Lu, Shengpeng Ji, Tianle Liang, Xiaoda Yang, Xiaoxi Jiang, Xize Cheng, Yifu Chen, Yuhan Wang, Yu Zhang, Zhou Zhao, Ziyue Jiang.

**Figure 2.** Timeline of published full-duplex spoken dialogue systems, 2021–2026, grouped by the L0– …

**Figure 3.** The L0–L3 architectural hierarchy. The red …

**Figure 4.** Six canonical full-duplex interaction scenarios (referenced from §5). Each panel concretizes …

**Figure 5.** The full-duplex decision state machine: five states and eleven transitions. Each transition …


More than a dozen spoken dialogue systems have recently claimed to be "full-duplex," yet the term has been used to describe substantially different capabilities. Existing surveys collapse them onto a single axis (cascaded/end-to-end, or engineered/learned) and miss the distinctions that matter most for builders. We argue that much of this ambiguity is taxonomical: current terminology does not specify where duplex decisions are made, which interaction types are supported, or how a system behaves moment by moment. This paper introduces three complementary frameworks: (i) an L0-L3 Architectural Hierarchy that locates where duplex decisions are made; (ii) a $T\times I\times R$ Interaction Ontology that specifies the temporal relation, user intent, and required system response for each interaction; and (iii) a Decision State Machine (IDLE/LISTEN/SPEAK/WAIT/DUAL) that describes how systems move between states. Across published systems and benchmarks, our audit documents a realization gap: although many architectures can in principle operate in full-duplex states, their observed behavior remains constrained by the interaction patterns represented in training and evaluation. We point to the limited public training-data coverage relative to the (largely undisclosed) industrial corpora, together with the still-unrealized goal of L3 representation-level modeling, as the key frontiers for future research on full-duplex dialogue. The related material is available at https://github.com/DuplexLM/DuplexSurvey.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This survey introduces three practical taxonomies for full-duplex spoken dialogue that go beyond prior single-axis splits and uses them to flag a data-driven realization gap.


The main point for you is that the authors put forward three concrete frameworks—an L0-L3 architectural hierarchy, a T×I×R interaction ontology, and a five-state decision machine—to replace the usual cascaded/end-to-end or engineered/learned distinctions. They then audit published systems against these and conclude that the main bottleneck is training and evaluation data coverage rather than raw architecture.

What stands out is the effort to make the categories builder-relevant. The L0-L3 levels locate where duplex decisions happen, the ontology breaks down temporal, intent, and response requirements, and the state machine tracks moment-to-moment behavior. The audit shows that even architectures that can in principle enter full-duplex states such as DUAL are still limited by what their training and evaluation data actually contain, while L3 representation-level modeling remains an unrealized goal. The GitHub repo with the survey material is a small but real plus for anyone who wants to check the mapping themselves.

The soft spots are predictable for a taxonomy paper. There is no test of whether these frameworks actually improve design decisions or system performance; the realization-gap claim rests on the selected examples, and selection bias is possible. The paper does not quantify how much industrial data differs from public sets or run any controlled comparison. These are limitations of scope, not fatal errors.

This is useful for researchers already working on spoken dialogue who need sharper language for full-duplex capabilities. A reader building or evaluating systems would get the most out of it. It is organized enough and the distinctions are clear enough that it deserves a serious referee rather than a desk reject. I would send it out for review.

Referee Report

0 major / 3 minor

Summary. The manuscript is a survey of spoken dialogue systems claiming full-duplex operation. It argues that existing terminology (e.g., cascaded/end-to-end) collapses important distinctions and introduces three frameworks: an L0-L3 Architectural Hierarchy locating where duplex decisions occur, a T×I×R Interaction Ontology specifying temporal relation, user intent, and required response, and a five-state Decision State Machine (IDLE/LISTEN/SPEAK/WAIT/DUAL) describing moment-by-moment behavior. An audit of published systems and benchmarks documents a realization gap in which architectural capacity exceeds observed full-duplex behavior, attributing the gap primarily to limited public training-data coverage relative to industrial corpora and the unrealized goal of L3 representation-level modeling.

Significance. If the proposed frameworks organize the literature more usefully than prior single-axis taxonomies, the survey supplies a shared vocabulary that can reduce ambiguity for system builders and reviewers. The audit's emphasis on data coverage as the binding constraint (rather than architectural limits) supplies a concrete research direction. The GitHub repository of related material is a positive contribution to survey transparency.

minor comments (3)

[Abstract] The phrase 'more than a dozen' systems is used without a later explicit count or table of reviewed systems; adding a summary table or count in the audit section would allow readers to assess coverage directly.
The T×I×R ontology and L0-L3 hierarchy are introduced as complementary, yet the manuscript does not include a worked example mapping a single published system through all three frameworks side-by-side; such an example would strengthen the claim that the frameworks jointly clarify distinctions.
The GitHub link is given only in the abstract; a persistent citation or footnote in the main text (e.g., in the introduction or conclusion) would improve accessibility.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive assessment of the manuscript, the recognition that the proposed frameworks can supply a shared vocabulary for the community, and the recommendation to accept. We also appreciate the note that the GitHub repository contributes to survey transparency.

Circularity Check

0 steps flagged

No significant circularity; pure survey with no derivations


This is a survey paper that introduces three new organizational frameworks (L0-L3 Architectural Hierarchy, T×I×R Interaction Ontology, and a five-state Decision State Machine) to clarify terminology in full-duplex spoken dialogue systems. It performs an illustrative audit of published systems but contains no equations, derivations, fitted parameters, predictions, or load-bearing self-citations that reduce to inputs by construction. The central contribution is explicitly taxonomical and organizational, with no deductive chain that could exhibit self-definitional, fitted-input, or uniqueness-imported circularity. The paper is self-contained as a descriptive survey against external benchmarks of prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 3 invented entities

The central claims rest on domain assumptions about inconsistent use of the full-duplex term and on the utility of the three newly introduced taxonomical frameworks. No numerical free parameters are present. The frameworks themselves function as invented classification entities without independent falsifiable evidence outside the survey.

axioms (1)

domain assumption: Existing terminology for full-duplex systems does not specify where duplex decisions are made, which interaction types are supported, or moment-by-moment behavior.
Directly stated in the abstract as the motivation for introducing new frameworks.

invented entities (3)

L0-L3 Architectural Hierarchy (no independent evidence)
purpose: Locates where duplex decisions are made across system architectures.
Newly introduced in the paper as one of the three complementary frameworks.
T×I×R Interaction Ontology (no independent evidence)
purpose: Specifies temporal relation, user intent, and required system response for interactions.
Newly introduced in the paper as one of the three complementary frameworks.
Decision State Machine (IDLE/LISTEN/SPEAK/WAIT/DUAL) (no independent evidence)
purpose: Describes transitions between conversational states in full-duplex systems.
Newly introduced in the paper as one of the three complementary frameworks.




[1] [1]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez, Laurent Mazaré, Manu Orsini, Amélie Royer, Patrick Pérez, Hervé Jégou, Edouard Grave, and Neil Zeghidour. Moshi: a speech-text foundation model for real-time dialogue. arXiv preprint arXiv:2410.00037,

Pith/arXiv arXiv

[2] [2]

Minmo: A multimodal large language model for seamless voice interaction.arXiv preprint arXiv:2501.06282, 2025a

Qian Chen, Yafeng Chen, Yanni Chen, Mengzhe Chen, Yingda Chen, Chong Deng, Zhihao Du, Ruize Gao, Changfeng Gao, Zhifu Gao, et al. Minmo: A multimodal large language model for seamless voice interaction.arXiv preprint arXiv:2501.06282, 2025a. Qinglin Zhang, Luyao Cheng, Chong Deng, Qian Chen, Wen Wang, Siqi Zheng, Jiaqing Liu, Hai Yu, Chao-Hong Tan, Zhihao...

arXiv 2024

[3] [3]

Fireredchat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations.arXiv preprint arXiv:2509.06502, 2025b

Junjie Chen, Yao Hu, Junjie Li, Kangyue Li, Kun Liu, Wenpeng Li, Xu Li, Ziyuan Li, Feiyu Shen, Xu Tang, et al. Fireredchat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations.arXiv preprint arXiv:2509.06502, 2025b. Shengpeng Ji, Yifu Chen, Minghui Fang, Jialong Zuo, Jingyu Lu, Hanting Wang, Ziyue Jiang, Long ...

arXiv

[4] [4]

From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models.arXiv preprint arXiv:2509.14515,

Yuxuan Chen and Haoyuan Yu. From turn-taking to synchronous dialogue: A survey of full-duplex spoken language models.arXiv preprint arXiv:2509.14515,

arXiv

[5] [5]

A full-duplex speech dialogue scheme based on large language model.Advances in Neural Information Processing Systems, 37:13372–13403, 2024a

29 Peng Wang, Songshuo Lu, Yaohua Tang, Sijie Yan, Wei Xia, and Yuanjun Xiong. A full-duplex speech dialogue scheme based on large language model.Advances in Neural Information Processing Systems, 37:13372–13403, 2024a. Borui Liao, Yulong Xu, Jiao Ou, Kaiyuan Yang, Weihua Jian, Pengfei Wan, and Di Zhang. Flexduo: A pluggable system for enabling full-duple...

arXiv

[6] [6]

Mini-omni: Language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725, 2024a

Zhifei Xie and Changqiao Wu. Mini-omni: Language models can hear, talk while thinking in streaming.arXiv preprint arXiv:2408.16725, 2024a. Aohan Zeng, Zhengxiao Du, Mingdao Liu, Kedong Wang, Shengmin Jiang, Lei Zhao, Yuxiao Dong, and Jie Tang. Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot.arXiv preprint arXiv:2412.02612,

arXiv

[7] [7]

Llama-omni: Seamless speech interaction with large language models

Qingkai Fang, Shoutao Guo, Yan Zhou, Zhengrui Ma, Shaolei Zhang, and Yang Feng. Llama-omni: Seamless speech interaction with large language models. InInternational Conference on Learning Representations, volume 2025, pages 57607–57624,

2025

[8] [8]

Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities

Dong Zhang, Shimin Li, Xin Zhang, Jun Zhan, Pengyu Wang, Yaqian Zhou, and Xipeng Qiu. Speechgpt: Empowering large language models with intrinsic cross-modal conversational abilities. InFindings of the Association for Computational Linguistics: EMNLP 2023, pages 15757–15773,

2023

[9] [9]

High fidelity neural audio compression.arXiv preprint arXiv:2210.13438,

Alexandre Défossez, Jade Copet, Gabriel Synnaeve, and Yossi Adi. High fidelity neural audio compression.arXiv preprint arXiv:2210.13438,

Pith/arXiv arXiv

[10] [10]

Neural codec language models are zero-shot text to speech synthesizers.arXiv preprint arXiv:2301.02111,

Chengyi Wang, Sanyuan Chen, Yu Wu, Ziqiang Zhang, Long Zhou, Shujie Liu, Zhuo Chen, Yanqing Liu, Huaming Wang, Jinyu Li, et al. Neural codec language models are zero-shot text to speech synthesizers.arXiv preprint arXiv:2301.02111,

Pith/arXiv arXiv

[11] Xin Zhang, Dong Zhang, Shimin Li, Yaqian Zhou, and Xipeng Qiu. Speechtokenizer: Unified speech tokenizer for speech language models. In International Conference on Learning Representations, volume 2024, pages 31798–31818, 2024a.

Zhihao Du, Qian Chen, Shiliang Zhang, Kai Hu, Heng Lu, Yexin Yang, Hangrui Hu, Siqi Zheng, Yue Gu, Ziyang Ma, et al. Cosyvoice: A...

[12] Shengpeng Ji, Ziyue Jiang, Wen Wang, Yifu Chen, Minghui Fang, Jialong Zuo, Qian Yang, Xize Cheng, Ruiqi Li, Ziang Zhang, et al. Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling. In International Conference on Learning Representations, volume 2025, pages 93809–93826, 2025.

[13] Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report. arXiv preprint arXiv:2504.18425.

[14] Zhifei Xie and Changqiao Wu. Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities. arXiv preprint arXiv:2410.11190, 2024b.

Zhanxun Liu, Yifan Duan, Mengmeng Wang, Pengchao Feng, Haotian Zhang, Xiaoyu Xing, Yijia Shan, Haina Zhu, Yuhang Dai, Chaochao Lu, et al. X-talk: On the underestimated potential of modular speech-to-speech...

[15] Tongyi Fun Team, Qian Chen, Luyao Cheng, Chong Deng, Xiangang Li, Jiaqing Liu, Chao-Hong Tan, Wen Wang, Junhao Xu, Jieping Ye, et al. Fun-audio-chat technical report. arXiv preprint arXiv:2512.20156.

[16] Wenfu Wang, Chenxing Li, Liqiang Zhang, Yiyang Zhao, Yuxiang Zou, Hanzhao Li, Mingyu Cui, Hao Zhang, Kun Wei, Le Xu, et al. Covo-audio technical report. arXiv preprint arXiv:2602.09823, 2026a.

Donghang Wu, Haoyang Zhang, Jun Chen, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu, et al. Mind-paced speaking: A dual-brain...

[17] Ruiqi Yan, Wenxi Chen, Zhanxun Liu, Ziyang Ma, Haopeng Lin, Hanlin Wen, Hanke Xie, Jun Wu, Yuzhe Liang, Yuxiang Zhao, et al. Soulx-duplug: Plug-and-play streaming state prediction module for realtime full-duplex speech conversation. arXiv preprint arXiv:2603.14877.

[18] Chengyou Wang, Hongfei Xue, Chunjiang He, Jingbin Hu, Shuiyuan Wang, Bo Wu, Yuyu Ji, Jimeng Zheng, Ruofei Chen, Zhou Zhu, et al. Fastturn: Unifying acoustic and streaming semantic cues for low-latency and robust turn detection. arXiv preprint arXiv:2604.01897, 2026b.

Xiangyu Lu, Wang Xu, Haoyu Wang, Hongyun Zhou, Haiyan Zhao, Conghui Zhu, Tiejun Zhao, a...

[19] URL https://arxiv.org/abs/2503.20215. Qwen Team. Qwen3.5-omni technical report. arXiv preprint arXiv:2604.15804.

[20] Xiong Wang, Yangze Li, Chaoyou Fu, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long Ma. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm. arXiv preprint arXiv:2411.00774, 2024b.

Rongjie Huang, Mingze Li, Dongchao Yang, Jiatong Shi, Xuankai Chang, Zhenhui Ye, Yuning Wu, Zhiqing Hong, Jiawei Huang, Jinglin Liu, et al. Audi...

[21] Yunfei Chu, Jin Xu, Qian Yang, Haojie Wei, Xipin Wei, Zhifang Guo, Yichong Leng, Yuanjun Lv, Jinzheng He, Junyang Lin, et al. Qwen2-audio technical report. arXiv preprint arXiv:2407.10759.

[22] Ailin Huang, Boyong Wu, Bruce Wang, Chao Yan, Chen Hu, Chengli Feng, Fei Tian, Feiyu Shen, Jingbei Li, Mingrui Chen, et al. Step-audio: Unified understanding and generation in intelligent speech interaction. arXiv preprint arXiv:2502.11946.

[23] URL https://openai.com/index/hello-gpt-4o/. OpenAI announcement, accessed 2026-05-18.

Xinrong Zhang, Yingfa Chen, Shengding Hu, Xu Han, Zihang Xu, Yuanwei Xu, Weilin Zhao, Maosong Sun, and Zhiyuan Liu. Beyond the turn-based game: Enabling real-time conversations with duplex models. In Proceedings of the 2024 Conference on Empirical Methods in Natural Lan...

[24] Chung-Ming Chien, Manu Orsini, Eugene Kharitonov, Neil Zeghidour, Karen Livescu, and Alexandre Défossez. Moshirag: Asynchronous knowledge retrieval for full-duplex speech language models. arXiv preprint arXiv:2604.12928.

[25] Guojian Li, Chengyou Wang, Hongfei Xue, Shuiyuan Wang, Dehui Gao, Zihan Zhang, Yuke Lin, Wenjie Li, Longshuai Xiao, Zhonghua Fu, et al. Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)... 2026.

[26] Guan-Ting Lin, Shih-Yun Shan Kuan, Qirui Wang, Jiachen Lian, Tingle Li, Shinji Watanabe, and Hung-yi Lee. Full-duplex-bench v1.5: Evaluating overlap handling for full-duplex speech models. In ICASSP 2026-2026 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 19447–19451. IEEE, 2026a.

Christopher Cieri, David Mille...

[27] John J Godfrey, Edward C Holliman, and Jane McDaniel. Switchboard: Telephone speech corpus for research and development. In [Proceedings] ICASSP-92: 1992 IEEE International Conference on Acoustics, Speech, and Signal Processing, volume 1, pages 517–520. IEEE, 1992.

[28] Zehui Yang, Yifan Chen, Lei Luo, Runyan Yang, Lingxuan Ye, Gaofeng Cheng, Ji Xu, Yaohui Jin, Qingqing Zhang, Pengyuan Zhang, et al. Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset. arXiv preprint arXiv:2203.16844.

[29] Shuiyuan Wang, Zhixian Zhao, Hongfei Xue, Chengyou Wang, Shuai Wang, Hui Bu, Xin Xu, and Lei Xie. Humdial-eibench: A human-recorded multi-turn emotional intelligence benchmark for audio language models. arXiv preprint arXiv:2604.11594, 2026c.

Adam Janin, Don Baron, Jane Edwards, Dan Ellis, David Gelbart, Nelson Morgan, Barbara Peskin, Thilo Pfau, Elizabe...

[30] Yifu Chen, Shengpeng Ji, Zhengqing Liu, Qian Chen, Wen Wang, Ziqing Wang, Yangzhuo Li, Tianle Liang, and Zhou Zhao. Dual-axis generative reward model toward semantic and turn-taking robustness in interactive spoken dialogue models. arXiv preprint arXiv:2604.14920.

[31] Guan-Ting Lin, Jiachen Lian, Tingle Li, Qirui Wang, Gopala Anumanchipalli, Alexander H Liu, and Hung-yi Lee. Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities. arXiv preprint arXiv:2503.04721, 2025a.

Guan-Ting Lin, Shih-Yun Shan Kuan, Jiatong Shi, Kai-Wei Chang, Siddhant Arora, Shinji Watanabe, and Hu...

[32] Siddhant Arora, Zhiyun Lu, Chung-Cheng Chiu, Ruoming Pang, and Shinji Watanabe. Talking turns: Benchmarking audio foundation models on turn-taking dynamics. arXiv preprint arXiv:2503.01174.

[33] He Zhang, Wenqian Cui, Haoning Xu, Xiaohui Li, Lei Zhu, Haoli Bai, Shaohua Ma, and Irwin King. Mtr-duplexbench: Towards a comprehensive evaluation of multi-round conversations for full-duplex speech language models. arXiv preprint arXiv:2511.10262, 2025b.

Tianhong Li, Yonglong Tian, He Li, Mingyang Deng, and Kaiming He. Autoregressive image generation with...

[34] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling. arXiv preprint arXiv:2210.02747.

[35] Adrien Bardes, Quentin Garrido, Jean Ponce, Xinlei Chen, Michael Rabbat, Yann LeCun, Mahmoud Assran, and Nicolas Ballas. Revisiting feature prediction for learning visual representations from video. arXiv preprint arXiv:2404.08471.