arxiv: 2604.01897 · v4 · submitted 2026-04-02 · 💻 cs.SD · eess.AS

Recognition: no theorem link

FastTurn: Unifying Acoustic and Streaming Semantic Cues for Low-Latency and Robust Turn Detection

Bo Wu, Chengyou Wang, Chunjiang He, Hongfei Xue, Jimeng Zheng, Jingbin Hu, Lei Xie, Ruofei Chen, Shuiyuan Wang, Yuyu Ji, Zhou Zhu

Pith reviewed 2026-05-13 20:52 UTC · model grok-4.3

classification 💻 cs.SD eess.AS

keywords turn detectionfull-duplex dialoguestreaming CTCacoustic featureslow-latency interactionspoken dialogue systemsoverlapping speechreal-time turn taking

0 comments

The pith

FastTurn detects conversation turns faster by fusing partial semantic cues from streaming speech recognition with acoustic features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents FastTurn as a unified approach for turn detection in full-duplex spoken dialogue systems, where an agent must decide in real time whether to speak, yield, or interrupt while the user is still talking. It combines streaming CTC decoding outputs with acoustic features to extract early semantic information from incomplete speech segments, aiming to reduce latency compared to full ASR pipelines while avoiding the semantic blindness of pure voice-activity methods. The framework is evaluated on a newly released test set drawn from authentic human dialogues that include overlapping speech, backchannels, pauses, pitch changes, and environmental noise. If the fusion of partial observations works reliably, the result is higher decision accuracy at lower interruption latency than representative baselines, even under challenging acoustic conditions.

Core claim

FastTurn unifies acoustic features with streaming CTC outputs to enable low-latency turn detection by making decisions from partial semantic cues, achieving higher decision accuracy and lower interruption latency than baselines while remaining robust under noisy and overlapping speech conditions on a new real-dialogue test set.

What carries the argument

The fusion of partial CTC decoding outputs with acoustic features to extract early semantic cues for turn detection decisions.

If this is right

Enables real-time full-duplex dialogue agents to respond without waiting for complete transcriptions.
Maintains robustness when speech overlaps or background noise is present.
Supports practical deployment by providing a dedicated real-dialogue evaluation set.
Reduces the latency penalty typically introduced by separate ASR modules in turn-taking systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion pattern could extend to other streaming audio tasks that need early semantic context, such as real-time intent detection.
The released real-dialogue dataset may become a useful benchmark for comparing full-duplex methods beyond the original experiments.
Integration with larger end-to-end models might further improve cue reliability when partial observations are very short.

Load-bearing premise

That partial CTC outputs supply reliable early semantic cues without errors caused by incomplete speech observations.

What would settle it

A direct comparison on the real-dialogue test set in which FastTurn shows no reduction in interruption latency or no gain in decision accuracy relative to acoustic-only or full-ASR baselines under heavy noise or short utterances.

Figures

Figures reproduced from arXiv: 2604.01897 by Bo Wu, Chengyou Wang, Chunjiang He, Hongfei Xue, Jimeng Zheng, Jingbin Hu, Lei Xie, Ruofei Chen, Shuiyuan Wang, Yuyu Ji, Zhou Zhu.

**Figure 1.** Figure 1: Model architecture ness. In addition, we release the FastTurn test set 1 , specifically designed to capture authentic turn transitions and overlapping speech. The dataset includes challenging conversational phenomena such as backchannels, pauses, pitch variations, and environmental noise. Based on this evaluation set, we can systematically analyze interaction patterns prone to interruption errors, brid… view at source ↗

**Figure 2.** Figure 2: Training Strategy dimensional annotations and transcriptions, the test set serves as an important resource for research on dialogue coordination, interruption modeling, and full-duplex systems, aimed at accurately capturing turn transitions and interaction flow in natural dialogues [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

read the original abstract

Recent advances in AudioLLMs have enabled spoken dialogue systems to move beyond turn-based interaction toward real-time full-duplex communication, where the agent must decide when to speak, yield, or interrupt while the user is still talking. Existing full-duplex approaches either rely on voice activity cues, which lack semantic understanding, or on ASR-based modules, which introduce latency and degrade under overlapping speech and noise. Moreover, available datasets rarely capture realistic interaction dynamics, limiting evaluation and deployment. To mitigate the problem, we propose \textbf{FastTurn}, a unified framework for low-latency and robust turn detection. To advance latency while maintaining performance, FastTurn combines streaming CTC decoding with acoustic features, enabling early decisions from partial observations while preserving semantic cues. We also release a test set based on real human dialogue, capturing authentic turn transitions, overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments show FastTurn achieves higher decision accuracy with lower interruption latency than representative baselines and remains robust under challenging acoustic conditions, demonstrating its effectiveness for practical full-duplex dialogue systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

FastTurn's fusion of streaming CTC cues with acoustics plus the new real-dialogue test set is a useful practical step for full-duplex turn detection, and the full paper supports the claims with ablations and robustness checks.

read the letter

FastTurn combines streaming CTC semantic cues with acoustic features for quicker turn detection in spoken dialogue and ships a new test set drawn from real conversations with overlap, backchannels, and noise. The unification lets the system make early decisions from partial observations while keeping semantic signals, which addresses the latency problem in existing ASR-heavy or purely acoustic approaches. The paper spells out the architecture, including CTC prefix scoring and the fusion module, then runs ablations on each part, measures interruption latency directly, and tests under noise and overlapping speech. Those controls make the evaluation more believable than abstract-only claims. The test-set statistics on overlap rates and environmental conditions add concrete value for people who need realistic benchmarks. The advance is mostly in the integration and the dataset rather than new theory, but the experiments show measurable gains in accuracy and latency over the baselines they compare against. Soft spots are modest. The improvements read as incremental rather than large, and the test set, while better than synthetic data, still leaves room for questions about coverage of extreme cases like heavy accents or very long overlaps. The robustness tests help, but broader external validation would tighten that. No internal contradictions or unsupported leaps appear in the methods or results. This paper is for researchers and engineers working on real-time spoken dialogue systems who care about low latency and overlapping speech. A reader building or evaluating full-duplex agents would get usable details on the fusion and the dataset. It deserves peer review because the work is grounded, the experiments are properly scoped, and the contribution is clear enough to benefit from referee input on metrics and extensions.

Referee Report

2 major / 2 minor

Summary. The paper proposes FastTurn, a unified framework for low-latency turn detection in full-duplex spoken dialogue systems. It combines streaming CTC decoding (with prefix scoring) and acoustic features to enable early decisions from partial observations while preserving semantic information. The authors also release a new test set derived from real human dialogues that captures overlapping speech, backchannels, pauses, pitch variation, and environmental noise. Experiments, including ablations, latency measurements, and robustness tests under noise and overlap, claim that FastTurn achieves higher decision accuracy and lower interruption latency than representative baselines.

Significance. If the reported results hold, this work would advance practical full-duplex dialogue systems by addressing the latency-semantics tradeoff that limits current voice-activity or full-ASR approaches. The new real-dialogue test set fills an important evaluation gap, and the explicit inclusion of ablations, latency metrics, and robustness experiments under realistic conditions (overlap, noise) strengthens the contribution. The approach builds on established CTC techniques without introducing new free parameters or circular derivations.

major comments (2)

[§3.2] §3.2, CTC prefix scoring and fusion: The description of how partial CTC outputs are combined with acoustic encoder features to produce early semantic cues does not quantify the prefix error rate or provide an error analysis for incomplete observations; this is load-bearing for the claim that the fusion yields reliable cues without introducing errors from partial data.
[§4.3] §4.3, test-set construction: While aggregate statistics on overlap, backchannels, and noise are given, the paper does not detail the annotation protocol or inter-annotator agreement for turn-transition labels; without this, it is difficult to assess whether the set sufficiently represents deployment conditions for the robustness claims.

minor comments (2)

[Figure 2] Figure 2: The diagram of the fusion module would benefit from explicit labels on the acoustic and CTC streams and a legend for the attention weights.
[§4.1] §4.1: The baseline implementations (especially the ASR-based one) are summarized too briefly; adding a short paragraph on the exact ASR model and decision threshold would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and the constructive comments. We address each major point below and will revise the manuscript accordingly.

read point-by-point responses

Referee: [§3.2] §3.2, CTC prefix scoring and fusion: The description of how partial CTC outputs are combined with acoustic encoder features to produce early semantic cues does not quantify the prefix error rate or provide an error analysis for incomplete observations; this is load-bearing for the claim that the fusion yields reliable cues without introducing errors from partial data.

Authors: We agree that quantifying prefix error rates strengthens the claim. In the revised version we will add to §3.2 a table reporting CTC character error rate (CER) and word error rate (WER) on prefixes at 20 %, 40 %, 60 %, and 80 % completion, both before and after fusion with the acoustic encoder. The analysis will show that fusion reduces the error introduced by incomplete observations, thereby supporting the reliability of the early semantic cues. revision: yes
Referee: [§4.3] §4.3, test-set construction: While aggregate statistics on overlap, backchannels, and noise are given, the paper does not detail the annotation protocol or inter-annotator agreement for turn-transition labels; without this, it is difficult to assess whether the set sufficiently represents deployment conditions for the robustness claims.

Authors: We acknowledge the omission. We will expand §4.3 with a description of the annotation protocol: two expert annotators independently labeled turn-transition points using both audio and transcripts; disagreements were resolved by a third annotator. We will also report inter-annotator agreement (Cohen’s κ = 0.82). These additions will substantiate the reliability of the test set for the robustness experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

The paper presents FastTurn as a fusion of established streaming CTC decoding and acoustic feature processing for turn detection. No equations, derivations, or self-referential fitting steps are described in the provided text. The approach combines known techniques without reducing predictions to fitted inputs by construction, and the central claims rest on experimental evaluations rather than self-citation chains or ansatzes imported from prior author work. The new test set is described with explicit statistics, providing external grounding. This is a standard non-circular engineering paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; the framework is presented as an engineering combination of standard CTC decoding and acoustic feature extraction.

pith-pipeline@v0.9.0 · 5525 in / 1064 out tokens · 46028 ms · 2026-05-13T20:52:02.722754+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

34 extracted references · 34 canonical work pages · 5 internal anchors

[1]

Introduction In recent years, rapid advances in AudioLLMs [1, 2, 3, 4, 5, 6] have enabled spoken dialogue systems to move beyond tra- ditional turn-based interaction toward more natural real-time communication. In highly interactive scenarios, this evolution leads to full-duplex interaction, where the system must pro- cess speech perception, partial seman...

work page internal anchor Pith review Pith/arXiv arXiv 2026
[2]

Architecture As shown in Figure 1, the framework consists of three compo- nents: FastTurn-Semantic, an acoustic adapter, and a turn detec- tor

FastTurn 2.1. Architecture As shown in Figure 1, the framework consists of three compo- nents: FastTurn-Semantic, an acoustic adapter, and a turn detec- tor. The architecture is developed in three steps. First, we use FastTurn-Cascaded to route a fast CTC transcript to an LLM for low-latency decisions, then progressively introduce speech- derived cues to ...

work page
[3]

Com” and “Inc

Experiments 3.1. Datasets ASR Task. We use large-scale open-source corpora and in- ternal datasets, including AISHELL-1 [23], AISHELL-2 [24], WenetSpeech [25], LibriSpeech [26], GigaSpeech [27], and MLS [28], totaling over 30,000 hours of Chinese and English speech to support robust feature learning. Turn Detection Task.We use the Easy Turn training set, ...

work page arXiv
[4]

By utilizing fast CTC decod- ing and integrating acoustic features, FastTurn reduces latency and enhances robustness

Conclusion This paper presents FastTurn, a framework for efficient turn de- tection in full-duplex systems. By utilizing fast CTC decod- ing and integrating acoustic features, FastTurn reduces latency and enhances robustness. We release a comprehensive test set to promote research on conversational turn-taking and speech overlap phenomena, specifically de...

work page
[5]

Audiogpt: Understanding and generating speech, music, sound, and talking head,

R. Huang, M. Li, D. Yang, J. Shi, X. Chang, Z. Ye, Y . Wu, Z. Hong, J. Huang, J. Liuet al., “Audiogpt: Understanding and generating speech, music, sound, and talking head,” inProceed- ings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 21, 2024, pp. 23 802–23 804

work page 2024
[6]

Qwen2-Audio Technical Report

Y . Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y . Leng, Y . Lv, J. He, J. Linet al., “Qwen2-audio technical report,”arXiv preprint arXiv:2407.10759, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[7]

Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,

W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y . Wang, and C. Zhang, “Salmonn-omni: A codec-free llm for full-duplex speech understanding and generation,”arXiv preprint arXiv:2411.18138, 2024

work page arXiv 2024
[8]

Glm-4-voice: Towards intelligent and human-like end- to-end spoken chatbot,

A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y . Dong, and J. Tang, “Glm-4-voice: Towards intelligent and human-like end- to-end spoken chatbot,”arXiv preprint arXiv:2412.02612, 2024

work page arXiv 2024
[9]

Mini-omni: Language models can hear, talk while thinking in streaming,

Z. Xie and C. Wu, “Mini-omni: Language models can hear, talk while thinking in streaming,”arXiv preprint arXiv:2408.16725, 2024

work page arXiv 2024
[10]

Osum-echat: Enhancing end-to- end empathetic spoken chatbot via understanding-driven spoken dialogue,

X. Geng, Q. Shao, H. Xue, S. Wang, H. Xie, Z. Guo, Y . Zhao, G. Li, W. Tian, C. Wanget al., “Osum-echat: Enhancing end-to- end empathetic spoken chatbot via understanding-driven spoken dialogue,”arXiv preprint arXiv:2508.09600, 2025

work page arXiv 2025
[11]

Duplexmamba: Enhancing real-time speech conver- sations with duplex and streaming capabilities,

X. Lu, W. Xu, H. Wang, H. Zhou, H. Zhao, C. Zhu, T. Zhao, and M. Yang, “Duplexmamba: Enhancing real-time speech conver- sations with duplex and streaming capabilities,” inCCF Interna- tional Conference on Natural Language Processing and Chinese Computing. Springer, 2025, pp. 62–74

work page 2025
[12]

Towards building an intelligent chatbot for customer service: Learning to respond at the appropriate time,

C. Liu, J. Jiang, C. Xiong, Y . Yang, and J. Ye, “Towards building an intelligent chatbot for customer service: Learning to respond at the appropriate time,” inProceedings of the 26th ACM SIGKDD international conference on Knowledge Discovery & Data Min- ing, 2020, pp. 3377–3385

work page 2020
[13]

Google duplex: An ai system for accomplishing real-world tasks over the phone,

Y . Leviathan and Y . Matias, “Google duplex: An ai system for accomplishing real-world tasks over the phone,”Google AI blog, vol. 8, 2018

work page 2018
[14]

Moshi: a speech-text foundation model for real-time dialogue

A. D ´efossez, L. Mazar´e, M. Orsini, A. Royer, P. P´erez, H. J´egou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,”arXiv preprint arXiv:2410.00037, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[15]

Freeze-omni: A smart and low latency speech-to-speech dia- logue model with frozen llm,

X. Wang, Y . Li, C. Fu, Y . Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dia- logue model with frozen llm,”arXiv preprint arXiv:2411.00774, 2024

work page arXiv 2024
[16]

Fireredchat: A pluggable, full-duplex voice in- teraction system with cascaded and semi-cascaded implementa- tions,

J. Chen, Y . Hu, J. Li, K. Li, K. Liu, W. Li, X. Li, Z. Li, F. Shen, X. Tanget al., “Fireredchat: A pluggable, full-duplex voice in- teraction system with cascaded and semi-cascaded implementa- tions,”arXiv preprint arXiv:2509.06502, 2025

work page arXiv 2025
[17]

Vita-1.5: Towards gpt-4o level real-time vision and speech interaction,

C. Fu, H. Lin, X. Wang, Y .-F. Zhang, Y . Shen, X. Liu, H. Cao, Z. Long, H. Gao, K. Liet al., “Vita-1.5: Towards gpt-4o level real-time vision and speech interaction,”arXiv preprint arXiv:2501.01957, 2025

work page arXiv 2025
[18]

Easy turn: Integrating acous- tic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems,

G. Li, C. Wang, H. Xue, S. Wang, D. Gao, Z. Zhang, Y . Lin, W. Li, L. Xiao, Z. Fu, and L. Xie, “Easy turn: Integrating acous- tic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems,” inIEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2026

work page 2026
[19]

The ami meeting corpus,

W. Kraaij, T. Hain, M. Lincoln, and W. Post, “The ami meeting corpus,” inProc. International Conference on Methods and Tech- niques in Behavioral Research, 2005, pp. 1–4

work page 2005
[20]

Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset,

Z. Yang, Y . Chen, L. Luo, R. Yang, L. Ye, G. Cheng, J. Xu, Y . Jin, Q. Zhang, P. Zhanget al., “Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset,”arXiv preprint arXiv:2203.16844, 2022

work page arXiv 2022
[21]

Full-duplex-bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities

G.-T. Lin, J. Lian, T. Li, Q. Wang, G. Anumanchipalli, A. H. Liu, and H.-y. Lee, “Full-duplex-bench: A benchmark to evaluate full- duplex spoken dialogue models on turn-taking capabilities,”arXiv preprint arXiv:2503.04721, 2025

work page arXiv 2025
[22]

Fd-bench: A full-duplex benchmarking pipeline de- signed for full duplex spoken dialogue systems,

Y . Peng, Y .-W. Chao, D. Ng, Y . Ma, C. Ni, B. Ma, and E. S. Chng, “Fd-bench: A full-duplex benchmarking pipeline de- signed for full duplex spoken dialogue systems,”arXiv preprint arXiv:2507.19040, 2025

work page arXiv 2025
[23]

Con- nectionist temporal classification: labelling unsegmented se- quence data with recurrent neural networks,

A. Graves, S. Fern ´andez, F. Gomez, and J. Schmidhuber, “Con- nectionist temporal classification: labelling unsegmented se- quence data with recurrent neural networks,” inProceedings of the 23rd international conference on Machine learning, 2006, pp. 369–376

work page 2006
[24]

Qwen3 Technical Report

A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lvet al., “Qwen3 technical report,”arXiv preprint arXiv:2505.09388, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

DeepSeek-V3 Technical Report

A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruanet al., “Deepseek-v3 technical re- port,”arXiv preprint arXiv:2412.19437, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech.arXiv preprint arXiv:2506.21619, 2025

S. Zhou, Y . Zhou, Y . He, X. Zhou, J. Wang, W. Deng, and J. Shu, “Indextts2: A breakthrough in emotionally expressive and duration-controlled auto-regressive zero-shot text-to-speech,” arXiv preprint arXiv:2506.21619, 2025

work page arXiv 2025
[27]

Aishell-1: An open- source mandarin speech corpus and a speech recognition base- line,

H. Bu, J. Du, X. Na, B. Wu, and H. Zheng, “Aishell-1: An open- source mandarin speech corpus and a speech recognition base- line,” in2017 20th conference of the oriental chapter of the inter- national coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA). IEEE, 2017, pp. 1–5

work page 2017
[28]

Aishell-2: Transform- ing mandarin asr research into industrial scale,

J. Du, X. Na, X. Liu, and H. Bu, “Aishell-2: Transform- ing mandarin asr research into industrial scale,”arXiv preprint arXiv:1808.10583, 2018

work page arXiv 2018
[29]

Wenetspeech: A 10000+ hours multi- domain mandarin corpus for speech recognition,

B. Zhang, H. Lv, P. Guo, Q. Shao, C. Yang, L. Xie, X. Xu, H. Bu, X. Chen, C. Zenget al., “Wenetspeech: A 10000+ hours multi- domain mandarin corpus for speech recognition,” inIEEE Inter- national Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2022, pp. 6182–6186

work page 2022
[30]

Lib- rispeech: an asr corpus based on public domain audio books,

V . Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- rispeech: an asr corpus based on public domain audio books,” in2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE, 2015, pp. 5206–5210

work page 2015
[31]

Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,

G. Chen, S. Chai, G. Wang, J. Du, W.-Q. Zhang, C. Weng, D. Su, D. Povey, J. Trmal, J. Zhanget al., “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,” arXiv preprint arXiv:2106.06909, 2021

work page arXiv 2021
[32]

Mls: A large-scale multilingual dataset for speech research,

V . Pratap, Q. Xu, A. Sriram, G. Synnaeve, and R. Collobert, “Mls: A large-scale multilingual dataset for speech research,” arXiv preprint arXiv:2012.03411, 2020

work page arXiv 2012
[33]

Conformer: Convolution- augmented transformer for speech recognition,

A. Gulati, J. Qin, C.-C. Chiu, N. Parmar, Y . Zhang, J. Yu, W. Han, S. Wang, Z. Zhang, Y . Wuet al., “Conformer: Convolution- augmented transformer for speech recognition,”arXiv preprint arXiv:2005.08100, 2020

work page arXiv 2005
[34]

Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,

Z. Gao, S. Zhang, I. McLoughlin, and Z. Yan, “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,”arXiv preprint arXiv:2206.08317, 2022

work page arXiv 2022