arxiv: 2606.18094 · v1 · pith:JJFE4ATMnew · submitted 2026-06-16 · 💻 cs.SD

Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction

Pith reviewed 2026-06-26 22:40 UTC · model grok-4.3

classification 💻 cs.SD
keywords endpoint detection · streaming speech · speech onset prediction · turn-taking · duration-aware modeling · semantic EPD · acoustic baselines

The pith

A streaming endpoint detector trained to predict time until next speech onset outperforms acoustic and semantic baselines without extra labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Next-Turn, which trains a streaming endpoint detector on a regression target, the time until the next speech onset, in place of or alongside binary endpoint labels. Targets are derived directly from existing speech timestamps, which meets streaming constraints and avoids ambiguous human annotation. Experiments show a 25.9-point absolute gain in endpoint accuracy within 320 ms over the strongest baseline, plus further benefit when the new objective is trained jointly with standard binary detection. The approach targets mid-utterance pauses, which confuse conventional detectors.

Core claim

Next-Turn shows that training on the time-to-next-speech-onset objective supplies unambiguous supervision for streaming endpoint detection. Targets come directly from speech timestamps, satisfy strict latency limits, and produce higher endpoint accuracy within short windows than acoustic or recent semantic baselines. When the same model is trained jointly with conventional binary detection, performance improves further as pause lengths increase.
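As a rough illustration of what joint training could look like, the sketch below adds a binary cross-entropy term on the per-frame endpoint probability to a weighted L1 term on the predicted time-to-next-onset. This page does not state the paper's actual loss functions or weighting, so the function name `joint_loss`, the choice of L1, and the `weight` parameter are all assumptions made for illustration.

```python
import math


def joint_loss(p_end, y, tau_hat, tau, weight=1.0, eps=1e-7):
    """Hypothetical joint objective: mean BCE on endpoint labels plus weighted mean L1 on durations.

    p_end   -- predicted endpoint probabilities per frame (binary head)
    y       -- binary endpoint labels per frame (0 or 1)
    tau_hat -- predicted time-to-next-onset per frame (duration head)
    tau     -- target time-to-next-onset per frame
    weight  -- assumed trade-off factor between the two terms
    """
    bce = 0.0
    l1 = 0.0
    for p, label, pred, target in zip(p_end, y, tau_hat, tau):
        p = min(max(p, eps), 1.0 - eps)  # clamp to keep log() finite
        bce -= label * math.log(p) + (1 - label) * math.log(1.0 - p)
        l1 += abs(pred - target)
    n = len(y)
    return (bce + weight * l1) / n
```

Setting `weight=0` reduces this to the binary-only objective, which is one way to compare single-task and joint variants under the same code path.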

What carries the argument

The time-to-next-speech-onset regression objective, which generates training targets from speech timestamps alone.
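A minimal sketch of how such targets could be computed from timestamps, following the three-region definition in the paper's Figure 1 (zero inside speech, time to the next onset inside a mid-utterance pause, a constant tau_max after the last segment). The function name, the half-open interval convention, and the treatment of silence before the first segment as a countdown to the first onset are assumptions, not details confirmed by the paper.

```python
from bisect import bisect_right


def duration_targets(segments, frame_times, tau_max):
    """Return the time-to-next-speech-onset target for each frame time.

    segments    -- sorted, non-overlapping (start, end) speech intervals in seconds,
                   treated here as half-open [start, end) (an assumption)
    frame_times -- times in seconds at which targets are needed
    tau_max     -- constant target used in post-utterance silence
    """
    starts = [start for start, _ in segments]
    targets = []
    for t in frame_times:
        i = bisect_right(starts, t) - 1  # index of the last segment starting at or before t
        if i >= 0 and t < segments[i][1]:
            targets.append(0.0)  # inside a speech segment
        elif i + 1 < len(segments):
            targets.append(starts[i + 1] - t)  # pause: count down to the next onset
        else:
            targets.append(tau_max)  # no further speech: post-utterance silence
    return targets
```

For example, with speech at 0.5 to 1.0 s and 1.5 to 2.0 s, a frame at 1.25 s sits in a mid-utterance pause and receives a target of 0.25 s, while a frame at 2.5 s receives `tau_max`.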

If this is right

  • Endpoint accuracy within 320 ms rises by 25.9 percentage points over the strongest prior baseline.
  • Joint training with binary endpoint detection produces gains that increase steadily with longer pauses.
  • No additional human annotation is needed beyond the speech timestamps already present in the data.
  • The method meets streaming constraints while reducing errors caused by hesitations and disfluencies.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Voice interfaces could wait longer during natural pauses without risking missed turns.
  • The same timestamp-derived target might transfer to other real-time audio segmentation tasks.
  • Results on multi-speaker or accented data would test whether timestamp quality limits the gains.
  • Adding language-model context could further sharpen the onset-time predictions.

Load-bearing premise

Speech timestamps supply enough clear information to train models that correctly mark utterance ends even when speakers pause mid-turn.

What would settle it

An evaluation on new utterances where the time-to-next-speech-onset model shows no accuracy advantage, or lower accuracy, than the strongest baseline for endpoint decisions reached inside 320 ms.
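This page does not give the exact definition of the accuracy metric, so the following is only one plausible reading: a detection counts as correct when it fires at or after the true endpoint and no more than 320 ms later. The function name `acc_within`, the treatment of early or missing detections as errors, and the per-utterance averaging are assumptions.

```python
def acc_within(detected, true_ends, window_s=0.32):
    """Fraction of utterances whose detected endpoint lies in [true_end, true_end + window_s].

    detected  -- detected endpoint time per utterance in seconds, or None if nothing fired
    true_ends -- reference endpoint time per utterance in seconds
    window_s  -- allowed delay after the true endpoint (0.32 s for a 320 ms window)
    """
    hits = 0
    for d, e in zip(detected, true_ends):
        # Early triggers (d < e) and missing detections count as errors in this reading.
        if d is not None and 0.0 <= d - e <= window_s:
            hits += 1
    return hits / len(true_ends)
```

Under this reading, a reported gain of 25.9 percentage points corresponds to a difference of 0.259 between the values this function returns for two systems on the same test set.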

Figures

Figures reproduced from arXiv: 2606.18094 by Huu Quyen Dang, Jiajun Deng, Nikita Kuzmin, Simon Lui, Tao Zhong, Tianxiang Cao, Tristan Tsoi, Yingke Zhu.

Figure 1
Figure 1 illustrates the duration target τ(t) across three regions: a) in a speech segment, b) in a mid-utterance pause, and c) in post-utterance silence. At time t, τ(t) is defined as the remaining time until the next speech onset:
$$\tau(t) = \begin{cases} 0, & t \text{ is in a speech segment;} \\ t_{\text{onset}} - t, & t \text{ is in a mid-utterance pause;} \\ \tau_{\max}, & t \text{ is in post-utterance silence.} \end{cases} \tag{2}$$
Figure 2
Figure 2: Overview of the proposed Next-Turn framework. a) Single-task with a binary head. b) Single-task with a duration head. c) Joint with a shared encoder and task-specific heads. At inference, the system can use the binary score, the duration-derived score, or a fusion of both (Sec. 3.4). 4. Experiments 4.1. Dataset: We train on an in-house corpus of 1,177 hours of Chinese speech (1,097,898 utterances), spannin…
Figure 3
Figure 3: Absolute improvement in ACC320 over the binary baseline for Single REG and Joint CLS with binary inference, broken down by mid-utterance pause count.
Figure 4
Figure 4: Early interruption (EI, left) and response latency (RL, ms, right) of the overall best-performing system as a function of past (P) and future (F) context chunks. Context Window Analysis: We investigate the tradeoff between early interruption and response latency when including past (P) and future (F) context chunks in score post-processing. Incorporating future context can reduce EI but adds a look-ahead…
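The context-window idea in the Figure 4 caption can be sketched as a simple smoothing step over per-chunk endpoint scores. The paper's actual post-processing is not described on this page; the mean over the window, the function names, and the 160 ms chunk size used in the usage note are assumptions chosen only to make the P/F tradeoff concrete.

```python
def smooth_scores(scores, past=2, future=1):
    """Average each chunk's score with up to `past` earlier and `future` later chunks.

    The window is truncated at the sequence boundaries. Using a mean is an
    illustrative assumption; the paper may use a different aggregation.
    """
    n = len(scores)
    smoothed = []
    for i in range(n):
        lo = max(0, i - past)
        hi = min(n, i + future + 1)
        window = scores[lo:hi]
        smoothed.append(sum(window) / len(window))
    return smoothed


def lookahead_delay_ms(future, chunk_ms):
    """Delay added by waiting for `future` chunks before deciding on the current one."""
    return future * chunk_ms
```

With a hypothetical 160 ms chunk, `lookahead_delay_ms(2, 160)` gives 320, which shows why each extra future chunk trades response latency for robustness against early interruption.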
Original abstract

Endpoint detection (EPD) is essential for natural turn-taking in streaming speech systems. However, reliably determining the endpoint of an utterance is challenging because speakers often pause mid-utterance due to hesitations and disfluencies. Semantic EPD has emerged as a promising direction to address this issue but is hindered by ambiguous supervision and strict streaming constraints. We propose Next-Turn that uses the time-to-next-speech-onset as the training objective, where targets are derived directly from speech timestamps and require no additional annotation. Experiments show that the proposed method outperforms conventional acoustic and recent semantic EPD baselines, achieving a 25.9% absolute improvement in endpoint accuracy within 320 ms over the strongest baseline. In addition, joint training with the duration-aware objective complements standard binary EPD, with gains that increase monotonically with increasing pauses.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Next-Turn, a streaming endpoint detection method that trains on a time-to-next-speech-onset regression objective whose targets are derived directly from speech timestamps (no additional annotation required). It claims this duration-aware objective outperforms both conventional acoustic EPD and recent semantic EPD baselines, delivering a 25.9% absolute gain in endpoint accuracy within 320 ms over the strongest baseline, and that joint training with standard binary EPD yields monotonically increasing gains as pause length grows.

Significance. If the experimental claims are substantiated, the work would be significant: it supplies a lightweight, annotation-free supervision signal that explicitly models utterance duration and thereby addresses the core difficulty of mid-utterance disfluencies under strict streaming constraints. The monotonic improvement with pause length, if reproducible, would constitute a falsifiable prediction that distinguishes the approach from prior semantic EPD methods.

major comments (2)
  1. [Abstract] The central claim of a 25.9% absolute improvement in endpoint accuracy within 320 ms is stated without any description of model architecture, training procedure, dataset, baseline implementations, or the precise definition of the accuracy metric, rendering the support for the primary result impossible to evaluate.
  2. [Abstract] The assertion that timestamp-derived targets supply unambiguous supervision for disfluent pauses is load-bearing for attributing the reported gains to the duration-aware objective rather than label artifacts, yet the text provides no information on onset definition, VAD threshold, or handling of multiple short pauses within a single disfluency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed review and constructive feedback on our manuscript. We address the major comments point by point below, indicating where revisions will be made to improve clarity and completeness.

  1. Referee: [Abstract] The central claim of a 25.9% absolute improvement in endpoint accuracy within 320 ms is stated without any description of model architecture, training procedure, dataset, baseline implementations, or the precise definition of the accuracy metric, rendering the support for the primary result impossible to evaluate.

    Authors: We acknowledge that the abstract, due to its length constraints, does not include these experimental details. These are described in the Methods and Experiments sections of the full manuscript. To address the concern and make the primary result more evaluable from the abstract, we will revise the abstract to include brief mentions of the model architecture, training procedure, dataset, baselines, and the definition of the accuracy metric. revision: yes

  2. Referee: [Abstract] The assertion that timestamp-derived targets supply unambiguous supervision for disfluent pauses is load-bearing for attributing the reported gains to the duration-aware objective rather than label artifacts, yet the text provides no information on onset definition, VAD threshold, or handling of multiple short pauses within a single disfluency.

    Authors: The targets are derived from speech timestamps obtained via forced alignment on the audio recordings, which provides direct and unambiguous next-onset times without manual annotation for disfluencies. However, we agree that the manuscript lacks explicit details on the precise onset definition, VAD threshold used for timestamp extraction, and handling of multiple short pauses in disfluencies. We will add this information to Section 3.2 of the revised manuscript, specifying the VAD parameters, the rule for selecting the next onset, and how clustered pauses are treated. revision: yes

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

The paper defines its core training objective (time-to-next-speech-onset) directly from existing speech timestamps with no additional annotation required. This is a standard label extraction step in supervised learning and does not reduce any claimed prediction or result to the input by construction. No equations, self-citations, or uniqueness claims are shown that would create self-definitional, fitted-input, or load-bearing citation loops. The reported gains are presented as empirical comparisons against external baselines, satisfying the criteria for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Review is limited to the abstract; the central assumption is that timestamp-derived targets suffice without semantic labels or violating streaming constraints.

axioms (2)
  • domain assumption: Speech timestamps can be used to derive targets for time-to-next-speech-onset without additional annotation.
    Stated directly in the abstract as the source of training targets.
  • domain assumption: The method satisfies strict streaming constraints while using the new objective.
    Abstract notes that semantic EPD is hindered by strict streaming constraints and presents the method as addressing this.

pith-pipeline@v0.9.1-grok · 5691 in / 1338 out tokens · 41818 ms · 2026-06-26T22:40:05.042376+00:00 · methodology

Reference graph

Works this paper leans on

38 extracted references · 13 canonical work pages · 3 internal anchors

  1. [1]

    Next-Turn: Duration-Aware Streaming Endpoint Detection via Time-to-Next-Speech-Onset Prediction

    Introduction The growing demand for highly interactive speech interfaces has driven widespread adoption of streaming speech models, particularly for full-duplex conversation [1, 2, 3] and real-time speech translation [4, 5]. In these settings, the ability to determine when a user has finished a meaningful speech unit is critical for both perceived res...

  2. [2]

    Binary Endpoint Detection We first describe the binary semantic EPD, which serves both as a baseline and as the foundation for the duration-aware extension in Section 3. 2.1. Binary Formulation Binary EPD aims to determine whether an utterance has reached endpoint by time t. Let t_end denote the time when the speech ends. We define the binary target y(t) as...

  3. [3]

    The duration prediction objective can be used either as an alternative to the binary formulation or jointly with it during training

    Duration-Aware Endpoint Detection We propose a semantic EPD framework based on time-to-next-onset prediction, which provides graded temporal supervision. The duration prediction objective can be used either as an alternative to the binary formulation or jointly with it during training. All duration-aware variants use the same architecture as the bina...

  4. [4]

    Dataset We train on an in-house corpus of 1,177 hours of Chinese speech (1,097,898 utterances), spanning conversational, command-style, and question–answer scenarios, at 16 kHz

    Experiments 4.1. Dataset We train on an in-house corpus of 1,177 hours of Chinese speech (1,097,898 utterances), spanning conversational, command-style, and question–answer scenarios, at 16 kHz. The training set shows a long-tailed distribution over pause counts. For evaluation, we hold out 1,185 utterances disjoint from the training data and manually l...

  5. [5]

    LoRA matrices with rank r=8, scaling factor α=32, and dropout p=0.05 are inserted into the query, key, and value projections of every encoder block

    encoder unless otherwise stated. LoRA matrices with rank r=8, scaling factor α=32, and dropout p=0.05 are inserted into the query, key, and value projections of every encoder block. Models are optimized with AdamW (β1=0.9, β2=0.98, weight decay 0.1) using a learning rate of 1×10⁻⁴, batch size of 32 across 8 GPUs with 4-step gradient accumulation, bf16 mixed p...

  6. [6]

    Conclusion We presented Next-Turn, a duration-aware streaming EPD framework that predicts the time-to-next-speech-onset as a supervision signal derived directly from speech timestamps. Experiments show that the proposed approach outperforms conventional acoustic and recent semantic EPD baselines, achieving a 25.9% absolute improvement in ACC320 over the s...

  7. [7]

    All AI-assisted content was reviewed and edited by the authors

    Use of Generative AI Disclosure Generative AI tools were used for minor editing and language improvement. All AI-assisted content was reviewed and edited by the authors

  8. [8]

    FlexDuo: A pluggable system for enabling full-duplex capabilities in speech dialogue systems,

    B. Liao, Y. Xu, J. Ou, K. Yang, W. Jian, P. Wan, and D. Zhang, “FlexDuo: A pluggable system for enabling full-duplex capabilities in speech dialogue systems,” arXiv preprint arXiv:2502.13472, 2025

  9. [9]

    FireRed-Chat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations,

    J. Chen, Y. Hu, J. Li, K. Li, K. Liu, W. Li, Z. Li, F. Shen, X. Tang, M. Wei, Y. Wu, F. Xie, K. Xu, and K. Xie, “FireRed-Chat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations,” arXiv preprint arXiv:2509.06502, 2025

  10. [10]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: A speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024

  11. [11]

    Seamless: Multilingual expressive and streaming speech translation,

    Seamless Communication, L. Barrault, Y.-A. Chung, M. C. Meglioli, D. Dale, N. Dong, M. Duppenthaler, P.-A. Duquenne, B. Ellis, H. Elsahar et al., “Seamless: Multilingual expressive and streaming speech translation,” arXiv preprint arXiv:2312.05187, 2023

  12. [12]

    Simulspeech: End-to-end simultaneous speech to text translation,

    Y. Ren, J. Liu, X. Tan, C. Zhang, T. Qin, Z. Zhao, and T.-Y. Liu, “Simulspeech: End-to-end simultaneous speech to text translation,” in Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. Online: Association for Computational Linguistics, Jul. 2020, pp. 3787–3796. [Online]. Available: https://aclanthology.org/2020.acl-...

  13. [13]

    Recurrent neural networks for voice activity detection,

    T. Hughes and K. Mierle, “Recurrent neural networks for voice activity detection,” in Proc. ICASSP, 2013, pp. 7378–7382

  14. [14]

    Deep belief networks based voice activity detection,

    X. Zhang and J. Wu, “Deep belief networks based voice activity detection,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 4, pp. 697–710, 2013. [Online]. Available: https://doi.org/10.1109/TASL.2012.2229986

  15. [15]

    Speech activity detection on youtube using deep neural networks,

    N. Ryant, M. Liberman, and J. Yuan, “Speech activity detection on youtube using deep neural networks,” in Interspeech 2013, 2013, pp. 728–731. [Online]. Available: https://www.isca-archive.org/interspeech_2013/ryant13_interspeech.html

  16. [16]

    Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody,

    L. Ferrer, E. Shriberg, and A. Stolcke, “Is the speaker done yet? Faster and more accurate end-of-utterance detection using prosody,” in Proc. ICSLP, 2002

  17. [17]

    A statistical model-based voice activity detection,

    J. Sohn, N. S. Kim, and W. Sung, “A statistical model-based voice activity detection,” IEEE Signal Processing Letters, vol. 6, no. 1, pp. 1–3, 1999

  18. [18]

    Pauses, gaps and overlaps in conversations,

    M. Heldner and J. Edlund, “Pauses, gaps and overlaps in conversations,” Journal of Phonetics, vol. 38, no. 4, pp. 555–568, 2010

  19. [19]

    Towards fast and accurate streaming end-to-end ASR,

    B. Li, S.-y. Chang, T. N. Sainath, R. Pang, Y. He, T. Strohman, and Y. Wu, “Towards fast and accurate streaming end-to-end ASR,” in Proc. ICASSP, 2020, pp. 6069–6073. [Online]. Available: https://ieeexplore.ieee.org/document/9054715

  20. [20]

    E2e segmenter: Joint segmenting and decoding for long-form ASR,

    W. R. Huang, S.-Y. Chang, D. Rybach, T. N. Sainath, R. Prabhavalkar, C. Peyser, Z. Lu, and C. Allauzen, “E2e segmenter: Joint segmenting and decoding for long-form ASR,” in Interspeech 2022, 2022, pp. 4995–4999. [Online]. Available: https://www.isca-archive.org/interspeech_2022/huang22_interspeech.html

  21. [21]

    Turn-taking and backchannel prediction with acoustic and large language model fusion,

    J. Wang, L. Chen, A. Khare, A. Raju, P. Dheram, D. He, M. Wu, A. Stolcke, and V. Ravichandran, “Turn-taking and backchannel prediction with acoustic and large language model fusion,” in Proc. ICASSP, 2024

  22. [22]

    LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

    H. Zhang, W. Li, R. Chen, V. Kothapally, M. Yu, and D. Yu, “LLM-enhanced dialogue management for full-duplex spoken dialogue systems,” arXiv preprint arXiv:2502.14145, 2025

  23. [23]

    Unified end-to-end speech recognition and endpointing for fast and efficient speech systems,

    S. Bijwadia, S.-y. Chang, B. Li, T. N. Sainath, C. Zhang, and Y. He, “Unified end-to-end speech recognition and endpointing for fast and efficient speech systems,” in Proc. IEEE SLT, 2022, pp. 310–316

  24. [24]

    Phoenix-VAD: Streaming semantic endpoint detection for full-duplex speech interaction,

    W. Wu, W. Guan, K. Wang, P. Chen, Z. Zha, J. Li, J. Fang, L. Li, and Q. Hong, “Phoenix-VAD: Streaming semantic endpoint detection for full-duplex speech interaction,” arXiv preprint arXiv:2509.20410, 2025

  25. [25]

    Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems,

    G. Li, C. Wang, H. Xue, S. Wang, D. Gao, Z. Zhang, Y. Lin, W. Li, L. Xiao, Z. Fu, and L. Xie, “Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems,” arXiv preprint arXiv:2509.23938, 2025

  26. [26]

    Semantic VAD: Low-latency voice activity detection for speech interaction,

    M. Shi, Y. Shu, L. Zuo, Q. Chen, S. Zhang, J. Zhang, and L.-R. Dai, “Semantic VAD: Low-latency voice activity detection for speech interaction,” arXiv preprint arXiv:2305.12450, 2023

  27. [27]

    Two-pass endpoint detection for speech recognition,

    A. Raju, A. Khare, D. He, I. Sklyar, L. Chen, S. Alptekin, V. A. Trinh, Z. Zhang, C. Vaz, V. Ravichandran, R. Maas, and A. Rastrow, “Two-pass endpoint detection for speech recognition,” arXiv preprint arXiv:2401.08916, 2024. [Online]. Available: https://arxiv.org/pdf/2401.08916

  28. [28]

    Accurate endpointing with expected pause duration,

    B. Liu, B. Hoffmeister, and A. Rastrow, “Accurate endpointing with expected pause duration,” in Interspeech 2015, 2015, pp. 2912–2916. [Online]. Available: https://www.isca-archive.org/interspeech_2015/liu15d_interspeech.html

  29. [29]

    Projecting the end of a speaker’s turn: A cognitive cornerstone of conversation,

    J. P. De Ruiter, H. Mitterer, and N. J. Enfield, “Projecting the end of a speaker’s turn: A cognitive cornerstone of conversation,” Language, vol. 82, no. 3, pp. 515–535, 2006

  30. [30]

    Voice activity projection: Self-supervised learning of turn-taking events,

    E. Ekstedt and G. Skantze, “Voice activity projection: Self-supervised learning of turn-taking events,” arXiv preprint arXiv:2205.09812, 2022

  31. [31]

    Robust speech recognition via large-scale weak supervision,

    A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, “Robust speech recognition via large-scale weak supervision,” in Proc. ICML, 2023

  32. [32]

    LoRA: Low-rank adaptation of large language models,

    E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-rank adaptation of large language models,” in Proc. ICLR, 2022

  33. [33]

    Rectified linear units improve restricted boltzmann machines,

    V. Nair and G. E. Hinton, “Rectified linear units improve restricted boltzmann machines,” in Proceedings of the 27th International Conference on Machine Learning (ICML), 2010, pp. 807–814

  34. [34]

    Dropout: A simple way to prevent neural networks from overfitting,

    N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 56, pp. 1929–1958, 2014

  35. [35]

    TEN turn detection,

    TEN Framework, “TEN turn detection,” https://github.com/ten-framework/ten-turn-detection, 2024

  36. [36]

    The Kaldi speech recognition toolkit,

    D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlíček, Y. Qian, P. Schwarz et al., “The Kaldi speech recognition toolkit,” in Proc. ASRU, 2011

  37. [37]

    Silero VAD: pre-trained enterprise-grade voice activity detector,

    S. Team, “Silero VAD: pre-trained enterprise-grade voice activity detector,” 2021

  38. [38]

    Smart turn: Audio-only turn-taking detection,

    Pipecat AI, “Smart turn: Audio-only turn-taking detection,” https://github.com/pipecat-ai/smart-turn, 2024