
arxiv: 2606.11386 · v1 · pith:UDMR6BOPnew · submitted 2026-06-09 · 💻 cs.CL · cs.AI · eess.AS

Overcoming State Inertia in Full-Duplex Spoken Language Models via Activation Steering

Pith reviewed 2026-06-27 13:21 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · eess.AS
keywords full-duplex spoken language models · activation steering · state inertia · perception vector · interruption handling · Zero-Buffer Benchmark · hidden representations · predictive patterns

The pith

Activation steering with a perception vector shifts full-duplex speech models from generative to perceptive state during interruptions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Full-duplex spoken language models maintain stream-specific predictive patterns in their hidden representations, favoring their own output stream while speaking and the user input stream while listening. These patterns reflect an internal switch between a generative state and a perceptive state, yet the switch lags when a user interrupts, leaving the model briefly stuck generating and missing the start of the new input. The authors label this lag state inertia and introduce the Zero-Buffer Benchmark to measure its effect through response correctness and initial-word occurrence rate. They extract a perception vector from the hidden-state differences and demonstrate that adding it to activations steers the model into the perceptive state. On PersonaPlex, this training-free step raises correctness from 28 percent to 45 percent and initial-word occurrence rate from 40 percent to 72 percent, and the authors report gains on several other models as well.

Core claim

Full-duplex spoken language models exhibit stream-specific predictive patterns in hidden representations and dynamically modulate between a generative state aligned with model output and a perceptive state aligned with incoming user input. During abrupt user interruptions the modulation lags, producing a transient bias toward the generative state that causes the model to miss the beginning of the incoming input. Activation steering with a perception vector extracted from the hidden-state analysis shifts the model into the perceptive state and raises response correctness on the Zero-Buffer Benchmark from 28 percent to 45 percent and initial-word occurrence rate from 40 percent to 72 percent on PersonaPlex.

What carries the argument

The perception vector, a direction in activation space obtained by contrasting hidden representations during perceptive versus generative contexts, which when added steers the model's internal predictive focus toward the incoming user stream.
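
One way to picture this mechanism is a difference-of-means steering vector. The sketch below is an illustration under stated assumptions, not the authors' implementation: the function names, the unit-norm normalization, and the array shapes are hypothetical, and the summary does not say how the paper averages tokens or picks layers. The strength `alpha` and the number of steered timesteps `span` are free inputs here; the Figure 19 caption reports α = 5.5 and a span of 3 at layer 23 for PersonaPlex.

```python
import numpy as np


def perception_vector(h_perceptive, h_generative):
    """Contrast two sets of hidden states, each of shape (n_samples, d_model).

    Returns the difference of the two means (perceptive minus generative),
    scaled to unit length. The normalization is an assumption of this sketch.
    """
    v = h_perceptive.mean(axis=0) - h_generative.mean(axis=0)
    return v / np.linalg.norm(v)


def steer(hidden, v, alpha, onset, span):
    """Add alpha * v to the hidden states at timesteps onset .. onset + span - 1.

    hidden has shape (n_timesteps, d_model); the input array is not modified.
    """
    steered = hidden.copy()
    steered[onset:onset + span] += alpha * v
    return steered


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model = 8
    # Synthetic stand-ins for hidden states collected in the two contexts.
    h_gen = rng.normal(size=(200, d_model))
    h_perc = rng.normal(size=(200, d_model))
    h_perc[:, 0] += 3.0  # the two contexts differ along the first axis
    v = perception_vector(h_perc, h_gen)
    trajectory = rng.normal(size=(10, d_model))
    steered = steer(trajectory, v, alpha=5.5, onset=4, span=3)
    print(np.round(v, 2))
    print(np.round((steered - trajectory)[3:8, 0], 2))
```

In a real model the addition would happen inside the forward pass at the chosen layer rather than on a stored array, and only for the few timesteps after an interruption is detected.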

If this is right

  • Improves response correctness from 28 percent to 45 percent and initial-word occurrence rate from 40 percent to 72 percent on PersonaPlex.
  • Produces gains across multiple state-of-the-art full-duplex spoken language models.
  • Requires no fine-tuning and adds only negligible computational overhead.
  • Enables immediate comprehension when user speech begins abruptly, as measured by the Zero-Buffer Benchmark.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same vector extraction could be applied to other abrupt context switches such as topic changes or speaker hand-offs.
  • Repeated steering during extended conversations might be tested to check whether it preserves coherence outside interruption events.
  • Vector-based steering may offer a general route to reduce retraining needs when adapting spoken models to new duplex behaviors.

Load-bearing premise

The hidden-state analysis isolates a single perception vector whose addition reliably shifts the model to the perceptive state without side effects on other behaviors or non-interruption turns.

What would settle it

An experiment in which models given the perception vector show no gain in correctness or initial-word occurrence rate on the Zero-Buffer Benchmark, or show degraded performance on ordinary non-interruption dialogue tasks.

Figures

Figures reproduced from arXiv: 2606.11386 by Alexander H. Liu, Cheng-Kuang Chang, James Glass, Kai-Wei Chang.

Figure 1. Overview of state inertia and activation steering. (a) FD-SLMs process concurrent user and model streams, conditioning on incoming user audio and previous model output tokens to generate text and audio tokens. (b) FD-SLMs coordinate speaking and listening by modulating between generative and perceptive states, tracked by generation and perception affinity. During abrupt interruptions, the model can remain…

Figure 2. Generation affinity S_gen(t) across internal layers of PersonaPlex on the turn-by-turn interaction dataset. We align 100 examples at the end of the user utterance, with t = 0 marking this transition. Values are shown on a logarithmic scale.

Figure 5. Perception affinity S_perc(t) in the interruption condition. The model transitions into the perceptive state after 7–8 timesteps, exhibiting state inertia. Each ZBB example consists of a speech-inducing prompt followed by a zero-buffer query. The speech-inducing prompt is an open-ended question that places the model in a generative state; while the model is actively responding, we abruptly interrupt it wi…

Figure 7. Perception affinity S_perc(t) in the interruption with steering condition. With activation steering, perception affinity recovers immediately after interruption, indicating a faster transition toward the perceptive state. Section 7 (Limitations): Our work has several limitations. First, the steering method relies on detecting the onset of user interruption. We use an energy-based onset detector, but real-world depl…

Figure 8. An example from the turn-by-turn interaction dataset used for logit-lens analysis and…

Figure 9. An example from the dataset for state inertia analysis, illustrating the paired (a) no…

Figure 10. An example from the ZBB dataset, showing the paired (a) no-interruption and (b) inter…

Figure 11. Generation affinity S_gen(t) in the no-interruption condition. The model exits the generative state soon after the user begins speaking, with recovery occurring after approximately 5 timesteps.

Figure 13. PCA projections of hidden representations from generation-dominant and perception…

Figure 14. Logit-lens decoding of PersonaPlex hidden states during a listening segment. Intermedi…

Figure 15. Additional logit-lens decoding example during a listening segment. The user input is…

Figure 16. Logit-lens decoding of PersonaPlex hidden states during a model speaking segment.…

Figure 17. Additional logit-lens decoding example during a model speaking segment. This example…

Figure 19. Correctness and IWOR across steering spans ΔT_steer on PersonaPlex, with the steering layer fixed to 23 and α = 5.5. At ΔT_steer = 3, both metrics achieve the best performance.

Figure 20. Attention recovery after steering. Heatmaps show the average attention weight assigned…

Figure 21. Response quality under false steering triggers. The x-axis represents the expected interval…
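
The Figure 7 excerpt mentions that steering is triggered by an energy-based onset detector. The excerpt gives no details of that detector, so the following is only a generic sketch of the idea: split the incoming audio into short frames, compute each frame's mean energy in decibels, and report the first frame that starts a run of frames above a threshold. The function name, the frame length, the threshold, and the minimum run length are all hypothetical choices for illustration.

```python
import numpy as np


def detect_onset(signal, frame_len=320, threshold_db=-35.0, min_frames=2):
    """Return the index of the first frame of a sustained high-energy run.

    signal: 1-D array of audio samples, assumed to lie roughly in [-1, 1].
    A frame is "active" when its mean energy in dB exceeds threshold_db.
    Returns None if no run of min_frames consecutive active frames exists.
    """
    samples = np.asarray(signal, dtype=float)
    n_frames = len(samples) // frame_len
    frames = samples[: n_frames * frame_len].reshape(n_frames, frame_len)
    # Small constant avoids log10(0) on digital silence.
    energy_db = 10.0 * np.log10(np.mean(frames ** 2, axis=1) + 1e-12)
    run = 0
    for i, active in enumerate(energy_db > threshold_db):
        run = run + 1 if active else 0
        if run >= min_frames:
            return i - min_frames + 1
    return None


if __name__ == "__main__":
    sample_rate = 16_000
    t = np.arange(3200) / sample_rate
    silence = np.zeros(3200)
    tone = 0.5 * np.sin(2 * np.pi * 220.0 * t)
    print(detect_onset(np.concatenate([silence, tone])))
```

Requiring several consecutive active frames trades a little latency for fewer false triggers, which matters here because Figure 21 examines response quality when steering is triggered falsely.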
Original abstract

Full-duplex spoken language models (FD-SLMs) enable seamless speech interaction by allowing models to listen and speak simultaneously, yet the internal mechanism by which they coordinate listening and speaking remains underexplored. We analyze the predictive behavior encoded in FD-SLM hidden representations and find that they exhibit stream-specific predictive patterns: during listening, they preferentially predict the incoming user stream, whereas during speaking, they preferentially predict the model output stream. Building on this observation, we show that FD-SLMs dynamically modulate their internal predictive focus between two states: a generative state aligned with model output generation and a perceptive state aligned with incoming user input. However, this modulation can lag behind abrupt changes in conversational context. During user interruptions, the model remains transiently biased toward the generative state before transitioning into the perceptive state, causing it to miss the beginning of the incoming input. We term this delayed internal transition state inertia. To quantify its downstream impact, we introduce the Zero-Buffer Benchmark (ZBB), a diagnostic benchmark for evaluating immediate interruption comprehension when user speech begins abruptly. We evaluate this setting using response correctness and initial-word occurrence rate (IWOR). Finally, we mitigate state inertia through activation steering with a perception vector, a training-free intervention with little additional computational overhead. Across multiple state-of-the-art FD-SLMs, activation steering substantially improves interruption handling; for example, on PersonaPlex, it improves correctness from 28% to 45% and IWOR from 40% to 72% without any fine-tuning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that full-duplex spoken language models exhibit stream-specific predictive patterns in hidden states (preferring user input during listening and model output during speaking) and dynamically switch between generative and perceptive states, but suffer from 'state inertia' during abrupt user interruptions, remaining transiently biased toward generation and missing initial input. It introduces the Zero-Buffer Benchmark (ZBB) to measure this via response correctness and initial-word occurrence rate (IWOR). The authors propose a training-free activation steering intervention using a 'perception vector' derived from hidden-state contrasts to induce the perceptive state, reporting concrete gains such as on PersonaPlex (correctness 28% to 45%, IWOR 40% to 72%) across multiple FD-SLMs without fine-tuning.

Significance. If the central result holds, the work supplies a lightweight, post-training intervention for improving interruption handling in FD-SLMs, a practically relevant capability for natural spoken dialogue systems. The introduction of the ZBB diagnostic benchmark and the empirical demonstration of numeric gains on defined interruption settings constitute positive contributions. The approach extends activation steering techniques with concrete benchmark numbers.

major comments (2)
  1. [§5 (Evaluation and results)] The reported gains (e.g., PersonaPlex correctness 28%→45%, IWOR 40%→72%) are measured only on the interruption-specific ZBB; no metrics are supplied for non-interruption spoken QA, generation quality on ordinary turns, or latency. This is load-bearing for the central claim that the single perception vector shifts only the generative/perceptive state without side effects.
  2. [§3 (Hidden-state analysis)] The perception vector is constructed from hidden-state differences between listening and speaking streams, but the manuscript provides no ablation on vector construction details (layer selection, token averaging, or contrast method) and reports no error bars or statistical tests on the numeric improvements, leaving the reliability of the state-transition claim unverified.
minor comments (1)
  1. [Abstract] The abstract and introduction introduce multiple novel terms ('state inertia', 'perception vector', 'Zero-Buffer Benchmark') without inline definitions or forward references, which reduces immediate clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address the two major comments point by point below, proposing concrete revisions to strengthen the manuscript.

  1. Referee: [§5 (Evaluation and results)] The reported gains (e.g., PersonaPlex correctness 28%→45%, IWOR 40%→72%) are measured only on the interruption-specific ZBB; no metrics are supplied for non-interruption spoken QA, generation quality on ordinary turns, or latency. This is load-bearing for the central claim that the single perception vector shifts only the generative/perceptive state without side effects.

    Authors: We agree that the current evaluation is centered on the ZBB to isolate the effect of state inertia during abrupt interruptions. The perception vector is constructed specifically from listening-versus-speaking contrasts to target only the state transition. To directly address the concern about potential side effects, we will add experiments in the revision that apply the same steering to non-interruption spoken QA and generation tasks, reporting correctness, generation quality metrics, and latency to confirm that ordinary performance is preserved. revision: yes

  2. Referee: [§3 (Hidden-state analysis)] The perception vector is constructed from hidden-state differences between listening and speaking streams, but the manuscript provides no ablation on vector construction details (layer selection, token averaging, or contrast method) and reports no error bars or statistical tests on the numeric improvements, leaving the reliability of the state-transition claim unverified.

    Authors: We acknowledge the value of ablations and statistical reporting. In the revised manuscript we will add (i) ablations varying the layer(s) used, token-averaging strategy, and contrast formulation, and (ii) results with error bars across multiple random seeds together with statistical significance tests on the reported improvements. revision: yes
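
To make the request for error bars concrete, here is a minimal sketch of a paired bootstrap confidence interval for the change in correctness between the baseline and steered runs on the same benchmark items. It is not the authors' analysis: the function name is hypothetical, and the example uses an invented benchmark of 100 items whose per-item outcomes are arranged only to match the reported 28% and 45% rates.

```python
import numpy as np


def paired_bootstrap_diff(base, steered, n_boot=10_000, seed=0):
    """Bootstrap a 95% interval for mean(steered) - mean(base).

    base, steered: per-item 0/1 correctness for the same items, same order.
    Items are resampled with replacement, keeping each pair together.
    Returns (point_estimate, lower, upper).
    """
    base = np.asarray(base, dtype=float)
    steered = np.asarray(steered, dtype=float)
    rng = np.random.default_rng(seed)
    n = len(base)
    idx = rng.integers(0, n, size=(n_boot, n))
    diffs = steered[idx].mean(axis=1) - base[idx].mean(axis=1)
    lower, upper = np.percentile(diffs, [2.5, 97.5])
    return steered.mean() - base.mean(), lower, upper


if __name__ == "__main__":
    # Hypothetical outcomes: 28 of 100 correct before, 45 of 100 after.
    base = np.array([1] * 28 + [0] * 72)
    steered = np.array([1] * 45 + [0] * 55)
    point, lower, upper = paired_bootstrap_diff(base, steered)
    print(f"gain = {point:.2f}, 95% CI = [{lower:.2f}, {upper:.2f}]")
```

The width of the interval depends on the number of benchmark items and on how the per-item outcomes pair up, neither of which is stated in this summary, so the printed interval says nothing about the paper's actual uncertainty.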

Circularity Check

0 steps flagged

No significant circularity detected.


The paper's core chain—identifying stream-specific predictive patterns in hidden states, defining generative/perceptive states and state inertia from those observations, introducing the ZBB benchmark, and deriving a perception vector via hidden-state contrast for activation steering—does not reduce any reported result to a fitted quantity on the evaluation data or to a self-referential definition. The perception vector is computed from observed differences and applied post-hoc; gains on PersonaPlex (correctness 28%→45%, IWOR 40%→72%) are measured outcomes, not quantities forced by construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The derivation is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 3 invented entities

The central claim rests on the existence of identifiable stream-specific predictive patterns in hidden states and on the assumption that a single linear direction can be extracted and added without side effects. No free parameters are explicitly fitted in the abstract. Three new entities are introduced without external validation.

axioms (2)
  • [domain assumption] FD-SLM hidden representations encode distinct generative and perceptive predictive states that can be read out from activation patterns.
    Stated in the analysis of predictive behavior during listening versus speaking phases.
  • [domain assumption] The transition between these states can be accelerated by a fixed linear intervention in activation space.
    Underlying premise of the activation-steering method.
invented entities (3)
  • state inertia (no independent evidence)
    purpose: Label for the transient bias toward generative state after an interruption begins.
    New term introduced to describe the observed lag.
  • perception vector (no independent evidence)
    purpose: Direction in activation space used for steering the model into listening mode.
    Constructed from the paper's hidden-state analysis.
  • Zero-Buffer Benchmark (no independent evidence)
    purpose: Diagnostic test for immediate interruption comprehension.
    New benchmark introduced in the paper.

pith-pipeline@v0.9.1-grok · 5819 in / 1575 out tokens · 18194 ms · 2026-06-27T13:21:10.818184+00:00 · methodology

