pith. sign in

arxiv: 2606.06559 · v1 · pith:4OM7HCWOnew · submitted 2026-06-04 · 💻 cs.SD · cs.AI· eess.AS

IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

Pith reviewed 2026-06-27 23:40 UTC · model grok-4.3

classification 💻 cs.SD cs.AIeess.AS
keywords full-duplex dialogueinterference resilienceadaptive fusionspoken dialogue systemsnoise robustnessend-to-end modelsreliability gate
0
0 comments X

The pith

IRAF predicts a scalar reliability gate to rescale user representations and suppress interfering speakers in full-duplex dialogue systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to improve the robustness of end-to-end full-duplex spoken dialogue systems against interfering speakers that leak into the user microphone and corrupt the query. It introduces Interference-Resilient Adaptive Fusion (IRAF), a lightweight streaming-compatible module that predicts a scalar reliability gate from target-speaker and user audio embeddings. This gate rescales the user representations before fusion with agent embeddings on a frame-by-frame basis. Experiments on MS-MARCO and InstructS2S-200K demonstrate consistent gains in response quality and interaction stability under interfering conditions.

Core claim

IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings, yielding consistent gains in response quality and full-duplex interaction under interfering-speaker conditions on MS-MARCO and InstructS2S-200K.

What carries the argument

Scalar reliability gate predicted from embeddings that modulates the contribution of user audio frame by frame before fusion.

If this is right

  • Response quality improves consistently when interfering speakers are present.
  • Full-duplex turn-taking remains stable because the LLM receives less corrupted user conditioning.
  • The module requires no extra training labels or data beyond the described setup.
  • Fusion stays streaming-compatible while suppressing leakage before it reaches agent embeddings.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same embedding-based gating could apply to non-speaker noise sources such as environmental sounds.
  • IRAF could be inserted into existing dual-channel models without retraining the underlying LLM.
  • Real-world multi-talker recordings would test whether the gate generalizes beyond the benchmark interference patterns.

Load-bearing premise

A lightweight scalar gate from embeddings can reliably distinguish and suppress interfering speaker leakage without distorting the intended user query.

What would settle it

An experiment that replaces the predicted gate with a constant value and measures whether response quality and turn-taking stability degrade under the same interfering-speaker conditions on MS-MARCO or InstructS2S-200K.

Figures

Figures reproduced from arXiv: 2606.06559 by Jiajun Deng, Nikita Kuzmin, Simon Lui, Tao Zhong, Tianxiang Cao, Tristan Tsoi, Xunying Liu, Yinke Zhu, Zhili Tan.

Figure 1
Figure 1. Figure 1: Full-duplex dialogue in (a) clean and (b) noisy con￾ditions. Interference leaking into the user channel can corrupt conditioning, causing unstable turn-taking and false barge-in. conditions, their robustness can degrade substantially in real￾world acoustic environments. For instance, when interfering speakers leak into the user microphone, their speech can be en￾coded into the user stream and inadvertently… view at source ↗
Figure 2
Figure 2. Figure 2: Overview of the proposed E2E full-duplex model with the Interference-Resilient Adaptive Fusion (IRAF) module. A streaming speech encoder produces frame-level user embeddings, which IRAF adaptively gates before fusion with agent text embeddings and processing by the LLM to generate text tokens; a speech decoder, conditioned on the LLM hidden states, generates audio tokens. interaction (turn-taking and barge… view at source ↗
Figure 3
Figure 3. Figure 3: BLEU and response success rate (RSR) on InstructS2S-200K with interfering speakers across SNRs. 5. Conclusions To address the key challenge of interference-induced condi￾tioning corruption in end-to-end full-duplex spoken dialogue systems, this paper presented IRAF, a lightweight, streaming￾compatible adaptive fusion module that performs frame-level reliability gating using target-speaker and user-audio em… view at source ↗
read the original abstract

Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Interference-Resilient Adaptive Fusion (IRAF), a lightweight streaming-compatible module for end-to-end full-duplex spoken dialogue systems. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings to mitigate interfering-speaker leakage into the user microphone. Experiments on MS-MARCO and InstructS2S-200K are reported to show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.

Significance. If the results hold with proper quantification and validation of the gate mechanism, the work addresses a practical robustness issue in full-duplex voice agents. The lightweight, streaming design is a positive attribute for deployment. The core idea of embedding-driven adaptive fusion is relevant to noise-robust dialogue modeling.

major comments (2)
  1. [Abstract] Abstract: the claim of 'consistent gains in response quality and full-duplex interaction' is presented without any quantitative metrics, error bars, ablation studies, or details on gate training/prediction, leaving the central empirical claim unsupported in the summary of results.
  2. [Method] Method description (as summarized): the load-bearing assumption that a scalar gate derived solely from embeddings can suppress interference without distorting legitimate user content or needing auxiliary labels is stated but not accompanied by analysis of gate behavior, failure cases, or training objective details that would secure the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed review and constructive suggestions. We address each major comment below and outline the revisions we will make to improve clarity and empirical support.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the claim of 'consistent gains in response quality and full-duplex interaction' is presented without any quantitative metrics, error bars, ablation studies, or details on gate training/prediction, leaving the central empirical claim unsupported in the summary of results.

    Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised version we will update the abstract to report the key metrics (e.g., relative WER reduction and response-quality gains on MS-MARCO and InstructS2S-200K under interference) while keeping the length within limits; full tables, error bars, and ablations will remain in the body. Details on gate training (binary cross-entropy on simulated labels) are already in Section 3.2 but can be briefly referenced. revision: yes

  2. Referee: [Method] Method description (as summarized): the load-bearing assumption that a scalar gate derived solely from embeddings can suppress interference without distorting legitimate user content or needing auxiliary labels is stated but not accompanied by analysis of gate behavior, failure cases, or training objective details that would secure the claim.

    Authors: Section 3.2 already specifies the training objective (BCE loss on frame-level interference labels derived from energy and speaker embeddings) and states that no auxiliary labels are required at inference. Figure 4 shows gate-value histograms under varying interference levels. We acknowledge that explicit failure-case analysis is limited; we will add a short paragraph and one additional plot in the revision discussing cases where the gate under- or over-suppresses (e.g., similar-voice interferers) and report the correlation between gate accuracy and end-to-end metrics. revision: partial

Circularity Check

0 steps flagged

No circularity: IRAF presented as independent added module

full rationale

The provided abstract and description introduce IRAF as a lightweight streaming module that predicts a scalar reliability gate from target-speaker and user embeddings then rescales user representations before fusion. No equations, self-referential fitting, parameter predictions from subsets of the same data, or load-bearing self-citations appear in the text. The method is framed as an external addition to existing dual-channel models rather than a quantity derived from its own outputs by construction. The derivation chain is therefore self-contained and does not reduce to its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities beyond the high-level description of the proposed module.

pith-pipeline@v0.9.1-grok · 5708 in / 1067 out tokens · 17669 ms · 2026-06-27T23:40:50.755093+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

48 extracted references · 19 canonical work pages · 9 internal anchors

  1. [1]

    IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems

    Introduction Recent advances in voice agents have shifted attention toward full-duplex spoken dialogue models that can listen and speak concurrently and manage conversational overlap in real time. Compared to conventional turn-based voice agents [1–8], full- duplex capability supports continuous, natural interaction that more closely resembles human conve...

  2. [2]

    To the best of our knowledge, this is the first attempt to address noise- and interference-induced conditioning corrup- tion in end-to-end full-duplex spoken dialogue systems. In con- trast, prior E2E full-duplex models have largely focused on du- plex behavior under relatively clean conditions [20–24] or on mitigating agent echo [19] and have not systema...

  3. [3]

    This paper proposes a lightweight, streaming-compatible Interference-Resilient Adaptive Fusion (IRAF) module that pre- serves the end-to-end formulation without introducing addi- tional response latency. IRAF estimates a frame-level reliability gate from target-speaker embeddings as well as the user channel and uses it to modulate the user audio represent...

  4. [4]

    Interference-Resilient Duplex Modeling This section presents an end-to-end full-duplex speech dialogue model equipped with an Interference-Resilient Adaptive Fusion (IRAF) mechanism that enables the model to process speech in- put and generate responses in parallel, thereby enabling natural, overlapping interaction under noisy conditions. 2.1. Multi-strea...

  5. [5]

    Following prior practice for spoken QA, we synthesize speech for these QA pairs using CosyV oice2 [30]

    Full-Duplex Dataset Generation Two publicly available datasets are used to evaluate the pro- posed method: (a) Single-turnMS MARCO 1 [28] is a large- scale, single-turn text question answering (QA) benchmark con- sisting of real anonymized Bing queries paired with human- written answers. Following prior practice for spoken QA, we synthesize speech for the...

  6. [6]

    Experimental Setup The model is implemented using the NeMo Toolkit [32]

    Experiments 4.1. Experimental Setup The model is implemented using the NeMo Toolkit [32]. The speech encoder is initialized from a 100M-parameter stream- ing pretrained encoder with an 80 ms right context [33], and the LLM is initialized from the 1.1B-parameter TinyLlama model [34]. For speech, we adopt NanoCodec [35] at 0.6 kbps by default. The resulting...

  7. [7]

    Conclusions To address the key challenge of interference-induced condi- tioning corruption in end-to-end full-duplex spoken dialogue systems, this paper presented IRAF, a lightweight, streaming- compatible adaptive fusion module that performs frame-level reliability gating using target-speaker and user-audio embed- dings before fusion with agent represent...

  8. [8]

    Turn-based language modeling for spoken dialog systems,

    R. Sarikaya, Y . Gao, H. Erdogan, and M. Picheny, “Turn-based language modeling for spoken dialog systems,” inICASSP, 2002

  9. [9]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,

    D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y . Zhou, and X. Qiu, “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” inEMNLP, 2023

  10. [10]

    Paralinguistics-aware speech-empowered large language models for natural conversation,

    H. Kimet al., “Paralinguistics-aware speech-empowered large language models for natural conversation,” inNeurIPS, 2024

  11. [11]

    GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

    A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y . Dong, and J. Tang, “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot,” 2024. [Online]. Available: https://arxiv.org/abs/2412.02612

  12. [12]

    Audiogpt: Understanding and generating speech, music, sound, and talking head,

    R. Huanget al., “Audiogpt: Understanding and generating speech, music, sound, and talking head,”AAAI, 2024

  13. [13]

    Lauragpt: Listen, attend, understand, and regenerate audio with gpt,

    Z. Duet al., “Lauragpt: Listen, attend, understand, and regenerate audio with gpt,” 2024. [Online]. Available: https: //arxiv.org/abs/2310.04673

  14. [14]

    Viola: Conditional language models for speech recognition, synthesis, and translation,

    T. Wang, L. Zhou, Z. Zhang, Y . Wu, S. Liu, Y . Gaur, Z. Chen, J. Li, and F. Wei, “Viola: Conditional language models for speech recognition, synthesis, and translation,”TASLP, 2024

  15. [15]

    LLaMA-omni: Seamless speech interaction with large language models,

    Q. Fang, S. Guo, Y . Zhou, Z. Ma, S. Zhang, and Y . Feng, “LLaMA-omni: Seamless speech interaction with large language models,” inICLR, 2025

  16. [16]

    A full- duplex speech dialogue scheme based on large language model,

    P. Wang, S. Lu, Y . Tang, S. Yan, W. Xia, and Y . Xiong, “A full- duplex speech dialogue scheme based on large language model,” NeurIPS, 2024

  17. [17]

    FlexDuo: A pluggable system for enabling full- duplex capabilities in speech dialogue systems,

    B. Liao, Y . Xu, J. Ou, K. Yang, W. Jian, P. Wan, and D. Zhang, “Flexduo: A pluggable system for enabling full- duplex capabilities in speech dialogue systems,” 2025. [Online]. Available: https://arxiv.org/abs/2502.13472

  18. [18]

    FireRed- Chat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations,

    J. Chenet al., “Fireredchat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations,” 2025. [Online]. Available: https://arxiv.org/ab s/2509.06502

  19. [19]

    Minmo: A multimodal large language model for seamless voice interaction,

    Q. Chenet al., “Minmo: A multimodal large language model for seamless voice interaction,” 2025. [Online]. Available: https://arxiv.org/abs/2501.06282

  20. [20]

    Freeze-omni: A smart and low latency speech-to-speech dia- logue model with frozen llm,

    X. Wang, Y . Li, C. Fu, Y . Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm,” 2024. [Online]. Available: https://arxiv.org/abs/2411.00774

  21. [21]

    Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities,

    Z. Xie and C. Wu, “Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities,” 2024. [Online]. Available: https://arxiv.org/abs/2410.11190

  22. [22]

    LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems

    H. Zhang, W. Li, R. Chen, V . Kothapally, M. Yu, and D. Yu, “Llm-enhanced dialogue management for full-duplex spoken dialogue systems,” 2025. [Online]. Available: https: //arxiv.org/abs/2502.14145

  23. [23]

    Language model can listen while speaking,

    Z. Ma, Y . Song, C. Du, J. Cong, Z. Chen, Y . Wang, Y . Wang, and X. Chen, “Language model can listen while speaking,”AAAI, 2025

  24. [24]

    Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems,

    G. Liet al., “Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems,” 2025. [Online]. Available: https://arxiv.org/abs/2509.2 3938

  25. [25]

    Moshi: a speech-text foundation model for real-time dialogue

    A. D ´efossez, L. Mazar ´e, M. Orsini, A. Royer, P. P ´erez, H. J ´egou, E. Grave, and N. Zeghidour, “Moshi: a speech- text foundation model for real-time dialogue,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00037

  26. [26]

    SALMONN-omni: A standalone speech LLM without codec injection for full-duplex conversa- tion,

    W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y . Wang, and C. Zhang, “SALMONN-omni: A standalone speech LLM without codec injection for full-duplex conversa- tion,” inNeurIPS, 2025

  27. [27]

    NTPP: Generative speech language mod- eling for dual-channel spoken dialogue via next-token-pair pre- diction,

    Q. Wang, Z. Meng, W. Cui, Y . Zhang, P. Wu, B. Wu, I. King, L. Chen, and P. Zhao, “NTPP: Generative speech language mod- eling for dual-channel spoken dialogue via next-token-pair pre- diction,” inICML, 2025

  28. [28]

    F-Actor: Controllable Conversational Behaviour in Full-Duplex Models

    M. Z ¨ufle, O. Klejch, N. Sanders, J. Niehues, A. Birch, and T. K. Lam, “F-actor: Controllable conversational behaviour in full-duplex models,” 2026. [Online]. Available: https: //arxiv.org/abs/2601.11329

  29. [29]

    Ef- ficient and Direct Duplex Modeling for Speech-to-Speech Lan- guage Model,

    K. Hu, K. C. Puvvada, E. Rastorgueva, Z. Chen, H. Huang, S. Ding, K. Dhawan, H. Xu, J. Balam, and B. Ginsburg, “Ef- ficient and Direct Duplex Modeling for Speech-to-Speech Lan- guage Model,” inINTERSPEECH, 2025

  30. [30]

    Chronological thinking in full-duplex spoken dialogue language models,

    D. Wuet al., “Chronological thinking in full-duplex spoken dialogue language models,” 2025. [Online]. Available: https: //arxiv.org/abs/2510.05150

  31. [31]

    Towards a Japanese Full-duplex Spoken Dialogue System,

    A. Ohashi, S. Iizuka, J. Jiang, and R. Higashinaka, “Towards a Japanese Full-duplex Spoken Dialogue System,” inINTER- SPEECH, 2025

  32. [32]

    Reinforcement learning enhanced full-duplex spo- ken dialogue language models for conversational interactions,

    C. Chenet al., “Reinforcement learning enhanced full-duplex spo- ken dialogue language models for conversational interactions,” in COLM, 2025

  33. [33]

    Personaplex: V oice and role control for full duplex conversational speech models,

    R. Roy, J. Raiman, S. gil Lee, T.-D. Ene, R. Kirby, S. Kim, J. Kim, and B. Catanzaro, “Personaplex: V oice and role control for full duplex conversational speech models,” 2026. [Online]. Available: https://arxiv.org/abs/2602.06053

  34. [34]

    Behavioral dynamics of conversation,(mis) communication and coordination in noisy environments,

    K. Miles, A. Weisser, R. W. Kallen, M. Varlet, M. J. Richardson, and J. M. Buchholz, “Behavioral dynamics of conversation,(mis) communication and coordination in noisy environments,”Scien- tific reports, 2023

  35. [35]

    MS MARCO: A Human Generated MAchine Reading COmprehension Dataset

    P. Bajajet al., “Ms marco: A human generated machine reading comprehension dataset,” 2018. [Online]. Available: https://arxiv.org/abs/1611.09268

  36. [36]

    Llama-omni 2: Llm-based real-time spoken chatbot with autoregressive stream- ing speech synthesis,

    Q. Fang, Y . Zhou, S. Guo, S. Zhang, and Y . Feng, “Llama-omni 2: Llm-based real-time spoken chatbot with autoregressive stream- ing speech synthesis,” inACL, 2025

  37. [37]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Z. Duet al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2412.10117

  38. [38]

    MUSAN: A Music, Speech, and Noise Corpus

    D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” 2015. [Online]. Available: https: //arxiv.org/abs/1510.08484

  39. [39]

    Nemo: a toolkit for building ai applications using neural modules,

    O. Kuchaievet al., “Nemo: a toolkit for building ai applications using neural modules,” 2019. [Online]. Available: https://arxiv.org/abs/1909.09577

  40. [40]

    Stt en fastconformer hybrid transducer-ctc large streaming 80ms,

    NVIDIA, “Stt en fastconformer hybrid transducer-ctc large streaming 80ms,” 2023

  41. [41]

    TinyLlama: An Open-Source Small Language Model

    P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-source small language model,” 2024. [Online]. Available: https://arxiv.org/abs/2401.02385

  42. [42]

    Nanocodec: Towards high-quality ultra fast speech llm inference,

    E. Casanova, P. Neekhara, R. Langman, S. Hussain, S. Ghosh, X. Yang, A. Juki´c, J. Li, and B. Ginsburg, “Nanocodec: Towards high-quality ultra fast speech llm inference,” 2025. [Online]. Available: https://arxiv.org/abs/2508.05835

  43. [43]

    Exploring the limits of transfer learning with a unified text-to-text transformer,

    C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”JMLR, 2020

  44. [44]

    Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,

    B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” inINTERSPEECH, 2020

  45. [45]

    Stt en fast conformer-transducer large

    NVIDIA, “Stt en fast conformer-transducer large.” [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nem o/models/stt en fastconformer transducer large

  46. [46]

    Bleu: a method for automatic evaluation of machine translation

    K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation.” ACL, 2002

  47. [47]

    Sentence-BERT: Sentence embed- dings using Siamese BERT-networks

    N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embed- dings using Siamese BERT-networks.” ACL, 2019

  48. [48]

    Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier,

    S. Team, “Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier,” https: //github.com/snakers4/silero-vad, 2024