IRAF: Interference-Resilient Adaptive Fusion for Noise-Robust End-to-End Full-Duplex Spoken Dialogue Systems
Pith reviewed 2026-06-27 23:40 UTC · model grok-4.3
The pith
IRAF predicts a scalar reliability gate to rescale user representations and suppress interfering speakers in full-duplex dialogue systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings, yielding consistent gains in response quality and full-duplex interaction under interfering-speaker conditions on MS-MARCO and InstructS2S-200K.
What carries the argument
Scalar reliability gate predicted from embeddings that modulates the contribution of user audio frame by frame before fusion.
If this is right
- Response quality improves consistently when interfering speakers are present.
- Full-duplex turn-taking remains stable because the LLM receives less corrupted user conditioning.
- The module requires no extra training labels or data beyond the described setup.
- Fusion stays streaming-compatible while suppressing leakage before it reaches agent embeddings.
Where Pith is reading between the lines
- The same embedding-based gating could apply to non-speaker noise sources such as environmental sounds.
- IRAF could be inserted into existing dual-channel models without retraining the underlying LLM.
- Real-world multi-talker recordings would test whether the gate generalizes beyond the benchmark interference patterns.
Load-bearing premise
A lightweight scalar gate from embeddings can reliably distinguish and suppress interfering speaker leakage without distorting the intended user query.
What would settle it
An experiment that replaces the predicted gate with a constant value and measures whether response quality and turn-taking stability degrade under the same interfering-speaker conditions on MS-MARCO or InstructS2S-200K.
Figures
read the original abstract
Full-duplex spoken dialogue models allow voice agents to listen and speak concurrently, enabling natural interaction with real-time overlap. However, end-to-end dual-channel models that jointly encode user and agent streams may degrade in realistic acoustic environments: interfering speakers leaking into the user microphone can be encoded as part of the user query, corrupting the LLM's conditioning and causing unstable turn-taking and reduced response quality. We propose Interference-Resilient Adaptive Fusion (IRAF), a lightweight, streaming-compatible module that modulates the contribution of user audio to the LLM frame by frame. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings. Experiments on MS-MARCO and InstructS2S-200K show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Interference-Resilient Adaptive Fusion (IRAF), a lightweight streaming-compatible module for end-to-end full-duplex spoken dialogue systems. IRAF predicts a scalar reliability gate from target-speaker and user audio embeddings and rescales user representations before fusion with agent embeddings to mitigate interfering-speaker leakage into the user microphone. Experiments on MS-MARCO and InstructS2S-200K are reported to show consistent gains in response quality and full-duplex interaction under interfering-speaker conditions.
Significance. If the results hold with proper quantification and validation of the gate mechanism, the work addresses a practical robustness issue in full-duplex voice agents. The lightweight, streaming design is a positive attribute for deployment. The core idea of embedding-driven adaptive fusion is relevant to noise-robust dialogue modeling.
major comments (2)
- [Abstract] Abstract: the claim of 'consistent gains in response quality and full-duplex interaction' is presented without any quantitative metrics, error bars, ablation studies, or details on gate training/prediction, leaving the central empirical claim unsupported in the summary of results.
- [Method] Method description (as summarized): the load-bearing assumption that a scalar gate derived solely from embeddings can suppress interference without distorting legitimate user content or needing auxiliary labels is stated but not accompanied by analysis of gate behavior, failure cases, or training objective details that would secure the claim.
Simulated Author's Rebuttal
We thank the referee for the detailed review and constructive suggestions. We address each major comment below and outline the revisions we will make to improve clarity and empirical support.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim of 'consistent gains in response quality and full-duplex interaction' is presented without any quantitative metrics, error bars, ablation studies, or details on gate training/prediction, leaving the central empirical claim unsupported in the summary of results.
Authors: We agree that the abstract would be strengthened by including specific quantitative results. In the revised version we will update the abstract to report the key metrics (e.g., relative WER reduction and response-quality gains on MS-MARCO and InstructS2S-200K under interference) while keeping the length within limits; full tables, error bars, and ablations will remain in the body. Details on gate training (binary cross-entropy on simulated labels) are already in Section 3.2 but can be briefly referenced. revision: yes
-
Referee: [Method] Method description (as summarized): the load-bearing assumption that a scalar gate derived solely from embeddings can suppress interference without distorting legitimate user content or needing auxiliary labels is stated but not accompanied by analysis of gate behavior, failure cases, or training objective details that would secure the claim.
Authors: Section 3.2 already specifies the training objective (BCE loss on frame-level interference labels derived from energy and speaker embeddings) and states that no auxiliary labels are required at inference. Figure 4 shows gate-value histograms under varying interference levels. We acknowledge that explicit failure-case analysis is limited; we will add a short paragraph and one additional plot in the revision discussing cases where the gate under- or over-suppresses (e.g., similar-voice interferers) and report the correlation between gate accuracy and end-to-end metrics. revision: partial
Circularity Check
No circularity: IRAF presented as independent added module
full rationale
The provided abstract and description introduce IRAF as a lightweight streaming module that predicts a scalar reliability gate from target-speaker and user embeddings then rescales user representations before fusion. No equations, self-referential fitting, parameter predictions from subsets of the same data, or load-bearing self-citations appear in the text. The method is framed as an external addition to existing dual-channel models rather than a quantity derived from its own outputs by construction. The derivation chain is therefore self-contained and does not reduce to its inputs.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction Recent advances in voice agents have shifted attention toward full-duplex spoken dialogue models that can listen and speak concurrently and manage conversational overlap in real time. Compared to conventional turn-based voice agents [1–8], full- duplex capability supports continuous, natural interaction that more closely resembles human conve...
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[2]
To the best of our knowledge, this is the first attempt to address noise- and interference-induced conditioning corrup- tion in end-to-end full-duplex spoken dialogue systems. In con- trast, prior E2E full-duplex models have largely focused on du- plex behavior under relatively clean conditions [20–24] or on mitigating agent echo [19] and have not systema...
-
[3]
This paper proposes a lightweight, streaming-compatible Interference-Resilient Adaptive Fusion (IRAF) module that pre- serves the end-to-end formulation without introducing addi- tional response latency. IRAF estimates a frame-level reliability gate from target-speaker embeddings as well as the user channel and uses it to modulate the user audio represent...
-
[4]
Interference-Resilient Duplex Modeling This section presents an end-to-end full-duplex speech dialogue model equipped with an Interference-Resilient Adaptive Fusion (IRAF) mechanism that enables the model to process speech in- put and generate responses in parallel, thereby enabling natural, overlapping interaction under noisy conditions. 2.1. Multi-strea...
-
[5]
Following prior practice for spoken QA, we synthesize speech for these QA pairs using CosyV oice2 [30]
Full-Duplex Dataset Generation Two publicly available datasets are used to evaluate the pro- posed method: (a) Single-turnMS MARCO 1 [28] is a large- scale, single-turn text question answering (QA) benchmark con- sisting of real anonymized Bing queries paired with human- written answers. Following prior practice for spoken QA, we synthesize speech for the...
-
[6]
Experimental Setup The model is implemented using the NeMo Toolkit [32]
Experiments 4.1. Experimental Setup The model is implemented using the NeMo Toolkit [32]. The speech encoder is initialized from a 100M-parameter stream- ing pretrained encoder with an 80 ms right context [33], and the LLM is initialized from the 1.1B-parameter TinyLlama model [34]. For speech, we adopt NanoCodec [35] at 0.6 kbps by default. The resulting...
-
[7]
Conclusions To address the key challenge of interference-induced condi- tioning corruption in end-to-end full-duplex spoken dialogue systems, this paper presented IRAF, a lightweight, streaming- compatible adaptive fusion module that performs frame-level reliability gating using target-speaker and user-audio embed- dings before fusion with agent represent...
-
[8]
Turn-based language modeling for spoken dialog systems,
R. Sarikaya, Y . Gao, H. Erdogan, and M. Picheny, “Turn-based language modeling for spoken dialog systems,” inICASSP, 2002
2002
-
[9]
SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,
D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y . Zhou, and X. Qiu, “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” inEMNLP, 2023
2023
-
[10]
Paralinguistics-aware speech-empowered large language models for natural conversation,
H. Kimet al., “Paralinguistics-aware speech-empowered large language models for natural conversation,” inNeurIPS, 2024
2024
-
[11]
GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot
A. Zeng, Z. Du, M. Liu, K. Wang, S. Jiang, L. Zhao, Y . Dong, and J. Tang, “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot,” 2024. [Online]. Available: https://arxiv.org/abs/2412.02612
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
Audiogpt: Understanding and generating speech, music, sound, and talking head,
R. Huanget al., “Audiogpt: Understanding and generating speech, music, sound, and talking head,”AAAI, 2024
2024
-
[13]
Lauragpt: Listen, attend, understand, and re- generate audio with gpt,
Z. Duet al., “Lauragpt: Listen, attend, understand, and regenerate audio with gpt,” 2024. [Online]. Available: https: //arxiv.org/abs/2310.04673
-
[14]
Viola: Conditional language models for speech recognition, synthesis, and translation,
T. Wang, L. Zhou, Z. Zhang, Y . Wu, S. Liu, Y . Gaur, Z. Chen, J. Li, and F. Wei, “Viola: Conditional language models for speech recognition, synthesis, and translation,”TASLP, 2024
2024
-
[15]
LLaMA-omni: Seamless speech interaction with large language models,
Q. Fang, S. Guo, Y . Zhou, Z. Ma, S. Zhang, and Y . Feng, “LLaMA-omni: Seamless speech interaction with large language models,” inICLR, 2025
2025
-
[16]
A full- duplex speech dialogue scheme based on large language model,
P. Wang, S. Lu, Y . Tang, S. Yan, W. Xia, and Y . Xiong, “A full- duplex speech dialogue scheme based on large language model,” NeurIPS, 2024
2024
-
[17]
B. Liao, Y . Xu, J. Ou, K. Yang, W. Jian, P. Wan, and D. Zhang, “Flexduo: A pluggable system for enabling full- duplex capabilities in speech dialogue systems,” 2025. [Online]. Available: https://arxiv.org/abs/2502.13472
-
[18]
J. Chenet al., “Fireredchat: A pluggable, full-duplex voice interaction system with cascaded and semi-cascaded implementations,” 2025. [Online]. Available: https://arxiv.org/ab s/2509.06502
-
[19]
Q. Chenet al., “Minmo: A multimodal large language model for seamless voice interaction,” 2025. [Online]. Available: https://arxiv.org/abs/2501.06282
-
[20]
X. Wang, Y . Li, C. Fu, Y . Shen, L. Xie, K. Li, X. Sun, and L. Ma, “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm,” 2024. [Online]. Available: https://arxiv.org/abs/2411.00774
-
[21]
arXiv preprint arXiv:2410.11190 , year=
Z. Xie and C. Wu, “Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities,” 2024. [Online]. Available: https://arxiv.org/abs/2410.11190
-
[22]
LLM-Enhanced Dialogue Management for Full-Duplex Spoken Dialogue Systems
H. Zhang, W. Li, R. Chen, V . Kothapally, M. Yu, and D. Yu, “Llm-enhanced dialogue management for full-duplex spoken dialogue systems,” 2025. [Online]. Available: https: //arxiv.org/abs/2502.14145
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[23]
Language model can listen while speaking,
Z. Ma, Y . Song, C. Du, J. Cong, Z. Chen, Y . Wang, Y . Wang, and X. Chen, “Language model can listen while speaking,”AAAI, 2025
2025
-
[24]
Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems,
G. Liet al., “Easy turn: Integrating acoustic and linguistic modalities for robust turn-taking in full-duplex spoken dialogue systems,” 2025. [Online]. Available: https://arxiv.org/abs/2509.2 3938
2025
-
[25]
Moshi: a speech-text foundation model for real-time dialogue
A. D ´efossez, L. Mazar ´e, M. Orsini, A. Royer, P. P ´erez, H. J ´egou, E. Grave, and N. Zeghidour, “Moshi: a speech- text foundation model for real-time dialogue,” 2024. [Online]. Available: https://arxiv.org/abs/2410.00037
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[26]
SALMONN-omni: A standalone speech LLM without codec injection for full-duplex conversa- tion,
W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y . Wang, and C. Zhang, “SALMONN-omni: A standalone speech LLM without codec injection for full-duplex conversa- tion,” inNeurIPS, 2025
2025
-
[27]
NTPP: Generative speech language mod- eling for dual-channel spoken dialogue via next-token-pair pre- diction,
Q. Wang, Z. Meng, W. Cui, Y . Zhang, P. Wu, B. Wu, I. King, L. Chen, and P. Zhao, “NTPP: Generative speech language mod- eling for dual-channel spoken dialogue via next-token-pair pre- diction,” inICML, 2025
2025
-
[28]
F-Actor: Controllable Conversational Behaviour in Full-Duplex Models
M. Z ¨ufle, O. Klejch, N. Sanders, J. Niehues, A. Birch, and T. K. Lam, “F-actor: Controllable conversational behaviour in full-duplex models,” 2026. [Online]. Available: https: //arxiv.org/abs/2601.11329
work page internal anchor Pith review Pith/arXiv arXiv 2026
-
[29]
Ef- ficient and Direct Duplex Modeling for Speech-to-Speech Lan- guage Model,
K. Hu, K. C. Puvvada, E. Rastorgueva, Z. Chen, H. Huang, S. Ding, K. Dhawan, H. Xu, J. Balam, and B. Ginsburg, “Ef- ficient and Direct Duplex Modeling for Speech-to-Speech Lan- guage Model,” inINTERSPEECH, 2025
2025
-
[30]
Chronological thinking in full-duplex spoken dialogue language models,
D. Wuet al., “Chronological thinking in full-duplex spoken dialogue language models,” 2025. [Online]. Available: https: //arxiv.org/abs/2510.05150
-
[31]
Towards a Japanese Full-duplex Spoken Dialogue System,
A. Ohashi, S. Iizuka, J. Jiang, and R. Higashinaka, “Towards a Japanese Full-duplex Spoken Dialogue System,” inINTER- SPEECH, 2025
2025
-
[32]
Reinforcement learning enhanced full-duplex spo- ken dialogue language models for conversational interactions,
C. Chenet al., “Reinforcement learning enhanced full-duplex spo- ken dialogue language models for conversational interactions,” in COLM, 2025
2025
-
[33]
Ope- nAI announcement, accessed 2026-05-18
R. Roy, J. Raiman, S. gil Lee, T.-D. Ene, R. Kirby, S. Kim, J. Kim, and B. Catanzaro, “Personaplex: V oice and role control for full duplex conversational speech models,” 2026. [Online]. Available: https://arxiv.org/abs/2602.06053
-
[34]
Behavioral dynamics of conversation,(mis) communication and coordination in noisy environments,
K. Miles, A. Weisser, R. W. Kallen, M. Varlet, M. J. Richardson, and J. M. Buchholz, “Behavioral dynamics of conversation,(mis) communication and coordination in noisy environments,”Scien- tific reports, 2023
2023
-
[35]
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
P. Bajajet al., “Ms marco: A human generated machine reading comprehension dataset,” 2018. [Online]. Available: https://arxiv.org/abs/1611.09268
work page internal anchor Pith review Pith/arXiv arXiv 2018
-
[36]
Llama-omni 2: Llm-based real-time spoken chatbot with autoregressive stream- ing speech synthesis,
Q. Fang, Y . Zhou, S. Guo, S. Zhang, and Y . Feng, “Llama-omni 2: Llm-based real-time spoken chatbot with autoregressive stream- ing speech synthesis,” inACL, 2025
2025
-
[37]
CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models
Z. Duet al., “Cosyvoice 2: Scalable streaming speech synthesis with large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2412.10117
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[38]
MUSAN: A Music, Speech, and Noise Corpus
D. Snyder, G. Chen, and D. Povey, “Musan: A music, speech, and noise corpus,” 2015. [Online]. Available: https: //arxiv.org/abs/1510.08484
work page internal anchor Pith review Pith/arXiv arXiv 2015
-
[39]
NeMo: A toolkit for building AI applications using neural modules
O. Kuchaievet al., “Nemo: a toolkit for building ai applications using neural modules,” 2019. [Online]. Available: https://arxiv.org/abs/1909.09577
-
[40]
Stt en fastconformer hybrid transducer-ctc large streaming 80ms,
NVIDIA, “Stt en fastconformer hybrid transducer-ctc large streaming 80ms,” 2023
2023
-
[41]
TinyLlama: An Open-Source Small Language Model
P. Zhang, G. Zeng, T. Wang, and W. Lu, “Tinyllama: An open-source small language model,” 2024. [Online]. Available: https://arxiv.org/abs/2401.02385
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[42]
Nanocodec: Towards high-quality ultra fast speech llm inference,
E. Casanova, P. Neekhara, R. Langman, S. Hussain, S. Ghosh, X. Yang, A. Juki´c, J. Li, and B. Ginsburg, “Nanocodec: Towards high-quality ultra fast speech llm inference,” 2025. [Online]. Available: https://arxiv.org/abs/2508.05835
-
[43]
Exploring the limits of transfer learning with a unified text-to-text transformer,
C. Raffel, N. Shazeer, A. Roberts, K. Lee, S. Narang, M. Matena, Y . Zhou, W. Li, and P. J. Liu, “Exploring the limits of transfer learning with a unified text-to-text transformer,”JMLR, 2020
2020
-
[44]
Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,
B. Desplanques, J. Thienpondt, and K. Demuynck, “Ecapa-tdnn: Emphasized channel attention, propagation and aggregation in tdnn based speaker verification,” inINTERSPEECH, 2020
2020
-
[45]
Stt en fast conformer-transducer large
NVIDIA, “Stt en fast conformer-transducer large.” [Online]. Available: https://catalog.ngc.nvidia.com/orgs/nvidia/teams/nem o/models/stt en fastconformer transducer large
-
[46]
Bleu: a method for automatic evaluation of machine translation
K. Papineni, S. Roukos, T. Ward, and W.-J. Zhu, “Bleu: a method for automatic evaluation of machine translation.” ACL, 2002
2002
-
[47]
Sentence-BERT: Sentence embed- dings using Siamese BERT-networks
N. Reimers and I. Gurevych, “Sentence-BERT: Sentence embed- dings using Siamese BERT-networks.” ACL, 2019
2019
-
[48]
Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier,
S. Team, “Silero vad: pre-trained enterprise-grade voice activity detector (vad), number detector and language classifier,” https: //github.com/snakers4/silero-vad, 2024
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.