pith. machine review for the scientific record.

arxiv: 2604.23295 · v1 · submitted 2026-04-25 · 💻 cs.CL · cs.AI

Recognition: unknown

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Bhaskar Singh, Pranav Sharma, Shobhit Banga

Authors on Pith: no claims yet.

Pith reviewed 2026-05-08 08:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords full-duplex dialogue systems · Hindi spoken dialogue · conversational turn-taking · spontaneous speech data · tokenizer adaptation · duplex speech modeling · real-world conversations · Hindi conversational AI

The pith

The first open full-duplex spoken dialogue system for Hindi learns turn-taking and overlaps directly from 26,000 hours of real conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build an open and reproducible full-duplex spoken dialogue system for Hindi that can model interruptions, overlaps, and backchannels the way people actually talk. It achieves this by adapting a duplex speech architecture with a custom Hindi tokenizer, keeping the original audio components, and training on a large collection of spontaneous conversations recorded with separate channels for each speaker. A sympathetic reader would care because most dialogue systems still force rigid turn-taking that feels unnatural, and this work shows a path to more fluid voice interactions in a major world language. The method uses a two-stage process of broad pre-training on the full dataset followed by fine-tuning on a smaller conversational subset, with results checked via prompted continuations judged both automatically and by humans.

Core claim

By replacing the original English tokenizer with a custom Hindi one, reinitializing only the text-vocabulary-dependent parameters, retaining the pre-trained audio components, and training on 26,000 hours of real spontaneous conversations from 14,695 speakers recorded on separate channels, the adapted model learns to produce natural and meaningful full-duplex conversational behavior in Hindi, as shown by evaluation on prompted dialogue continuation using automatic metrics and human judgments.
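
To make that adaptation step concrete, here is a minimal sketch in PyTorch: swap in the new Hindi vocabulary, reinitialize only the vocabulary-dependent modules, and freeze the pre-trained audio components. Every module name and size below is a hypothetical stand-in, not Moshi's actual interface.

```python
# Hedged sketch of the adaptation in the core claim. The toy model and its
# attribute names (audio_encoder, text_embedding, text_head) are invented
# stand-ins for the corresponding parts of a duplex architecture like Moshi.
import torch.nn as nn

class ToyDuplexModel(nn.Module):
    def __init__(self, text_vocab: int, d_model: int = 512, audio_codes: int = 2048):
        super().__init__()
        self.audio_encoder = nn.Embedding(audio_codes, d_model)  # pre-trained, retained
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_embedding = nn.Embedding(text_vocab, d_model)  # vocab-dependent
        self.text_head = nn.Linear(d_model, text_vocab)          # vocab-dependent

def adapt_to_hindi(model: ToyDuplexModel, hindi_vocab_size: int) -> ToyDuplexModel:
    d_model = model.text_embedding.embedding_dim
    # Reinitialize only the text-vocabulary-dependent parameters for the new
    # Hindi (e.g., SentencePiece) vocabulary; all other weights are kept.
    model.text_embedding = nn.Embedding(hindi_vocab_size, d_model)
    model.text_head = nn.Linear(d_model, hindi_vocab_size)
    # Freeze the retained, pre-trained audio components.
    for p in model.audio_encoder.parameters():
        p.requires_grad = False
    return model

model = adapt_to_hindi(ToyDuplexModel(text_vocab=32_000), hindi_vocab_size=48_000)
```

The vocabulary sizes are placeholders; the paper does not state them here.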

What carries the argument

The adaptation of a duplex speech architecture with a Hindi tokenizer and two-stage training on separate-channel spontaneous conversation data, which lets the model observe and reproduce turn-taking and overlap patterns directly from natural interactions.
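
The value of separate channels is easy to see in miniature: once each speaker's speech intervals are known (say, from per-channel voice activity detection), overlap time can be read off directly, with no source separation. The interval format and timestamps below are illustrative assumptions, not the paper's pipeline.

```python
# Hedged sketch: overlap statistics from two speakers' speech intervals,
# as separate-channel recordings make possible. Timestamps are invented.
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) of one speech segment

def overlap_seconds(a: List[Interval], b: List[Interval]) -> float:
    """Total time both speakers are talking at once (naive pairwise scan)."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

speaker_a = [(0.0, 4.2), (6.0, 9.5)]
speaker_b = [(3.8, 6.3), (9.3, 12.0)]  # brief overlaps at the turn boundaries
print(overlap_seconds(speaker_a, speaker_b))  # 0.4 + 0.3 + 0.2 ≈ 0.9 s
```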

If this is right

  • The system generates natural full-duplex behaviors such as interruptions and backchannels in Hindi.
  • It provides an initial foundation for real-time spoken dialogue applications in Hindi.
  • The same adaptation method can extend to other Indian languages.
  • Direct training on separate speaker channels captures overlap patterns without synthetic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separate-channel recordings may give clearer signals for learning simultaneous speech than mixed audio would.
  • The approach could speed development of conversational systems for other languages that lack large native datasets by reusing audio features.
  • Practical voice interfaces built on this model might respond more fluidly in everyday Hindi use.

Load-bearing premise

Retaining the pre-trained audio components from the English model while only updating text-related parameters will allow effective learning of Hindi conversational behaviors from the collected data.
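
One way to stress this premise, in the spirit of the probing analysis the simulated rebuttal proposes below, is a linear probe on the frozen audio representations: if a simple classifier trained on them recovers a Hindi prosodic distinction, the frozen stack plausibly encodes what turn-taking needs. Everything here (features, the binary prosodic label, sizes) is a random stand-in for illustration.

```python
# Hedged sketch of a linear probe on frozen audio representations. The
# features and labels are random placeholders, not real model outputs.
import torch
import torch.nn as nn

frozen_features = torch.randn(256, 512)  # stand-in for frozen audio embeddings
labels = torch.randint(0, 2, (256,))     # hypothetical prosodic label (e.g., turn-final vs. not)

probe = nn.Linear(512, 2)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(probe(frozen_features), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

accuracy = (probe(frozen_features).argmax(dim=1) == labels).float().mean()
print(float(accuracy))  # well above chance would support the transfer premise
```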

What would settle it

The claim would be undercut by human listeners in blind tests rating the model's Hindi dialogue continuations as less natural in turn-taking or overlap handling than real recorded conversations, or by automatic metrics showing clear failures to maintain coherence during simultaneous speech.
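
A minimal sketch of how the listening half of that test could be scored, assuming paired blind A/B choices between a model continuation and a matched real recording, checked with a two-sided binomial test; the counts are invented.

```python
# Hedged sketch: do blind listeners reliably pick the real recording over the
# model's continuation? Counts below are invented for illustration.
from scipy.stats import binomtest

n_trials = 200      # blind A/B comparisons
prefer_real = 121   # trials where the listener chose the real recording
result = binomtest(prefer_real, n_trials, p=0.5)
print(result.pvalue)  # a small p-value means listeners can tell them apart
```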

Figures

Figures reproduced from arXiv: 2604.23295 by Bhaskar Singh, Pranav Sharma, Shobhit Banga.

Figure 1: Training loss during pre-training on 26,000 hours of data (view at source ↗).
Figure 2: Evaluation loss during fine-tuning; text loss (view at source ↗).
Original abstract

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Human-1, the first open and reproducible full-duplex spoken dialogue system for Hindi, by adapting the Moshi architecture. It replaces the English tokenizer with a custom Hindi one, reinitializes only text-vocabulary-dependent parameters while retaining the pre-trained English audio components, and trains on 26,000 hours of spontaneous conversations collected from 14,695 speakers with separate speaker channels. A two-stage recipe (large-scale pre-training followed by fine-tuning on 1,000 hours) is used, with evaluation via prompted dialogue continuation using automatic metrics and human judgments to demonstrate natural and meaningful full-duplex behaviors such as overlaps and turn-taking.

Significance. If substantiated with quantitative evidence, this would be a notable contribution as the first open full-duplex system for Hindi, an under-resourced language for such modeling. The large-scale data collection from real conversations with separate channels and the open, reproducible approach are strengths that could enable further work on conversational dynamics in Indian languages.

major comments (2)
  1. [Abstract and §4] Evaluation: No specific quantitative results are reported for the automatic metrics or human judgments, nor are there baseline comparisons or error analysis. This leaves the central claim that the model 'generates natural and meaningful full-duplex conversational behaviour' without verifiable support.
  2. [§3] Training Recipe: Retaining the frozen English-pretrained audio encoder/decoder while reinitializing only the text parameters assumes these components transfer adequately to Hindi acoustics. Hindi differs in phoneme inventory, syllable structure, and prosody, which are critical for modeling overlaps and backchannels; the manuscript provides no ablation studies or analysis demonstrating that the frozen audio stack produces suitable representations for these phenomena from the Hindi data.
minor comments (2)
  1. [§2] Data collection details (speaker demographics, recording conditions, and quality assurance for the 26,000 hours) could be expanded to strengthen reproducibility claims.
  2. [§4] The prompted continuation evaluation paradigm should be described with more concrete examples of prompts and generated outputs to clarify how full-duplex behaviors are assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We agree that the current manuscript requires strengthening in the areas of quantitative reporting and analysis of the audio component transfer. We will make the revisions outlined below to address both major comments.

Point-by-point responses
  1. Referee: [Abstract and §4] Evaluation: No specific quantitative results are reported for the automatic metrics or human judgments, nor are there baseline comparisons or error analysis. This leaves the central claim that the model 'generates natural and meaningful full-duplex conversational behaviour' without verifiable support.

    Authors: We acknowledge that the manuscript does not report specific numerical values for the automatic metrics (e.g., overlap rate, backchannel frequency, turn-taking accuracy) or human judgment scores, nor does it include baseline comparisons or error analysis in §4 or the abstract. The evaluation section describes the prompted dialogue continuation paradigm and states that results demonstrate natural full-duplex behavior, but we agree this lacks the verifiable quantitative support needed. In the revised manuscript we will add a results table with concrete metric values, human evaluation scores (with inter-annotator agreement), comparisons against at least one baseline where feasible, and a brief error analysis of failure cases in overlaps and backchannels. revision: yes

  2. Referee: [§3] Training Recipe: Retaining the frozen English-pretrained audio encoder/decoder while reinitializing only the text parameters assumes these components transfer adequately to Hindi acoustics. Hindi differs in phoneme inventory, syllable structure, and prosody, which are critical for modeling overlaps and backchannels; the manuscript provides no ablation studies or analysis demonstrating that the frozen audio stack produces suitable representations for these phenomena from the Hindi data.

    Authors: The design choice to freeze the English-pretrained audio encoder/decoder was motivated by the hypothesis that low-level acoustic and prosodic features relevant to turn-taking and overlaps are largely language-independent, allowing the model to leverage the large-scale English pretraining while adapting only the text vocabulary and related parameters for Hindi. However, we recognize that the manuscript provides no ablation studies or direct analysis (e.g., representation similarity or probing for Hindi-specific prosody) to validate this transfer for overlap and backchannel modeling. We will add a short analysis subsection in §3 examining the frozen audio representations on a Hindi validation set and include at least one targeted ablation (e.g., unfreezing the audio stack on a smaller scale) to quantify the impact on full-duplex metrics. revision: partial
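
A minimal sketch of the ablation promised in this second response: unfreeze the audio stack and train it alongside the reinitialized text parameters, here with a smaller learning rate for the pre-trained weights. The stand-in modules and learning rates are invented; AdamW is assumed from the paper's citation of decoupled weight decay [18].

```python
# Hedged sketch of the proposed unfreezing ablation with per-group learning
# rates. Both modules are toy stand-ins for the real parameter groups.
import torch.nn as nn
from torch.optim import AdamW

audio_stack = nn.Linear(512, 512)      # stand-in for the pre-trained audio encoder/decoder
text_modules = nn.Linear(512, 48_000)  # stand-in for the reinitialized Hindi text parameters

for p in audio_stack.parameters():     # ablation: unfreeze instead of keeping frozen
    p.requires_grad = True

optimizer = AdamW([
    {"params": audio_stack.parameters(), "lr": 1e-5},   # gentle updates to pre-trained weights
    {"params": text_modules.parameters(), "lr": 1e-4},  # faster learning for new parameters
])
```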

Circularity Check

0 steps flagged

No circularity: empirical adaptation of existing architecture on new data

full rationale

The paper presents an engineering adaptation of the Moshi model to Hindi via tokenizer replacement, retention of pretrained audio components, and training on a newly collected 26k-hour Hindi conversation dataset. No equations, predictions, or uniqueness theorems are introduced that reduce to fitted parameters, self-definitions, or self-citations. The central claims rest on data collection, a two-stage training recipe, and downstream evaluation metrics rather than any load-bearing derivation that loops back to its own inputs. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of Moshi's pre-trained audio components to Hindi and on the assumption that the collected 26,000-hour dataset with separate channels sufficiently captures natural duplex patterns.

axioms (1)
  • domain assumption: Pre-trained audio components from an English duplex model can be retained and combined with a new Hindi text tokenizer to learn conversational turn-taking in Hindi.
    Invoked when describing the adaptation strategy and retention of audio components.

pith-pipeline@v0.9.0 · 5494 in / 1519 out tokens · 35345 ms · 2026-05-08T08:11:33.717109+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1]

    Generative spoken dialogue language modeling

    T. A. Nguyen, E. Kharitonov, J. Copet, Y. Adi, W.-N. Hsu, A. Elkahky, P. Tomasello, R. Algayres, B. Sagot, A. Mohamed, and E. Dupoux, “Generative spoken dialogue language modeling,” Transactions of the Association for Computational Linguistics, pp. 250–266, 2023.

  2. [2]

    A full-duplex speech dialogue scheme based on large language model

    P. Wang, S. Lu, Y. Tang, S. Yan, W. Xia, and Y. Xiong, “A full-duplex speech dialogue scheme based on large language model,” in Proc. NeurIPS, 2024.

  3. [3]

    Language model can listen while speaking

    Z. Ma, Y. Song, C. Du, J. Cong, Z. Chen, Y. Wang, Y. Wang, and X. Chen, “Language model can listen while speaking,” arXiv preprint arXiv:2408.02622, 2024.

  4. [4]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024.

  5. [5]

    Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents

    B. Veluri, B. N. Peloquin, B. Yu, H. Gong, and S. Gollakota, “Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents,” in Proc. EMNLP, pp. 21390–21402, 2024.

  6. [6]

    Omniflatten: An end-to-end GPT model for seamless voice conversation

    Q. Zhang, L. Cheng, C. Deng, Q. Chen, W. Wang, S. Zheng, J. Liu, H. Yu, and C. Tan, “Omniflatten: An end-to-end GPT model for seamless voice conversation,” arXiv preprint arXiv:2410.17799, 2024.

  7. [7]

    SALMONN-omni: A codec-free LLM for full-duplex speech understanding and generation

    W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y. Wang, and C. Zhang, “SALMONN-omni: A codec-free LLM for full-duplex speech understanding and generation,” arXiv preprint arXiv:2411.18138, 2024.

  8. [8]

    Simultaneous talk—from the perspective of floor management of English and Japanese speakers

    R. Hayashi, “Simultaneous talk—from the perspective of floor management of English and Japanese speakers,” World Englishes, vol. 7, no. 3, pp. 269–288, 1988.

  9. [9]

    Are you listening? Cultural influences on the use of supportive verbal feedback in conversation

    M. Stubbe, “Are you listening? Cultural influences on the use of supportive verbal feedback in conversation,” Journal of Pragmatics, vol. 29, no. 3, pp. 257–289, 1998.

  10. [10]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities

    D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” in Findings of EMNLP, pp. 15757–15773, 2023.

  11. [11]

    AudioPaLM: A large language model that can speak and listen

    P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, et al., “AudioPaLM: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023.

  12. [12]

    Unsupervised cross-lingual representation learning for speech recognition

    A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. Interspeech, pp. 2426–2430, 2020.

  13. [13]

    VALL-E X: Speak foreign languages with your own voice

    X. Zhang, Z. Tan, R. Huang, et al., “VALL-E X: Speak foreign languages with your own voice,” arXiv preprint arXiv:2303.03926, 2023.

  14. [14]

    High fidelity neural audio compression

    A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.

  15. [15]

    SoundStream: An end-to-end neural audio codec

    N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 495–507, 2021.

  16. [16]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. EMNLP: System Demonstrations, pp. 66–71, 2018.

  17. [17]

    WhisperX: Time-accurate speech transcription of long-form audio

    M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023.

  18. [18]

    Decoupled weight decay regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019.

  19. [19]

    Llama 2: Open foundation and fine-tuned chat models

    H. Touvron, L. Martin, K. Stone, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.

  20. [20]

    ZeRO: Memory optimizations toward training trillion parameter models

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “ZeRO: Memory optimizations toward training trillion parameter models,” in Proc. SC, 2020.