pith. machine review for the scientific record.

arxiv: 2604.23295 · v1 · submitted 2026-04-25 · 💻 cs.CL · cs.AI

Recognition: unknown

Human-1 by Josh Talks: A Full-Duplex Conversational Modeling Framework in Hindi using Real-World Conversations

Bhaskar Singh, Pranav Sharma, Shobhit Banga

Authors on Pith: no claims yet.

Pith reviewed 2026-05-08 08:11 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords full-duplex dialogue systems · Hindi spoken dialogue · conversational turn-taking · spontaneous speech data · tokenizer adaptation · duplex speech modeling · real-world conversations · Hindi conversational AI

The pith

The first open full-duplex spoken dialogue system for Hindi learns turn-taking and overlaps directly from 26,000 hours of real conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to build an open and reproducible full-duplex spoken dialogue system for Hindi that can model interruptions, overlaps, and backchannels the way people actually talk. It achieves this by adapting a duplex speech architecture with a custom Hindi tokenizer, keeping the original audio components, and training on a large collection of spontaneous conversations recorded with separate channels for each speaker. A sympathetic reader would care because most dialogue systems still force rigid turn-taking that feels unnatural, and this work shows a path to more fluid voice interactions in a major world language. The method uses a two-stage process of broad pre-training on the full dataset followed by fine-tuning on a smaller conversational subset, with results checked via prompted continuations judged both automatically and by humans.

Core claim

By replacing the original English tokenizer with a custom Hindi one, reinitializing only the text-vocabulary-dependent parameters, retaining the pre-trained audio components, and training on 26,000 hours of real spontaneous conversations from 14,695 speakers recorded on separate channels, the adapted model learns to produce natural and meaningful full-duplex conversational behavior in Hindi, as shown by evaluation on prompted dialogue continuation using automatic metrics and human judgments.
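
To make that adaptation step concrete, here is a minimal sketch in PyTorch: swap in the new Hindi vocabulary, reinitialize only the vocabulary-dependent modules, and freeze the pre-trained audio components. Every module name and size below is a hypothetical stand-in, not Moshi's actual interface.

```python
# Hedged sketch of the adaptation in the core claim. The toy model and its
# attribute names (audio_encoder, text_embedding, text_head) are invented
# stand-ins for the corresponding parts of a duplex architecture like Moshi.
import torch.nn as nn

class ToyDuplexModel(nn.Module):
    def __init__(self, text_vocab: int, d_model: int = 512, audio_codes: int = 2048):
        super().__init__()
        self.audio_encoder = nn.Embedding(audio_codes, d_model)  # pre-trained, retained
        self.backbone = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(d_model, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.text_embedding = nn.Embedding(text_vocab, d_model)  # vocab-dependent
        self.text_head = nn.Linear(d_model, text_vocab)          # vocab-dependent

def adapt_to_hindi(model: ToyDuplexModel, hindi_vocab_size: int) -> ToyDuplexModel:
    d_model = model.text_embedding.embedding_dim
    # Reinitialize only the text-vocabulary-dependent parameters for the new
    # Hindi (e.g., SentencePiece) vocabulary; all other weights are kept.
    model.text_embedding = nn.Embedding(hindi_vocab_size, d_model)
    model.text_head = nn.Linear(d_model, hindi_vocab_size)
    # Freeze the retained, pre-trained audio components.
    for p in model.audio_encoder.parameters():
        p.requires_grad = False
    return model

model = adapt_to_hindi(ToyDuplexModel(text_vocab=32_000), hindi_vocab_size=48_000)
```

The vocabulary sizes are placeholders; the paper does not state them here.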

What carries the argument

The adaptation of a duplex speech architecture with a Hindi tokenizer and two-stage training on separate-channel spontaneous conversation data, which lets the model observe and reproduce turn-taking and overlap patterns directly from natural interactions.
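
The value of separate channels is easy to see in miniature: once each speaker's speech intervals are known (say, from per-channel voice activity detection), overlap time can be read off directly, with no source separation. The interval format and timestamps below are illustrative assumptions, not the paper's pipeline.

```python
# Hedged sketch: overlap statistics from two speakers' speech intervals,
# as separate-channel recordings make possible. Timestamps are invented.
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_sec, end_sec) of one speech segment

def overlap_seconds(a: List[Interval], b: List[Interval]) -> float:
    """Total time both speakers are talking at once (naive pairwise scan)."""
    total = 0.0
    for s1, e1 in a:
        for s2, e2 in b:
            total += max(0.0, min(e1, e2) - max(s1, s2))
    return total

speaker_a = [(0.0, 4.2), (6.0, 9.5)]
speaker_b = [(3.8, 6.3), (9.3, 12.0)]  # brief overlaps at the turn boundaries
print(overlap_seconds(speaker_a, speaker_b))  # 0.4 + 0.3 + 0.2 ≈ 0.9 s
```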

If this is right

  • The system generates natural full-duplex behaviors such as interruptions and backchannels in Hindi.
  • It provides an initial foundation for real-time spoken dialogue applications in Hindi.
  • The same adaptation method can extend to other Indian languages.
  • Direct training on separate speaker channels captures overlap patterns without synthetic data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Separate-channel recordings may give clearer signals for learning simultaneous speech than mixed audio would.
  • The approach could speed development of conversational systems for other languages that lack large native datasets by reusing audio features.
  • Practical voice interfaces built on this model might respond more fluidly in everyday Hindi use.

Load-bearing premise

Retaining the pre-trained audio components from the English model while only updating text-related parameters will allow effective learning of Hindi conversational behaviors from the collected data.
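
One way to stress this premise, in the spirit of the probing analysis the simulated rebuttal proposes below, is a linear probe on the frozen audio representations: if a simple classifier trained on them recovers a Hindi prosodic distinction, the frozen stack plausibly encodes what turn-taking needs. Everything here (features, the binary prosodic label, sizes) is a random stand-in for illustration.

```python
# Hedged sketch of a linear probe on frozen audio representations. The
# features and labels are random placeholders, not real model outputs.
import torch
import torch.nn as nn

frozen_features = torch.randn(256, 512)  # stand-in for frozen audio embeddings
labels = torch.randint(0, 2, (256,))     # hypothetical prosodic label (e.g., turn-final vs. not)

probe = nn.Linear(512, 2)
opt = torch.optim.AdamW(probe.parameters(), lr=1e-3)
for _ in range(100):
    loss = nn.functional.cross_entropy(probe(frozen_features), labels)
    opt.zero_grad()
    loss.backward()
    opt.step()

accuracy = (probe(frozen_features).argmax(dim=1) == labels).float().mean()
print(float(accuracy))  # well above chance would support the transfer premise
```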

What would settle it

The claim would be undercut by human listeners in blind tests rating the model's Hindi dialogue continuations as less natural in turn-taking or overlap handling than real recorded conversations, or by automatic metrics showing clear failures to maintain coherence during simultaneous speech.
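
A minimal sketch of how the listening half of that test could be scored, assuming paired blind A/B choices between a model continuation and a matched real recording, checked with a two-sided binomial test; the counts are invented.

```python
# Hedged sketch: do blind listeners reliably pick the real recording over the
# model's continuation? Counts below are invented for illustration.
from scipy.stats import binomtest

n_trials = 200      # blind A/B comparisons
prefer_real = 121   # trials where the listener chose the real recording
result = binomtest(prefer_real, n_trials, p=0.5)
print(result.pvalue)  # a small p-value means listeners can tell them apart
```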

Figures

Figures reproduced from arXiv: 2604.23295 by Bhaskar Singh, Pranav Sharma, Shobhit Banga.

Figure 1: Training loss during pre-training on 26,000 hours of data (view at source ↗).
Figure 2: Evaluation loss during fine-tuning; text loss (view at source ↗).
Original abstract

Full-duplex spoken dialogue systems can model natural conversational behaviours such as interruptions, overlaps, and backchannels, yet such systems remain largely unexplored for Indian languages. We present the first open, reproducible full-duplex spoken dialogue system for Hindi by adapting Moshi, a state-of-the-art duplex speech architecture, using a custom Hindi tokeniser and training on 26,000 hours of real spontaneous conversations collected from 14,695 speakers with separate speaker channels, enabling direct learning of turn-taking and overlap patterns from natural interactions. To support Hindi text generation, we replace the original English tokeniser and reinitialise text-vocabulary-dependent parameters while retaining the pre-trained audio components. We propose a two-stage training recipe -- large-scale pre-training followed by fine-tuning on 1,000 hours of conversational data. Evaluation through the prompted dialogue continuation paradigm with both automatic metrics and human judgments demonstrates that the resulting model generates natural and meaningful full-duplex conversational behaviour in Hindi. This work serves as a first step toward real-time duplex spoken dialogue systems for Hindi and other Indian languages.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims to introduce Human-1, the first open and reproducible full-duplex spoken dialogue system for Hindi, by adapting the Moshi architecture. It replaces the English tokenizer with a custom Hindi one, reinitializes only text-vocabulary-dependent parameters while retaining the pre-trained English audio components, and trains on 26,000 hours of spontaneous conversations collected from 14,695 speakers with separate speaker channels. A two-stage recipe (large-scale pre-training followed by fine-tuning on 1,000 hours) is used, with evaluation via prompted dialogue continuation using automatic metrics and human judgments to demonstrate natural and meaningful full-duplex behaviors such as overlaps and turn-taking.

Significance. If substantiated with quantitative evidence, this would be a notable contribution as the first open full-duplex system for Hindi, an under-resourced language for such modeling. The large-scale data collection from real conversations with separate channels and the open, reproducible approach are strengths that could enable further work on conversational dynamics in Indian languages.

major comments (2)
  1. [Abstract and §4] Evaluation: No specific quantitative results are reported for the automatic metrics or human judgments, nor are there baseline comparisons or error analysis. This leaves the central claim that the model 'generates natural and meaningful full-duplex conversational behaviour' without verifiable support.
  2. [§3] Training Recipe: Retaining the frozen English-pretrained audio encoder/decoder while reinitializing only the text parameters assumes these components transfer adequately to Hindi acoustics. Hindi differs in phoneme inventory, syllable structure, and prosody, which are critical for modeling overlaps and backchannels; the manuscript provides no ablation studies or analysis demonstrating that the frozen audio stack produces suitable representations for these phenomena from the Hindi data.
minor comments (2)
  1. [§2] Data collection details (speaker demographics, recording conditions, and quality assurance for the 26,000 hours) could be expanded to strengthen reproducibility claims.
  2. [§4] The prompted continuation evaluation paradigm should be described with more concrete examples of prompts and generated outputs to clarify how full-duplex behaviors are assessed.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive suggestions. We agree that the current manuscript requires strengthening in the areas of quantitative reporting and analysis of the audio component transfer. We will make the revisions outlined below to address both major comments.

Point-by-point responses
  1. Referee: [Abstract and §4] Evaluation: No specific quantitative results are reported for the automatic metrics or human judgments, nor are there baseline comparisons or error analysis. This leaves the central claim that the model 'generates natural and meaningful full-duplex conversational behaviour' without verifiable support.

    Authors: We acknowledge that the manuscript does not report specific numerical values for the automatic metrics (e.g., overlap rate, backchannel frequency, turn-taking accuracy) or human judgment scores, nor does it include baseline comparisons or error analysis in §4 or the abstract. The evaluation section describes the prompted dialogue continuation paradigm and states that results demonstrate natural full-duplex behavior, but we agree this lacks the verifiable quantitative support needed. In the revised manuscript we will add a results table with concrete metric values, human evaluation scores (with inter-annotator agreement), comparisons against at least one baseline where feasible, and a brief error analysis of failure cases in overlaps and backchannels. revision: yes

  2. Referee: [§3] Training Recipe: Retaining the frozen English-pretrained audio encoder/decoder while reinitializing only the text parameters assumes these components transfer adequately to Hindi acoustics. Hindi differs in phoneme inventory, syllable structure, and prosody, which are critical for modeling overlaps and backchannels; the manuscript provides no ablation studies or analysis demonstrating that the frozen audio stack produces suitable representations for these phenomena from the Hindi data.

    Authors: The design choice to freeze the English-pretrained audio encoder/decoder was motivated by the hypothesis that low-level acoustic and prosodic features relevant to turn-taking and overlaps are largely language-independent, allowing the model to leverage the large-scale English pretraining while adapting only the text vocabulary and related parameters for Hindi. However, we recognize that the manuscript provides no ablation studies or direct analysis (e.g., representation similarity or probing for Hindi-specific prosody) to validate this transfer for overlap and backchannel modeling. We will add a short analysis subsection in §3 examining the frozen audio representations on a Hindi validation set and include at least one targeted ablation (e.g., unfreezing the audio stack on a smaller scale) to quantify the impact on full-duplex metrics. revision: partial
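
A minimal sketch of the ablation promised in this second response: unfreeze the audio stack and train it alongside the reinitialized text parameters, here with a smaller learning rate for the pre-trained weights. The stand-in modules and learning rates are invented; AdamW is assumed from the paper's citation of decoupled weight decay [18].

```python
# Hedged sketch of the proposed unfreezing ablation with per-group learning
# rates. Both modules are toy stand-ins for the real parameter groups.
import torch.nn as nn
from torch.optim import AdamW

audio_stack = nn.Linear(512, 512)      # stand-in for the pre-trained audio encoder/decoder
text_modules = nn.Linear(512, 48_000)  # stand-in for the reinitialized Hindi text parameters

for p in audio_stack.parameters():     # ablation: unfreeze instead of keeping frozen
    p.requires_grad = True

optimizer = AdamW([
    {"params": audio_stack.parameters(), "lr": 1e-5},   # gentle updates to pre-trained weights
    {"params": text_modules.parameters(), "lr": 1e-4},  # faster learning for new parameters
])
```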

Circularity Check

0 steps flagged

No circularity: empirical adaptation of existing architecture on new data

full rationale

The paper presents an engineering adaptation of the Moshi model to Hindi via tokenizer replacement, retention of pretrained audio components, and training on a newly collected 26k-hour Hindi conversation dataset. No equations, predictions, or uniqueness theorems are introduced that reduce to fitted parameters, self-definitions, or self-citations. The central claims rest on data collection, a two-stage training recipe, and downstream evaluation metrics rather than any load-bearing derivation that loops back to its own inputs. This is a standard non-circular empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the transferability of Moshi's pre-trained audio components to Hindi and on the assumption that the collected 26,000-hour dataset with separate channels sufficiently captures natural duplex patterns.

axioms (1)
  • domain assumption: Pre-trained audio components from an English duplex model can be retained and combined with a new Hindi text tokenizer to learn conversational turn-taking in Hindi.
    Invoked when describing the adaptation strategy and retention of audio components.

pith-pipeline@v0.9.0 · 5494 in / 1519 out tokens · 35345 ms · 2026-05-08T08:11:33.717109+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 8 canonical work pages · 2 internal anchors

  1. [1]

    Generative spoken dialogue language modeling

    T. A. Nguyen, E. Kharitonov, J. Copet, Y. Adi, W.-N. Hsu, A. Elkahky, P. Tomasello, R. Algayres, B. Sagot, A. Mohamed, and E. Dupoux, “Generative spoken dialogue language modeling,” Transactions of the Association for Computational Linguistics, pp. 250–266, 2023.

  2. [2]

    A full-duplex speech dialogue scheme based on large language model

    P. Wang, S. Lu, Y. Tang, S. Yan, W. Xia, and Y. Xiong, “A full-duplex speech dialogue scheme based on large language model,” in Proc. NeurIPS, 2024.

  3. [3]

    Language model can listen while speaking

    Z. Ma, Y. Song, C. Du, J. Cong, Z. Chen, Y. Wang, Y. Wang, and X. Chen, “Language model can listen while speaking,” arXiv preprint arXiv:2408.02622, 2024.

  4. [4]

    Moshi: a speech-text foundation model for real-time dialogue

    A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, “Moshi: a speech-text foundation model for real-time dialogue,” arXiv preprint arXiv:2410.00037, 2024.

  5. [5]

    Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents

    B. Veluri, B. N. Peloquin, B. Yu, H. Gong, and S. Gollakota, “Beyond turn-based interfaces: Synchronous LLMs as full-duplex dialogue agents,” in Proc. EMNLP, pp. 21390–21402, 2024.

  6. [6]

    Omniflatten: An end-to-end GPT model for seamless voice conversation

    Q. Zhang, L. Cheng, C. Deng, Q. Chen, W. Wang, S. Zheng, J. Liu, H. Yu, and C. Tan, “Omniflatten: An end-to-end GPT model for seamless voice conversation,” arXiv preprint arXiv:2410.17799, 2024.

  7. [7]

    SALMONN-omni: A codec-free LLM for full-duplex speech understanding and generation

    W. Yu, S. Wang, X. Yang, X. Chen, X. Tian, J. Zhang, G. Sun, L. Lu, Y. Wang, and C. Zhang, “SALMONN-omni: A codec-free LLM for full-duplex speech understanding and generation,” arXiv preprint arXiv:2411.18138, 2024.

  8. [8]

    Simultaneous talk—from the perspective of floor management of English and Japanese speakers

    R. Hayashi, “Simultaneous talk—from the perspective of floor management of English and Japanese speakers,” World Englishes, vol. 7, no. 3, pp. 269–288, 1988.

  9. [9]

    Are you listening? Cultural influences on the use of supportive verbal feedback in conversation

    M. Stubbe, “Are you listening? Cultural influences on the use of supportive verbal feedback in conversation,” Journal of Pragmatics, vol. 29, no. 3, pp. 257–289, 1998.

  10. [10]

    SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities

    D. Zhang, S. Li, X. Zhang, J. Zhan, P. Wang, Y. Zhou, and X. Qiu, “SpeechGPT: Empowering large language models with intrinsic cross-modal conversational abilities,” in Findings of EMNLP, pp. 15757–15773, 2023.

  11. [11]

    AudioPaLM: A large language model that can speak and listen

    P. K. Rubenstein, C. Asawaroengchai, D. D. Nguyen, et al., “AudioPaLM: A large language model that can speak and listen,” arXiv preprint arXiv:2306.12925, 2023.

  12. [12]

    Unsupervised cross-lingual representation learning for speech recognition

    A. Conneau, A. Baevski, R. Collobert, A. Mohamed, and M. Auli, “Unsupervised cross-lingual representation learning for speech recognition,” in Proc. Interspeech, pp. 2426–2430, 2020.

  13. [13]

    VALL-E X: Speak foreign languages with your own voice

    X. Zhang, Z. Tan, R. Huang, et al., “VALL-E X: Speak foreign languages with your own voice,” arXiv preprint arXiv:2303.03926, 2023.

  14. [14]

    High fidelity neural audio compression

    A. Défossez, J. Copet, G. Synnaeve, and Y. Adi, “High fidelity neural audio compression,” Transactions on Machine Learning Research, 2023.

  15. [15]

    SoundStream: An end-to-end neural audio codec

    N. Zeghidour, A. Luebs, A. Omran, J. Skoglund, and M. Tagliasacchi, “SoundStream: An end-to-end neural audio codec,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 495–507, 2021.

  16. [16]

    SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

    T. Kudo and J. Richardson, “SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing,” in Proc. EMNLP: System Demonstrations, pp. 66–71, 2018.

  17. [17]

    WhisperX: Time-accurate speech transcription of long-form audio

    M. Bain, J. Huh, T. Han, and A. Zisserman, “WhisperX: Time-accurate speech transcription of long-form audio,” arXiv preprint arXiv:2303.00747, 2023.

  18. [18]

    Decoupled weight decay regularization

    I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” in Proc. ICLR, 2019.

  19. [19]

    Llama 2: Open foundation and fine-tuned chat models

    H. Touvron, L. Martin, K. Stone, et al., “Llama 2: Open foundation and fine-tuned chat models,” arXiv preprint arXiv:2307.09288, 2023.

  20. [20]

    ZeRO: Memory optimizations toward training trillion parameter models

    S. Rajbhandari, J. Rasley, O. Ruwase, and Y. He, “ZeRO: Memory optimizations toward training trillion parameter models,” in Proc. SC, 2020.