CrossAccent-TTS: Cross-Lingual Accent-Intensity Controllable Text-to-Speech via Disentangled Speaker and Accent Representations

Ankit Tatawat; Ashishkumar P. Gudmalwar; Nirmesh J. Shah; Pankaj Wasnik; Ram Annamdevula

arxiv: 2606.25403 · v1 · pith:B5VK3SDXnew · submitted 2026-06-24 · eess.AS · cs.AI · cs.SD

Pith reviewed 2026-06-25 20:19 UTC · model grok-4.3

keywords: cross-lingual TTS, accent control, disentangled representations, accent intensity controller, Indic languages, text-to-speech, accent conversion, low-resource languages

The pith

CrossAccent-TTS disentangles speaker and accent representations so that accent intensity can be controlled in cross-lingual TTS.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes a framework for controlling accent intensity in text-to-speech systems that work across languages. It separates speaker identity from accent features using disentangled representations. The key addition is an Accent Intensity Controller that mixes language embeddings with different weights to adjust how strong the accent sounds. This is important for low-resource languages where users want to modulate accent without changing the speaker's voice or reducing quality. The experiments show better performance than baselines on accent similarity while keeping naturalness.

Core claim

CrossAccent-TTS achieves precise control of accent intensity by injecting weighted language embeddings into the accent subspace with the Accent Intensity Controller, enabling smooth interpolation between accents and fine-grained modulation at inference time while preserving speaker identity and naturalness on the Indic Multilingual and L2-arctic datasets.

What carries the argument

The Accent Intensity Controller (AIC) injects weighted language embeddings into the accent subspace, which allows accent strength to be modulated.
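To make the mechanism easier to picture, here is a minimal sketch of what "injecting weighted language embeddings" could look like. Everything in it is an assumption made for illustration: the function names `mix_language_embeddings` and `inject_accent`, the normalised weighting, and the additive injection are hypothetical, since the abstract does not give the AIC's actual equations.

```python
from typing import Dict, List

Vector = List[float]


def mix_language_embeddings(embeddings: Dict[str, Vector],
                            weights: Dict[str, float]) -> Vector:
    """Weighted average of per-language embeddings (assumed form, not the paper's)."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("weights must sum to a positive value")
    dim = len(next(iter(embeddings.values())))
    mixed = [0.0] * dim
    for lang, w in weights.items():
        for i, x in enumerate(embeddings[lang]):
            mixed[i] += (w / total) * x
    return mixed


def inject_accent(accent_free: Vector, accent: Vector, intensity: float) -> Vector:
    """Add the mixed accent embedding, scaled by an intensity in [0, 1]."""
    if not 0.0 <= intensity <= 1.0:
        raise ValueError("intensity must lie in [0, 1]")
    return [h + intensity * a for h, a in zip(accent_free, accent)]


# Toy 3-dimensional embeddings; a real system would use learned vectors.
lang_emb = {"hi": [1.0, 0.0, 0.0], "ta": [0.0, 1.0, 0.0]}
accent = mix_language_embeddings(lang_emb, {"hi": 0.75, "ta": 0.25})
hidden = [0.5, 0.5, 0.5]
for alpha in (0.0, 0.5, 1.0):
    print(alpha, inject_accent(hidden, accent, alpha))
```

With an intensity of 0 the accent-free representation passes through unchanged, which is the behaviour one would expect from a "no accent" setting under this assumed formulation.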

If this is right

Accent control and conversion become possible in cross-lingual TTS for phonetically diverse languages.
Accent similarity and controllability improve over strong baselines.
Speaker similarity and naturalness remain high during accent changes.
Fine-grained modulation of accent strength is available at inference time without retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Similar disentanglement techniques could apply to other attributes like emotion or speaking rate in TTS.
The method may help create more inclusive voice assistants that adapt accents for different users.
Further work could test if the control works for real-time applications or with very limited training data.

Load-bearing premise

That speaker and accent representations can be reliably disentangled in the latent space for phonetically diverse low-resource languages.

What would settle it

Listening tests showing that different AIC weight values produce no perceptible change in accent intensity or that speaker similarity drops significantly when accent is modulated.
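An objective companion to such a listening test would be to check whether a measured accent-similarity score rises with the intensity setting. The sketch below computes a Spearman rank correlation for that purpose. The function names and all scores are hypothetical placeholders, not results from the paper; a correlation near 1 would suggest the knob behaves monotonically, while a value near 0 would suggest it has no systematic effect.

```python
from typing import List, Sequence


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    start = 0
    while start < len(order):
        end = start
        while end + 1 < len(order) and values[order[end + 1]] == values[order[start]]:
            end += 1
        shared = (start + end) / 2 + 1
        for k in range(start, end + 1):
            ranks[order[k]] = shared
        start = end + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = average_ranks(x), average_ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        return 0.0  # a constant sequence carries no trend
    return cov / (var_x * var_y) ** 0.5


# Hypothetical accent-similarity scores at four intensity settings.
intensities = [0.0, 0.3, 0.6, 1.0]
responsive = [0.41, 0.52, 0.63, 0.78]    # score rises with intensity
unresponsive = [0.55, 0.52, 0.56, 0.53]  # score ignores intensity

print(spearman(intensities, responsive))
print(spearman(intensities, unresponsive))
```

A rank-based statistic is used here because the claim under test is only that more intensity yields more accent, not that the relationship is linear.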

Figures

Figures reproduced from arXiv: 2606.25403 by Ankit Tatawat, Ashishkumar P. Gudmalwar, Nirmesh J. Shah, Pankaj Wasnik, Ram Annamdevula.

**Figure 1.** Model architecture of the proposed CrossAccent TTS.

**Figure 3.** Accent Similarity MOS of Foreign Accents.

**Figure 4.** Effect of Accent Intensity on Accent Similarity score.

Original abstract

Accent conversion and controllability remain fundamental challenges in cross-lingual text-to-speech (TTS), particularly for low-resource and phonetically diverse Indic languages. While recent large language model (LLM)-based TTS systems exhibit strong cross-lingual generalization, they provide limited explicit control over accent characteristics and intensity. In this paper, we propose CrossAccent-TTS, a framework that enables both accent control and conversion while preserving speaker identity. Specifically, we introduce an Accent Intensity Controller (AIC) that injects weighted language embeddings into the accent subspace, allowing smooth interpolation between accents and fine-grained modulation of accent strength at inference time. Experiments on the Indic Multilingual and L2-arctic datasets show that CrossAccent-TTS achieves precise control of accent intensity, outperforming strong baselines in accent similarity and controllability while maintaining speaker similarity and naturalness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

The new AIC module offers a control knob for accent intensity but the paper gives no proof that the speaker-accent split works on the target languages.

CrossAccent-TTS puts forward an Accent Intensity Controller that allows weighted injection of language embeddings into the accent subspace for adjustable intensity at inference time. The module itself is the main addition here.

The work does a decent job framing the need for explicit accent control in cross-lingual settings with low-resource Indic languages, where LLM-based TTS falls short on fine-grained modulation. It keeps the focus on preserving speaker identity alongside the accent changes.

The soft spots are in the validation. The abstract assumes reliable disentanglement of speaker and accent but supplies no architecture details on how that separation is enforced and no quantitative checks that it succeeded on the datasets. Without those, the central claim about precise control rests on untested ground. The results section is summarized without any actual numbers or comparisons shown, which leaves the outperformance statement unsupported from what's available.

This paper is aimed at speech synthesis groups working on accent conversion and control. A reader interested in practical TTS extensions for Indic languages could get some value from the AIC concept if the full paper fills in the gaps. It should go to peer review so the methods can be scrutinized properly.

Referee Report

2 major / 0 minor

Summary. The paper proposes CrossAccent-TTS, a cross-lingual TTS framework that enables accent control and conversion while preserving speaker identity. It introduces an Accent Intensity Controller (AIC) that injects weighted language embeddings into a disentangled accent subspace to support smooth interpolation and fine-grained intensity modulation at inference. Experiments on the Indic Multilingual and L2-arctic datasets are reported to demonstrate precise accent-intensity control, with outperformance over strong baselines in accent similarity and controllability while maintaining speaker similarity and naturalness.

Significance. If the disentanglement succeeds and the reported gains are reproducible with proper metrics, the work would offer a practical advance in controllable TTS for low-resource Indic languages, where explicit accent-intensity control has been limited in recent LLM-based systems. The AIC mechanism, if validated, could enable new applications in accent conversion without trade-offs in speaker fidelity or naturalness.

major comments (2)

[Proposed Method (disentanglement and AIC subsections)] The central claim requires reliable separation of speaker and accent subspaces so that AIC-weighted language embeddings modulate only accent intensity. The manuscript supplies no architecture details on the separation mechanism (adversarial loss, orthogonality constraint, or mutual-information minimization) nor any quantitative check (probing classifier accuracy, subspace similarity metrics) that the separation succeeded on the Indic Multilingual or L2-arctic data. This is load-bearing for the claim that AIC produces perceptually meaningful intensity changes for phonetically diverse low-resource languages.
[Experiments and Results] The abstract states outperformance on accent similarity and controllability but supplies no metrics, baseline details, statistical tests, or error analysis. Without these, the experimental support for the claims of precise control and superiority cannot be assessed.
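One of the quantitative checks named in the first comment, a probing classifier, can be illustrated with a very small example. The sketch below fits a nearest-centroid probe that tries to predict accent labels from speaker embeddings; accuracy near chance would be consistent with little accent leakage, while high accuracy would indicate that accent information survives in the speaker subspace. The nearest-centroid probe is a stand-in chosen for brevity, the function names are hypothetical, and the vectors are toy values rather than data from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Vector = Sequence[float]
Example = Tuple[Vector, str]


def centroid(vectors: List[Vector]) -> List[float]:
    """Component-wise mean of a non-empty list of vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]


def sq_dist(a: Vector, b: Vector) -> float:
    return sum((x - y) ** 2 for x, y in zip(a, b))


def fit_probe(train: List[Example]) -> Dict[str, List[float]]:
    """One centroid per accent label."""
    by_label: Dict[str, List[Vector]] = defaultdict(list)
    for vec, label in train:
        by_label[label].append(vec)
    return {label: centroid(vs) for label, vs in by_label.items()}


def probe_accuracy(centroids: Dict[str, List[float]], test: List[Example]) -> float:
    """Fraction of test vectors whose nearest centroid has the right label."""
    correct = 0
    for vec, label in test:
        predicted = min(centroids, key=lambda c: sq_dist(vec, centroids[c]))
        correct += predicted == label
    return correct / len(test)


# Toy "leaky" speaker embeddings: the first dimension encodes the accent.
leaky_train = [([1.0, 0.2], "hi"), ([0.9, -0.1], "hi"),
               ([-1.0, 0.1], "en"), ([-0.8, -0.2], "en")]
leaky_test = [([0.7, 0.0], "hi"), ([-0.9, 0.3], "en")]

# Toy "clean" embeddings: both accents share the same centroid.
clean_train = [([0.2, 1.0], "hi"), ([-0.2, -1.0], "hi"),
               ([0.2, -1.0], "en"), ([-0.2, 1.0], "en")]
clean_test = [([0.1, 0.5], "hi"), ([0.1, -0.5], "en")]

print(probe_accuracy(fit_probe(leaky_train), leaky_test))
print(probe_accuracy(fit_probe(clean_train), clean_test))
```

In practice a stronger probe (for example a small neural classifier) and a held-out speaker split would be needed before drawing conclusions, since a weak probe can miss leakage that a stronger one would find.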

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate the revisions we will make to strengthen the paper.

Referee: [Proposed Method (disentanglement and AIC subsections)] The central claim requires reliable separation of speaker and accent subspaces so that AIC-weighted language embeddings modulate only accent intensity. The manuscript supplies no architecture details on the separation mechanism (adversarial loss, orthogonality constraint, or mutual-information minimization) nor any quantitative check (probing classifier accuracy, subspace similarity metrics) that the separation succeeded on the Indic Multilingual or L2-arctic data. This is load-bearing for the claim that AIC produces perceptually meaningful intensity changes for phonetically diverse low-resource languages.

Authors: We agree that the separation mechanism and its validation are central to the claims. The current manuscript describes the disentangled representations at a high level but does not detail the specific separation technique or provide quantitative verification. In the revised version we will add a dedicated subsection specifying the separation approach (including any adversarial losses, orthogonality constraints or mutual-information terms) together with probing-classifier accuracies and subspace-similarity metrics evaluated on both the Indic Multilingual and L2-arctic datasets. revision: yes
Referee: [Experiments and Results] The abstract states outperformance on accent similarity and controllability but supplies no metrics, baseline details, statistical tests, or error analysis. Without these, the experimental support for the claims of precise control and superiority cannot be assessed.

Authors: The abstract is intentionally concise. The full experimental section does report the relevant metrics and baselines; however, we acknowledge that statistical significance tests and error analyses could be presented more explicitly. In the revision we will add these elements (including p-values or confidence intervals where appropriate) and ensure they are clearly cross-referenced from the abstract and results tables. revision: yes
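As an illustration of the kind of confidence interval promised in this response, the sketch below computes a percentile bootstrap interval for the mean paired difference between two systems' ratings. The function name `bootstrap_ci`, the resample count, and the rating differences are all hypothetical; none of the numbers come from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(diffs: List[float], n_resamples: int = 2000,
                 alpha: float = 0.05, seed: int = 0) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) using a percentile bootstrap over paired differences."""
    if not diffs:
        raise ValueError("diffs must not be empty")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(diffs) / n, lower, upper


# Hypothetical per-listener differences (proposed minus baseline) in accent-similarity rating.
differences = [0.4, 0.1, 0.3, -0.1, 0.5, 0.2, 0.3, 0.0, 0.4, 0.2]
mean, lower, upper = bootstrap_ci(differences)
print(f"mean difference {mean:.2f}, 95% CI [{lower:.2f}, {upper:.2f}]")
```

If the resulting interval excludes zero, the difference would usually be reported as significant at the chosen level; pairing by listener or utterance matters because ratings from the same rater are correlated.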

Circularity Check

0 steps flagged

No circularity; claims rest on experimental comparisons without self-referential derivations

The manuscript introduces CrossAccent-TTS with an Accent Intensity Controller (AIC) that injects weighted language embeddings, but the provided sections contain no equations, derivations, or first-principles steps that reduce by construction to fitted inputs or self-citations. The central claims of precise accent-intensity control and disentanglement are presented as outcomes of architecture design validated on Indic Multilingual and L2-arctic datasets, not as tautological re-statements of the inputs. No load-bearing self-citation chains, ansatzes smuggled via prior work, or renaming of known results appear in the abstract or framework description. This is the expected non-finding for an empirical TTS paper whose contributions are architectural and comparative rather than analytic.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no equations, training objectives, or architectural details, so no free parameters, axioms, or invented entities can be identified.

pith-pipeline@v0.9.1-grok · 5706 in / 932 out tokens · 19632 ms · 2026-06-25T20:19:56.108288+00:00 · methodology

Reference graph

Works this paper leans on

37 extracted references · 2 linked inside Pith

[1] Introduction. Speech accent refers to systematic variations in phonemes, rhythm, intonation, and linguistic structure, often providing cues about a speaker’s background, such as their geographic origin or native language [1]. However, disentangling accent from other speaker-specific attributes, such as pitch range, timbre, and vocal-tract characteristics...
[2] We introduce an Accent Intensity Controller that enables continuous accent modulation at inference time without requiring accent-specific training data.
[3] We propose an Accent Suppression Module that promotes adversarial disentanglement of accent from speaker and style representations, thereby preserving speaker similarity.
[4] We evaluate our method on the Indic Multilingual and L2 ARCTIC datasets, demonstrating consistent accent controllability across Indian and foreign accents. The rest of the paper is organized as follows: Section 1 covers related work, followed by Section 3, where we describe the proposed method. Section 4 details the training procedure and experiments, a...
[5] Related Works. Early Text-to-Speech (TTS) systems primarily focused on improving naturalness and speaker similarity, with limited attention given to accent modeling. Traditional approaches [8] attempted to handle accent and speaker variation by conditioning multi-speaker TTS systems on speaker embeddings extracted using a speaker-verification encoder...
[6] ...and GST-based [19] baselines. Closest to our inspiration, [20] proposed expanding language embeddings across all acoustic tokens to suppress L2 accent leakage in cross-lingual synthesis. We build on this direction by extending token-level language conditioning to accent-agnostic representations derived from speech codecs.
[7] Proposed Methodology. This section describes the proposed Cross-Accent TTS framework. An overview of the architecture is illustrated in Figure 1. The system consists of four main components: (i) speech tokenization using Neucodec, (ii) a Perceiver Resampler for speaker and style encoding with a fixed-length bottleneck, (iii) adversarial suppression of accent ...
[8] Training and Evaluation Setup. This section describes the experimental setup, evaluation metrics, and results to assess the proposed model’s performance. 4.1. Datasets. We conduct experiments on two datasets, Indic Multilingual and L2 ARCTIC, to evaluate performance in low-resource and second-language (L2) accent scenarios. 4.1.1. Indic Multilingual Datas...
[9] Conclusion. This paper presents CrossAccent TTS, a cross-lingual, accent-controllable text-to-speech framework that enables both accent conversion and continuous modulation of accent intensity while preserving speaker identity. The proposed Accent Intensity Controller injects weighted language embeddings into disentangled representations, allowing smoot...
[10] Generative AI Use Disclosure. Generative AI tools were used solely for language refinement, grammar correction, and improving the overall clarity and readability of the manuscript. These tools assisted in polishing and structuring the text but were not used to generate or design the core research ideas, methodology, experimental setup, analysis, or con...
[11] A. Moyer, Foreign accent: The phenomenon of non-native speech. Cambridge University Press, 2013.
[12] X. Zhou, M. Zhang, Y. Zhou, Z. Wu, and H. Li, “Multi-scale accent modeling and disentangling for multi-speaker multi-accent text-to-speech synthesis,” IEEE Transactions on Audio, Speech and Language Processing, 2026.
[13] R. Liu, B. Sisman, G. Gao, and H. Li, “Controllable accented text-to-speech synthesis with fine and coarse-grained intensity rendering,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2188–2201, 2024.
[14] X. Zhou, M. Zhang, Y. Zhou, Z. Wu, and H. Li, “Accented text-to-speech synthesis with limited data,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1699–1711, 2024.
[15] E. Casanova, R. Langman, P. Neekhara, S. Hussain, J. Li, S. Ghosh, A. Jukić, and S.-g. Lee, “Low frame-rate speech codec: a codec designed for fast high-quality speech llm training and inference,” in ICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5.
[16] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, and J. Weber, “XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,” in Interspeech 2024, 2024, pp. 4978–4982.
[17] N. Sahipjohn, A. Gudmalwar, N. Shah, P. Wasnik, and R. R. Shah, “DubWise: Video-Guided Speech Duration Control in Multimodal LLM-based Text-to-Speech for Dubbing,” in Interspeech 2024, 2024, pp. 2960–2964.
[18] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. Lopez Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” Advances in neural information processing systems, vol. 31, 2018.
[19] S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, “Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers,” arXiv preprint arXiv:2406.05370, 2024.
[20] Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” in International Conference on Learning Representations (ICLR), 2025, pp. 47127–47150. [Online]. Available: https://proceedings.iclr.cc/paper_files/paper/2025/file/74a31a3b862eb7f01...
[21] C. Lu, X. Wen, L. Song, and J. Oh, “Robust neural codec language modeling with phoneme position prediction for zero-shot tts,” in Proc. Interspeech 2025, 2025, pp. 2475–2479.
[22] X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025.
[23] P. Mousavi, G. Maimon, A. Moumen, D. Petermann, J. Shi, H. Wu, H. Yang, A. Kuznetsova, A. Ploujnikov, R. Marxer et al., “Discrete audio tokens: More than a survey!” arXiv preprint arXiv:2506.10274, 2025.
[24] R. Badlani, A. Arora, S. Ghosh, R. Valle, K. J. Shih, J. F. Santos, B. Ginsburg, and B. Catanzaro, “Vani: Very-lightweight accent-controllable tts for native and non-native speakers with identity preservation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–2.
[25] P. S. Varadhan, S. Anand, S. Siddhartha, and M. M. Khapra, “Phir hera fairy: An english fairytaler is a strong faker of fluent speech in low-resource indian languages,” arXiv preprint arXiv:2505.20693, 2025.
[26] H. L. Xinyuan, Z. Cai, A. Garg, K. Duh, L. P. García-Perera, S. Khudanpur, N. Andrews, and M. Wiesner, “Scalable controllable accented tts,” arXiv preprint arXiv:2508.07426, 2025.
[27] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: A Non-native English Speech Corpus,” in Interspeech 2018, 2018, pp. 2783–2787.
[28] J. Melechovsky, A. Mehrish, B. Sisman, and D. Herremans, “Accented text-to-speech synthesis with a conditional variational autoencoder,” in TENCON 2024-2024 IEEE Region 10 Conference (TENCON). IEEE, 2024, pp. 343–346.
[29] Y. Wang, D. Stanton, Y. Zhang, R.-S. Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in International conference on machine learning. PMLR, 2018, pp. 5180–5189.
[30] Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023.
[31] H. Julian, R. Beeson, L. Konathala, J. Ulin, and J. Gao, “Finite scalar quantization enables redundant and transmission-robust neural audio compression at low bit-rates,” arXiv preprint arXiv:2509.09550, 2025.
[32] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira, “Perceiver: General perception with iterative attention,” in International conference on machine learning. PMLR, 2021, pp. 4651–4664.
[33] I. Ahmed, S. Islam, P. P. Datta, I. Kabir, N. U. R. Chowdhury, and A. Haque, “Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors,” Authorea Preprints, 2025.
[34] H. He, Z. Shang, C. Wang, X. Li, Y. Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shi et al., “Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation,” IEEE Transactions on Audio, Speech and Language Processing, 2025.
[35] T. Javed, J. Nawale, E. George, S. Joshi, K. Bhogale, D. Mehendale, I. Sethi, A. Ananthanarayanan, H. Faquih, P. Palit et al., “Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 10740–10782.
[36] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “Utmos: Utokyo-sarulab system for voicemos challenge 2022,” arXiv preprint arXiv:2204.02152, 2022.
[37] G. Louppe, “Resemblyzer: Voice encoder for speaker verification,” https://github.com/resemble-ai/Resemblyzer, 2019, accessed: 2026-02-19.

[1] [1]

Introduction Speech accent refers to systematic variations in phonemes, rhythm, intonation, and linguistic structure, often providing cues about a speaker’s background, such as their geographic origin or native language [1]. However, disentangling accent from other speaker-specific attributes, such as pitch range, tim- bre, and vocal-tract characteristics...

[2] [2]

We introduce an Accent Intensity Controller that enables con- tinuous accent modulation at inference time without requir- ing accent-specific training data

[3] [3]

We propose an Accent Suppression Module that promotes ad- versarial disentanglement of accent from speaker and style representations, thereby preserving speaker similarity

[4] [4]

The rest of the paper is organized as follows: Section 1 covers related work, followed by Section 3, where we describe the proposed method

We evaluate our method on the Indic Multilingual and L2 ARCTIC datasets, demonstrating consistent accent control- lability across Indian and foreign accents. The rest of the paper is organized as follows: Section 1 covers related work, followed by Section 3, where we describe the proposed method. Section 4 details the training procedure and experiments, a...

[5] [5]

Related Works Early Text-to-Speech (TTS) systems primarily focused on im- proving naturalness and speaker similarity, with limited atten- tion given to accent modeling. Traditional approaches [8] at- tempted to handle accent and speaker variation by conditioning multi-speaker TTS systems on speaker embeddings extracted using a speaker-verification encoder...

Pith/arXiv arXiv 2026

[6] [6]

Closest to our inspira- tion, [20] proposed expanding language embeddings across all acoustic tokens to suppress L2 accent leakage in cross-lingual synthesis

and GST-based [19] baselines. Closest to our inspira- tion, [20] proposed expanding language embeddings across all acoustic tokens to suppress L2 accent leakage in cross-lingual synthesis. We build on this direction by extending token-level language conditioning to accent-agnostic representations de- rived from speech codecs

[7] [7]

An overview of the architecture is illustrated in 1

Proposed Methodology This section describes the proposed Cross-Accent TTS frame- work. An overview of the architecture is illustrated in 1. The system consists of four main components: (i) speech tokeniza- tion using Neucodec, (ii) a Perceiver Resampler for speaker and style encoding with a fixed-length bottleneck, (iii) adversarial suppression of accent ...

[8] [8]

Training and Evaluation Setup This section describes the experimental setup, evaluation met- rics, and results to assess the proposed model’s performance. 4.1. Datasets We conduct experiments on two datasets, Indic Multilingual and L2 ARCTIC, to evaluate performance in low-resource and second-language (L2) accent scenarios. 4.1.1. Indic Multilingual Datas...

[9] [9]

Conclusion This paper presents CrossAccent TTS, a cross-lingual, accent- controllable text-to-speech framework that enables both accent conversion and continuous modulation of accent intensity while preserving speaker identity. The proposed Accent Intensity Controller injects weighted language embeddings into disen- tangled representations, allowing smoot...

[10] [10]

Generative AI Use Disclosure Generative AI tools were used solely for language refinement, grammar correction, and improving the overall clarity and read- ability of the manuscript. These tools assisted in polishing and structuring the text but were not used to generate or design the core research ideas, methodology, experimental setup, analy- sis, or con...

[11] [11]

Moyer,F oreign accent: The phenomenon of non-native speech

A. Moyer,F oreign accent: The phenomenon of non-native speech. Cambridge University Press, 2013

2013

[12] [12]

Multi-scale ac- cent modeling and disentangling for multi-speaker multi-accent text-to-speech synthesis,

X. Zhou, M. Zhang, Y . Zhou, Z. Wu, and H. Li, “Multi-scale ac- cent modeling and disentangling for multi-speaker multi-accent text-to-speech synthesis,”IEEE Transactions on Audio, Speech and Language Processing, 2026

2026

[13] [13]

Controllable accented text- to-speech synthesis with fine and coarse-grained intensity render- ing,

R. Liu, B. Sisman, G. Gao, and H. Li, “Controllable accented text- to-speech synthesis with fine and coarse-grained intensity render- ing,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2188–2201, 2024

2024

[14] [14]

Accented text- to-speech synthesis with limited data,

X. Zhou, M. Zhang, Y . Zhou, Z. Wu, and H. Li, “Accented text- to-speech synthesis with limited data,”IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 1699– 1711, 2024

2024

[15] [15]

Low frame-rate speech codec: a codec designed for fast high-quality speech llm training and in- ference,

E. Casanova, R. Langman, P. Neekhara, S. Hussain, J. Li, S. Ghosh, A. Juki´c, and S.-g. Lee, “Low frame-rate speech codec: a codec designed for fast high-quality speech llm training and in- ference,” inICASSP 2025-2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025

[16] [16]

XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,

E. Casanova, K. Davis, E. G ¨olge, G. G ¨oknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, and J. We- ber, “XTTS: a Massively Multilingual Zero-Shot Text-to-Speech Model,” inInterspeech 2024, 2024, pp. 4978–4982

2024

[17] [17]

DubWise: Video-Guided Speech Duration Control in Mul- timodal LLM-based Text-to-Speech for Dubbing,

N. Sahipjohn, A. Gudmalwar, N. Shah, P. Wasnik, and R. R. Shah, “DubWise: Video-Guided Speech Duration Control in Mul- timodal LLM-based Text-to-Speech for Dubbing,” inInterspeech 2024, 2024, pp. 2960–2964

2024

[18] Y. Jia, Y. Zhang, R. Weiss, Q. Wang, J. Shen, F. Ren, P. Nguyen, R. Pang, I. Lopez Moreno, Y. Wu et al., “Transfer learning from speaker verification to multispeaker text-to-speech synthesis,” Advances in neural information processing systems, vol. 31, 2018

[19] S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, “Vall-e 2: Neural codec language models are human parity zero-shot text to speech synthesizers,” arXiv preprint arXiv:2406.05370, 2024

[20] Y. Wang, H. Zhan, L. Liu, R. Zeng, H. Guo, J. Zheng, Q. Zhang, X. Zhang, S. Zhang, and Z. Wu, “Maskgct: Zero-shot text-to-speech with masked generative codec transformer,” in International Conference on Learning Representations (ICLR), 2025, pp. 47 127–47 150. [Online]. Available: https://proceedings.iclr.cc/paper_files/paper/2025/file/74a31a3b862eb7f01...

[21] C. Lu, X. Wen, L. Song, and J. Oh, “Robust neural codec language modeling with phoneme position prediction for zero-shot tts,” in Proc. Interspeech 2025, 2025, pp. 2475–2479

[22] X. Wang, M. Jiang, Z. Ma, Z. Zhang, S. Liu, L. Li, Z. Liang, Q. Zheng, R. Wang, X. Feng et al., “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens,” arXiv preprint arXiv:2503.01710, 2025

[23] P. Mousavi, G. Maimon, A. Moumen, D. Petermann, J. Shi, H. Wu, H. Yang, A. Kuznetsova, A. Ploujnikov, R. Marxer et al., “Discrete audio tokens: More than a survey!” arXiv preprint arXiv:2506.10274, 2025

[24] R. Badlani, A. Arora, S. Ghosh, R. Valle, K. J. Shih, J. F. Santos, B. Ginsburg, and B. Catanzaro, “Vani: Very-lightweight accent-controllable tts for native and non-native speakers with identity preservation,” in ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2023, pp. 1–2

[25] P. S. Varadhan, S. Anand, S. Siddhartha, and M. M. Khapra, “Phir hera fairy: An english fairytaler is a strong faker of fluent speech in low-resource indian languages,” arXiv preprint arXiv:2505.20693, 2025

[26] H. L. Xinyuan, Z. Cai, A. Garg, K. Duh, L. P. García-Perera, S. Khudanpur, N. Andrews, and M. Wiesner, “Scalable controllable accented tts,” arXiv preprint arXiv:2508.07426, 2025

[27] G. Zhao, S. Sonsaat, A. Silpachai, I. Lucic, E. Chukharev-Hudilainen, J. Levis, and R. Gutierrez-Osuna, “L2-ARCTIC: A Non-native English Speech Corpus,” in Interspeech 2018, 2018, pp. 2783–2787

[28] J. Melechovsky, A. Mehrish, B. Sisman, and D. Herremans, “Accented text-to-speech synthesis with a conditional variational autoencoder,” in TENCON 2024-2024 IEEE Region 10 Conference (TENCON). IEEE, 2024, pp. 343–346

[29] Y. Wang, D. Stanton, Y. Zhang, R.-S. Ryan, E. Battenberg, J. Shor, Y. Xiao, Y. Jia, F. Ren, and R. A. Saurous, “Style tokens: Unsupervised style modeling, control and transfer in end-to-end speech synthesis,” in International conference on machine learning. PMLR, 2018, pp. 5180–5189

[30] Z. Zhang, L. Zhou, C. Wang, S. Chen, Y. Wu, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., “Speak foreign languages with your own voice: Cross-lingual neural codec language modeling,” arXiv preprint arXiv:2303.03926, 2023

[31] H. Julian, R. Beeson, L. Konathala, J. Ulin, and J. Gao, “Finite scalar quantization enables redundant and transmission-robust neural audio compression at low bit-rates,” arXiv preprint arXiv:2509.09550, 2025

[32] A. Jaegle, F. Gimeno, A. Brock, O. Vinyals, A. Zisserman, and J. Carreira, “Perceiver: General perception with iterative attention,” in International conference on machine learning. PMLR, 2021, pp. 4651–4664

[33] I. Ahmed, S. Islam, P. P. Datta, I. Kabir, N. U. R. Chowdhury, and A. Haque, “Qwen 2.5: A comprehensive review of the leading resource-efficient llm with potentioal to surpass all competitors,” Authorea Preprints, 2025

[34] H. He, Z. Shang, C. Wang, X. Li, Y. Gu, H. Hua, L. Liu, C. Yang, J. Li, P. Shi et al., “Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation,” IEEE Transactions on Audio, Speech and Language Processing, 2025

[35] T. Javed, J. Nawale, E. George, S. Joshi, K. Bhogale, D. Mehendale, I. Sethi, A. Ananthanarayanan, H. Faquih, P. Palit et al., “Indicvoices: Towards building an inclusive multilingual speech dataset for indian languages,” in Findings of the Association for Computational Linguistics: ACL 2024, 2024, pp. 10 740–10 782

[36] T. Saeki, D. Xin, W. Nakata, T. Koriyama, S. Takamichi, and H. Saruwatari, “Utmos: Utokyo-sarulab system for voicemos challenge 2022,” arXiv preprint arXiv:2204.02152, 2022

[37] G. Louppe, “Resemblyzer: Voice encoder for speaker verification,” https://github.com/resemble-ai/Resemblyzer, 2019, accessed: 2026-02-19