pith. machine review for the scientific record.

arxiv: 2604.27281 · v1 · submitted 2026-04-30 · 💻 cs.SD

Recognition: unknown

Accent Conversion: A Problem-Driven Survey of Sociolinguistic and Technical Constraints

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:11 UTC · model grok-4.3

classification 💻 cs.SD
keywords accent conversion · voice conversion · neural speech synthesis · sociolinguistic constraints · data alignment · speaker identity preservation · speech evaluation · accent modification

The pith

Accent conversion methods have progressed from rule-based signal processing to neural architectures in response to persistent problems of data alignment, feature disentanglement, and data scarcity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how accent conversion techniques evolved in response to persistent technical hurdles. It begins with early methods that relied on spectral manipulation and formant analysis, then moves to neural systems that permit more flexible transformations. A key focus is the linguistic context: real-world applications place different demands on how strongly an accent is changed versus how faithfully the original speaker's voice is kept. Readers care because improved systems could aid cross-cultural communication without erasing personal vocal identity. The survey also covers available datasets, evaluation practice, open problems, and directions for future research.

Core claim

This survey establishes that accent conversion has developed through addressing fundamental problems of aligning training data, disentangling accent-related features from speaker identity, and managing limited data resources. The progression moves from rigid rule-based digital signal processing techniques, such as spectral and formant adjustments, to flexible neural architectures that support reference-free conversions. Linguistic foundations inform the process, and application contexts dictate the acceptable trade-off between accent modification strength and identity preservation. The review catalogs speech datasets and evaluation practices while pinpointing ongoing challenges and suggesting directions for future research.
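The "rigid rule-based digital signal processing techniques" end of that progression can be made concrete with a toy sketch (not taken from the paper): linearly warping the frequency axis of a short-time magnitude spectrum shifts formant-like peaks, a crude stand-in for classical formant adjustment. The frame length, sample rate, warp factor, and the two-sinusoid "vowel" are all illustrative choices.

```python
import numpy as np

def warp_spectrum(frame: np.ndarray, alpha: float) -> np.ndarray:
    """Linearly warp the frequency axis of one analysis frame by `alpha`.

    alpha > 1 pushes spectral peaks (formant-like energy) upward;
    alpha < 1 lowers them. Phase is reused from the original frame,
    so this is analysis-resynthesis in its simplest possible form.
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    bins = np.arange(len(spec))
    # Sample the magnitude envelope at compressed positions: the energy
    # that sat at frequency f reappears near f * alpha.
    warped_mag = np.interp(bins / alpha, bins, mag, left=mag[0], right=0.0)
    return np.fft.irfft(warped_mag * np.exp(1j * phase), n=len(frame))

# Synthetic "vowel": two sinusoids standing in for formants at 700 and 1200 Hz.
sr, n = 16000, 1024
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
shifted = warp_spectrum(frame, alpha=1.1)  # raise peaks by ~10%

peak_hz = np.argmax(np.abs(np.fft.rfft(shifted))) * sr / n
print(peak_hz)  # dominant peak is now near 700 * 1.1 Hz
```

This is exactly the kind of fixed, hand-tuned transformation the survey contrasts with neural systems: the warp applies uniformly to all speakers and accents, which is why such methods struggle to disentangle accent from identity.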

What carries the argument

The problem-driven linkage between sociolinguistic constraints on the accent-identity balance and technical constraints on data alignment, representation disentanglement, and resource scarcity; this linkage explains the field's methodological shifts.

Load-bearing premise

That the chosen literature sufficiently illustrates the field's response to the identified challenges and that application requirements create distinct, identifiable constraints on the accent-identity trade-off.

What would settle it

A new comprehensive review or empirical analysis revealing that technical challenges like data alignment have not primarily driven methodological changes, or that application constraints do not systematically vary as described.

Original abstract

Accent conversion has rapidly progressed alongside growing interest in improving global cross-cultural communication. This survey presents an overview of the evolution of accent conversion methodologies, analyzing how the field has developed in response to fundamental challenges related to data alignment, representation disentanglement, and resource scarcity. We trace the progression from early rule-based digital signal processing approaches such as spectral manipulation and formant-based analysis to modern neural architectures capable of flexible and reference-free accent transformation. In addition, the survey situates accent conversion within its linguistic foundations and examines how different application requirements impose varying constraints on the balance between accent modification and speaker identity preservation. Finally, it reviews commonly used speech datasets and evaluation methodologies, identifies persistent challenges, and outlines directions for future research aimed at achieving more controllable and perceptually consistent accent conversion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript is a survey paper on accent conversion. It claims to trace the field's evolution from early rule-based digital signal processing methods (spectral manipulation, formant-based analysis) to modern neural architectures for flexible, reference-free accent transformation. The development is framed as a direct response to core challenges in data alignment, representation disentanglement, and resource scarcity. The paper situates the topic in linguistic foundations, analyzes how application requirements create varying constraints on accent modification versus speaker identity preservation, reviews speech datasets and evaluation methodologies, identifies persistent challenges, and outlines future research directions for more controllable and perceptually consistent systems.

Significance. If the literature selection proves representative and the causal links between the identified challenges and methodological shifts are substantiated by the reviewed works, the survey would provide a useful synthesis bridging technical speech processing with sociolinguistic considerations. It could help researchers understand trade-offs in identity preservation during accent conversion and support more standardized benchmarking through its dataset and evaluation review, potentially guiding targeted advances in controllable systems.

major comments (2)
  1. [Introduction] The central problem-driven claim states that accent conversion methodologies evolved explicitly in response to data alignment, representation disentanglement, and resource scarcity, with the progression from rule-based to neural methods presented as evidence of this response. However, no literature search protocol, databases, keywords, inclusion/exclusion criteria, or coverage statistics (e.g., papers per category or era) are provided. This absence is load-bearing for the narrative, as it prevents verification that the selected works form a representative sample demonstrating the claimed causal progression rather than post-hoc selection.
  2. [Technical Approaches] Evolution tracing: The assertion that application requirements impose identifiable varying constraints on accent modification versus speaker identity preservation relies on the reviewed papers illustrating these trade-offs. Without explicit justification for why certain voice conversion or sociolinguistic works were included or omitted, the analysis of constraints risks being incomplete or non-generalizable.
minor comments (1)
  1. [Abstract] The term 'reference-free' is used to describe modern neural architectures but is not defined or contrasted with reference-based methods at first mention; adding a short clarification would improve readability for a broad audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our survey on accent conversion. The comments correctly identify the need for greater transparency in how the literature was assembled to support the problem-driven narrative. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Introduction] The central problem-driven claim states that accent conversion methodologies evolved explicitly in response to data alignment, representation disentanglement, and resource scarcity, with the progression from rule-based to neural methods presented as evidence of this response. However, no literature search protocol, databases, keywords, inclusion/exclusion criteria, or coverage statistics (e.g., papers per category or era) are provided. This absence is load-bearing for the narrative, as it prevents verification that the selected works form a representative sample demonstrating the claimed causal progression rather than post-hoc selection.

    Authors: We agree that the absence of an explicit literature selection protocol weakens the verifiability of the claimed causal progression. Although the survey was structured as a narrative review organized around the core technical challenges rather than a formal systematic review, we recognize that documenting the search process is necessary to substantiate representativeness. In the revised manuscript we will add a dedicated subsection (tentatively titled 'Literature Selection and Scope') immediately after the abstract or at the start of the Introduction. This subsection will specify the databases consulted (Google Scholar, arXiv, IEEE Xplore, ACL Anthology), the primary keywords and Boolean combinations employed ('accent conversion' OR 'accent modification' AND ('voice conversion' OR 'prosody transfer' OR 'speaker disentanglement')), the time window (2000–2024), inclusion criteria (peer-reviewed works that directly address accent transformation rather than generic voice conversion), and exclusion criteria (non-English publications, purely theoretical linguistics papers without technical implementation). We will also report approximate coverage statistics (e.g., number of papers per decade and per methodological category) to allow readers to evaluate the sample. These additions will directly support the problem-driven framing without altering the existing analysis. revision: yes
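The promised selection protocol can be sketched as a filter over bibliographic records. The Boolean keyword rule, the 2000–2024 window, and the peer-review inclusion criterion are taken from the rebuttal above; the record format, the sample entries, and the reading of the Boolean expression (a hit on any primary or any related term) are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """Hypothetical bibliographic record; fields are illustrative."""
    title: str
    year: int
    peer_reviewed: bool

# Keyword groups quoted in the rebuttal's proposed protocol.
PRIMARY = ("accent conversion", "accent modification")
RELATED = ("voice conversion", "prosody transfer", "speaker disentanglement")

def in_scope(r: Record) -> bool:
    """One plausible reading of the stated inclusion rule."""
    text = r.title.lower()
    keyword_hit = any(k in text for k in PRIMARY) or any(k in text for k in RELATED)
    return keyword_hit and 2000 <= r.year <= 2024 and r.peer_reviewed

# Invented sample corpus to exercise the filter.
corpus = [
    Record("Zero-shot accent conversion with minimum supervision", 2024, True),
    Record("Formant synthesis of vowels", 1987, True),            # outside window
    Record("A theory of sociophonetic variation", 2010, False),   # excluded: no implementation
]
selected = [r.title for r in corpus if in_scope(r)]
print(selected)  # only the 2024 accent-conversion paper survives
```

Documenting even a filter this simple, together with per-decade counts of what it admits, would let readers audit the representativeness claim the referee questions.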

  2. Referee: [Technical Approaches] Evolution tracing: The assertion that application requirements impose identifiable varying constraints on accent modification versus speaker identity preservation relies on the reviewed papers illustrating these trade-offs. Without explicit justification for why certain voice conversion or sociolinguistic works were included or omitted, the analysis of constraints risks being incomplete or non-generalizable.

    Authors: We acknowledge that the current text does not explicitly justify the inclusion or omission of particular works when discussing application-driven constraints. The papers cited were chosen because they exemplify concrete trade-offs (for instance, early DSP methods' limited disentanglement versus neural models' improved identity preservation), yet the rationale is implicit. To address this, we will expand the opening paragraphs of the Technical Approaches section with a short justification paragraph and, where space permits, a supplementary table listing representative papers alongside the specific constraint each illustrates (e.g., identity leakage in rule-based formant shifting versus controllable disentanglement in reference-free neural systems). We will also add a brief limitations paragraph noting that the reviewed literature is skewed toward English and high-resource languages, which may limit generalizability to other linguistic contexts. These revisions will make the constraint analysis more transparent and defensible while preserving the original technical narrative. revision: yes

Circularity Check

0 steps flagged

Survey narrative rests on external literature with no internal derivations or self-referential reductions

full rationale

This is a survey paper that traces the historical progression of accent conversion methods by summarizing external published works. No original equations, fitted parameters, predictions, or mathematical derivations appear in the abstract or described structure. The central framing—that methodological evolution responded to challenges like data alignment and disentanglement—is presented as an interpretive overview of cited literature rather than a self-contained claim that reduces to the paper's own selection criteria or prior self-citations. No self-definitional loops, fitted-input predictions, or load-bearing uniqueness theorems from the authors are invoked. The analysis is therefore self-contained against external benchmarks and receives the default non-circularity finding for literature reviews.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper whose central claims rest on the completeness and representativeness of the reviewed literature rather than new axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5434 in / 1120 out tokens · 36539 ms · 2026-05-07T08:11:43.552482+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    Peter Trudgill. Accent. In Keith Brown, editor, Encyclopedia of Language and Linguistics, page 14. Elsevier, second edition, 2006. URL https://www.sciencedirect.com/science/article/pii/B0080448542015066. [Online]

  2. [3]

    Mumin Jin, Prashant Serai, Jilong Wu, Andros Tjandra, Vimal Manohar, and Qing He. Voice-preserving zero-shot multiple accent conversion. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. doi: 10.1109/ICASSP49357.2023.10094737

  3. [4]

    Anna Lorenzoni, Rita Faccio, and Eduardo Navarrete. Does foreign-accented speech affect credibility? Evidence from the illusory-truth paradigm. Journal of Cognition, 7(1):26, 2024. doi: 10.5334/joc.353

  4. [5]

    Vladimir Nechaev and Sergey Kosyakov. Non-autoregressive real-time accent conversion model with voice cloning, 2024. URL https://arxiv.org/abs/2405.13162

  5. [6]

    Aayam Shrestha, Seyed Ghorshi, and I. Panahi. An overview of accent conversion techniques from statistical and deep learning: Its advantages and limitations. Circuits, Systems, and Signal Processing, 2025. URL https://api.semanticscholar.org/CorpusID:282338789

  6. [7]

    Sabyasachi Chandra, Puja Bharati, Satya Prasad Gaddamedi, Debolina Pramanik, and Shyamal Mandal. Recent Advancement in Accent Conversion Using Deep Learning Techniques: A Comprehensive Review, pages 61–73. Springer, 06 2024. ISBN 978-981-97-1548-0. doi: 10.1007/978-981-97-1549-7_5

  7. [8]

    Rajend Mesthrie. Society and language: Overview. In Keith Brown, editor, Encyclopedia of Language & Linguistics (Second Edition), pages 472–484. Elsevier, Oxford, second edition, 2006. ISBN 978-0-08-044854-1. doi: 10.1016/B0-08-044854-2/01310-9. URL https://www.sciencedirect.com/science/article/pii/B0080448542013109

  8. [9]

    Ronald Wardhaugh and Janet M. Fuller. An Introduction to Sociolinguistics. John Wiley & Sons, 2021

  9. [10]

    Howard Giles and Andrew C. Billings. Assessing Language Attitudes: Speaker Evaluation Studies, chapter 7, pages 187–209. John Wiley & Sons, Ltd, 2004. ISBN 9780470757000. doi: 10.1002/9780470757000.ch7. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470757000.ch7

  10. [11]

    Jairo Fuertes, William Gottdiener, Helena Martin, Tracey Gilbert, and Howard Giles. A meta-analysis of the effects of speakers' accents on interpersonal evaluations. European Journal of Social Psychology, 42:120–133, 02 2012. doi: 10.1002/ejsp.862

  11. [12]

    Elena Schoonmaker-Gates. Measuring foreign accent in Spanish: How much does VOT really matter? In Selected Proceedings of the 6th Conference on Laboratory Approaches to Romance Phonology, 2015. URL https://api.semanticscholar.org/CorpusID:55192135

  12. [13]

    Ting Yan Rachel Kan. Suprasegmental and prosodic features contributing to perceived accent in heritage Cantonese. In Proc. 10th International Conference on Speech Prosody, May 2020

  13. [14]

    John Laver. The Phonetic Description of Voice Quality. Cambridge University Press, Cambridge, England; New York, 1980. ISBN 0521231760. URL https://nla.gov.au/nla.cat-vn553115. Accessed: 21 October 2025

  14. [15]

    Jody Kreiman and Diana Van Lancker Sidtis. Foundations of Voice Studies: An Interdisciplinary Approach to Voice Production and Perception. Wiley-Blackwell, 2011. ISBN 9780631222972. doi: 10.1002/9781444395068

  15. [16]

    Nadine Lavan, A. Mike Burton, Sophie K. Scott, and Carolyn McGettigan. Flexible voices: Identity perception from variable vocal signals. Psychonomic Bulletin & Review, 26(1):90–102, 2019. ISSN 1531-5320. doi: 10.3758/s13423-018-1497-7. URL https://doi.org/10.3758/s13423-018-1497-7

  16. [17]

    Daniel Felps, Heather Bortfeld, and Ricardo Gutierrez-Osuna. Foreign accent conversion in computer assisted pronunciation training. Speech Communication, 51(10):920–932, 2009. ISSN 0167-6393. doi: 10.1016/j.specom.2008.11.004. URL https://www.sciencedirect.com/science/article/pii/S0167639308001763. Spoken Language Technology for Education

  17. [18]

    Anna Sundström. Automatic prosody modification as a means for foreign language pronunciation training. In ETRW on Speech Technology in Language Learning (STiLL), pages 49–52, 1998

  18. [19]

    Keiko Nagano and Kazunori Ozawa. English speech training using voice conversion. In Proc. First International Conference on Spoken Language Processing (ICSLP 1990), pages 1169–1172, 1990. doi: 10.21437/ICSLP.1990-309

  19. [20]

    Murray J. Munro and Tracey M. Derwing. Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45(1):73–97, 1995. doi: 10.1111/j.1467-1770.1995.tb00963.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-1770.1995.tb00963.x

  20. [21]

    Donald L. Rubin and Kim A. Smith. Effects of accent, ethnicity, and lecture topic on undergraduates' perceptions of nonnative English-speaking teaching assistants. International Journal of Intercultural Relations, 14:337–353, 1990. URL https://api.semanticscholar.org/CorpusID:144548326

  21. [22]

    Kacper Radzikowski, Le Wang, Osamu Yoshie, and Robert Nowak. Accent modification for speech recognition of non-native speakers using neural style transfer. EURASIP Journal on Audio, Speech, and Music Processing, 2021(1):11, 2021. doi: 10.1186/s13636-021-00199-3. URL https://doi.org/10.1186/s13636-021-00199-3

  22. [23]

    Takeshi Nishida. Promoting intercultural awareness through native-to-foreign speech accent conversion. In Proceedings of the 5th ACM International Conference on Collaboration across Boundaries: Culture, Distance & Technology, CABS '14, pages 83–86, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450325578. doi: 10.1145/2631488.263405...

  23. [24]

    Jose Gonzalez Lopez, Antonio Rubia, Alfredo de Haro, Angel Gomez, and Antonio Peinado. Voices from the south: Study and synthesis of Andalusian accents with artificial intelligence. In Proc. IberSPEECH 2024, pages 285–288, 11 2024. doi: 10.21437/IberSPEECH.2024-59

  24. [25]

    Zhijun Jia, Huaying Xue, Xiulian Peng, and Yan Lu. Convert and speak: Zero-shot accent conversion with minimum supervision. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 4446–4454. ACM, October 2024. doi: 10.1145/3664647.3681539. URL http://dx.doi.org/10.1145/3664647.3681539

  25. [26]

    Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, and Lei Xie. Accent-VITS: Accent transfer for end-to-end TTS. In Man-Machine Speech Communication – 18th National Conference, NCMMSC 2023, pages 203–214. Springer, 2024. doi: 10.1007/978-981-97-0601-3_17

  26. [27]

    Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuan-Jui Chen, Mingbo Ma, Yuping Wang, and Yuxuan Wang. Zero-shot accent conversion using pseudo siamese disentanglement network. In Interspeech, 2022. URL https://api.semanticscholar.org/CorpusID:254564036

  27. [28]

    Jinzuomu Zhong, Korin Richmond, Zhiba Su, and Siqi Sun. AccentBox: Towards high-fidelity zero-shot accent generation, 09 2024

  28. [29]

    Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo. StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 266–273, 2018. doi: 10.1109/SLT.2018.8639535

  29. [30]

    Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson. AutoVC: Zero-shot voice style transfer with only autoencoder loss. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5210–5219. PMLR, 09–15 Ju...

  30. [31]

    Zhenchuan Yang, Weibin Zhang, Yufei Liu, and Xiaofen Xing. Cross-lingual voice conversion with disentangled universal linguistic representations. In Proc. Interspeech 2021, pages 1604–1608, 08 2021. doi: 10.21437/Interspeech.2021-552

  31. [32]

    Wei-Ning Hsu, Yu Zhang, Ron Weiss, Yu-An Chung, Yonghui Wu, and James Glass. Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In Proc. IEEE ICASSP 2019, pages 5901–5905, 05 2019. doi: 10.1109/ICASSP.2019.8683561

  32. [33]

    Xiangyu An, Frank K. Soong, and Lei Xie. Improving performance of seen and unseen speech style transfer in end-to-end neural TTS. In Proc. Interspeech, pages 4688–4692, 2021. doi: 10.21437/Interspeech.2021-1407

  33. [34]

    Soumya Dutta and Sriram Ganapathy. Zero shot audio to audio emotion transfer with speaker disentanglement. In ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024. doi: 10.1109/ICASSP48485.2024.10445962

  34. [35]

    Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, and Jing Xiao. QI-TTS: Questioning intonation control for emotional speech synthesis. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. URL https://api.semanticscholar.org/CorpusID:257504874

  35. [36]

    Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, and Haizhou Li. Accented text-to-speech synthesis with limited data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:1699–1711, 2024. doi: 10.1109/TASLP.2024.3363414

  36. [37]

    Tuan-Nam Nguyen, Ngoc-Quan Pham, and Alexander Waibel. Syntacc: Synthesizing multi-accent speech by weight factorization. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. doi: 10.1109/ICASSP49357.2023.10096431

  37. [38]

    Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, and Haizhou Li. Multi-scale accent modeling and disentangling for multi-speaker multi-accent text-to-speech synthesis, 2025. URL https://arxiv.org/abs/2406.10844

  38. [39]

    Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In Proc. Interspeech 2019, pages 2350–2354, 2019. doi: 10.21437/Interspeech.2019-2219

  39. [41]

    Mark Huckvale. ACCDIST: a metric for comparing speakers' accents. In Interspeech 2004, pages 29–32, 2004. doi: 10.21437/Interspeech.2004-29

  40. [42]

    ITU-T. Methods for subjective determination of transmission quality. Technical Report Recommendation P.800, ITU-T, 1996. Introduces Mean Opinion Score (MOS) on a 5-point scale

  41. [43]

    ITU-R. Method for the subjective assessment of intermediate quality level of audio systems (MUSHRA). Technical Report Recommendation BS.1534-1, ITU-R, 2003. Introduces the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA)

  42. [44]

    Tuan Nam Nguyen, Seymanur Akti, Ngoc Quan Pham, and Alexander Waibel. Improving pronunciation and accent conversion through knowledge distillation and synthetic ground-truth from native TTS. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. doi: 10.1109/ICASSP49660.2025.10890229

  43. [45]

    Jacob Kahn, Morgane Rivière, Wenming Zheng, Eugene Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fügen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, and Emmanuel Dupoux. Libri-Light: A Benchmark for ASR with Limited or No Supervision. In ICASSP 2020 - 45th IE...

  44. [46]

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964

  45. [47]

    Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proc. Interspeech 2019, pages 1526–1530, 2019. doi: 10.21437/Interspeech.2019-2441

  46. [48]

    Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, and Ankur Bapna. LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus. In Proc. Interspeech 2023, pages 5496–5500, 2023. doi: 10.21437/Interspeech.2023-1584

  47. [49]

    Keith Ito and Linda Johnson. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017

  48. [50]

    Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92), 2019

  49. [51]

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), pages 4218–4222, Marseille, France, 2020. European Language Reso...

  50. [52]

    Steven Weinberger. Speech Accent Archive. https://accent.gmu.edu, 2015

  51. [53]

    John Kominek and Alan W. Black. The CMU Arctic speech databases. In Proc. 5th ISCA Workshop on Speech Synthesis (SSW5), pages 223–224, Pittsburgh, PA, USA, 2004. ISCA. URL https://www.isca-archive.org/ssw_2004/kominek04b_ssw.html

  52. [54]

    Guanlong Zhao, Sinem Sonsaat, Alif Silpachai, Ivana Lucic, Evgeny Chukharev-Hudilainen, John Levis, and Ricardo Gutierrez-Osuna. L2-ARCTIC: A Non-native English Speech Corpus. In Proc. Interspeech 2018, pages 2783–2787, 2018. doi: 10.21437/Interspeech.2018-1110

  53. [55]

    Afroz Ahamad, Ankit Anand, and Pranesh Bhargava. AccentDB: A Database of Non-Native English Accents to Assist Neural Speech Recognition. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), pages 5351–5358, Marseille, France,

  54. [56]

    European Language Resources Association. URL https://aclanthology.org/2020.lrec-1.659

  55. [57]

    N. Minematsu, Y. Tomiyama, K. Yoshimoto, K. Shimizu, S. Nakagawa, M. Dantsuji, and S. Makino. English speech database read by Japanese learners for CALL system development. In Manuel González Rodríguez and Carmen Paz Suarez Araujo, editors, Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands - Spain, Ma...

  56. [58]

    Qin Yan, Saeed Vaseghi, Dimitrios Rentzos, and Ching-Hsiang Ho. Analysis and synthesis of formant spaces of British, Australian, and American accents. IEEE Transactions on Audio, Speech, and Language Processing, 15(2):676–689, 2007. doi: 10.1109/TASL.2006.885923

  57. [59]

    Daniel Felps, Christian Geng, and Ricardo Gutierrez-Osuna. Foreign accent conversion through concatenative synthesis in the articulatory domain. IEEE Transactions on Audio, Speech, and Language Processing, 20(8):2301–2312, 2012. doi: 10.1109/TASL.2012.2201474

  58. [60]

    Foreign-language speech synthesis

    Nick Campbell. Foreign-language speech synthesis. In Proceedings of the Third ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, Jenolan Caves House, Blue Mountains, NSW, Australia, 1998. ISCA. URL http://www.isca-speech.org/archive. Pages not available.

  59. [61]

    Spoken language conversion with accent morphing

    Mark Huckvale and Kayoko Yanagisawa. Spoken language conversion with accent morphing. In Proceedings of the 6th ISCA Speech Synthesis Workshop, Bonn, Germany, 2007. ISCA.

  60. [62]

    Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication, 50(3):215–227, 2008

    Tomoki Toda, Alan W. Black, and Keiichi Tokuda. Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication, 50(3):215–227, 2008. doi: 10.1016/j.specom.2007.09.001. URL https://www.sciencedirect.com/science/article/pii/S0167639307001495.

  61. [63]

    Can voice conversion be used to reduce non-native accents? In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7879–7883, 2014

    Sandesh Aryal and Ricardo Gutierrez-Osuna. Can voice conversion be used to reduce non-native accents? In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7879–7883, 2014. doi: 10.1109/ICASSP.2014.6855134.

  62. [64]

    Accent conversion using phonetic posteriorgrams

    Guanlong Zhao, Sinem Sonsaat, John Levis, Evgeny Chukharev-Hudilainen, and Ricardo Gutierrez-Osuna. Accent conversion using phonetic posteriorgrams. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5314–5318, 2018. doi: 10.1109/ICASSP.2018.8462258.

  63. [65]

    Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning. Computer Speech & Language, 72:101302, 2022

    Shaojin Ding, Guanlong Zhao, and Ricardo Gutierrez-Osuna. Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning. Computer Speech & Language, 72:101302, 2022. ISSN 0885-2308. doi: 10.1016/j.csl.2021.101302. URL https://www.sciencedirect.com/science/article/pii/S0885230821001029.

  64. [66]

    End-to-end accent conversion without using native utterances

    Songxiang Liu, Disong Wang, Yuewen Cao, Lifa Sun, Xixin Wu, Shiyin Kang, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, and Helen Meng. End-to-end accent conversion without using native utterances. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6289–6293, 2020. doi: 10.1109/ICASSP40776.2020.9053797.

  65. [67]

    Accent modeling of low-resourced dialect in pitch accent language using variational autoencoder

    Kazuya Yufune, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Accent modeling of low-resourced dialect in pitch accent language using variational autoencoder. In 11th ISCA Speech Synthesis Workshop (SSW 11), pages 189–194, 2021. doi: 10.21437/SSW.2021-33.

  66. [68]

    MacST: Multi-accent speech synthesis via text transliteration for accent conversion

    Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, and Haizhou Li. MacST: Multi-accent speech synthesis via text transliteration for accent conversion. In ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025. doi: 10.1109/ICASSP49357.2025.10888195.

  67. [69]

    Accent conversion using discrete units with parallel data synthesized from controllable accented TTS

    Tuan Nam Nguyen, Ngoc Quan Pham, and Alexander Waibel. Accent conversion using discrete units with parallel data synthesized from controllable accented TTS. In Proc. Workshop on Synthetic Data’s Transformative Role in Foundational Speech Models (SynData4GenAI 2024), pages 51–55, 2024. URL https://www.isca-archive.org/syndata4genai_2024/nguyen24_syndata4genai.html.

  68. [70]

    Accent conversion with articulatory representations

    Yashish M. Siriwardena, Nathan Swedlow, Audrey Howard, Evan Gitterman, Dan Darcy, Carol Espy-Wilson, and Andrea Fanelli. Accent conversion with articulatory representations. In Proc. Interspeech 2024, 2024. doi: 10.21437/Interspeech.2024-1416.

  69. [71]

    Training text-to-speech systems from synthetic data: A practical approach for accent transfer tasks

    Lev Finkelstein, Heiga Zen, Norman Casagrande, Chunan Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu, and Rob Clark. Training text-to-speech systems from synthetic data: A practical approach for accent transfer tasks. In Proc. Interspeech 2022, 2022. doi: 10.21437/Interspeech.2022-10115.

  70. [72]

    Cross-dialect text-to-speech in pitch-accent language incorporating multi-dialect phoneme-level BERT, September 2024

    Kazuki Yamauchi, Yuki Saito, and Hiroshi Saruwatari. Cross-dialect text-to-speech in pitch-accent language incorporating multi-dialect phoneme-level BERT, September 2024.

  71. [73]

    Remap, warp and attend: Non-parallel many-to-many accent conversion with normalizing flows

    Abdelhamid Ezzerg, Thomas Merritt, Kayoko Yanagisawa, Piotr Bilinski, Magdalena Proszewska, Kamil Pokora, Renard Korzeniowski, Roberto Barra-Chicote, and Daniel Korzekwa. Remap, warp and attend: Non-parallel many-to-many accent conversion with normalizing flows. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 984–990, 2023. doi: 10.1109/SLT54...

  72. [74]

    DART: Disentanglement of accent and speaker representation in multispeaker text-to-speech, October 2024

    Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, and Dorien Herremans. DART: Disentanglement of accent and speaker representation in multispeaker text-to-speech, October 2024.

  73. [75]

    FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

    Yurii Halychanskyi, Cameron Churchwell, Yutong Wen, and Volodymyr Kindratenko. FAC-FACodec: Controllable zero-shot foreign accent conversion with factorized speech codec, 2026. URL https://arxiv.org/abs/2510.10785.

  74. [76]

    Controllable accent normalization via discrete diffusion, 2026

    Qibing Bai, Yuhan Du, Tom Ko, Shuai Wang, Yannan Wang, and Haizhou Li. Controllable accent normalization via discrete diffusion, 2026. URL https://arxiv.org/abs/2603.14275.