pith. machine review for the scientific record.

arxiv: 2604.27281 · v1 · submitted 2026-04-30 · 💻 cs.SD

Recognition: unknown

Accent Conversion: A Problem-Driven Survey of Sociolinguistic and Technical Constraints

Authors on Pith: no claims yet

Pith reviewed 2026-05-07 08:11 UTC · model grok-4.3

classification 💻 cs.SD
keywords accent conversion · voice conversion · neural speech synthesis · sociolinguistic constraints · data alignment · speaker identity preservation · speech evaluation · accent modification

The pith

Accent conversion methods have progressed from rule-based signal processing to neural architectures in response to persistent problems of data alignment, feature disentanglement, and data scarcity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper surveys how accent conversion techniques evolved in response to persistent technical hurdles. It begins with early methods that relied on spectral manipulation and formant analysis, then moves to neural systems that permit more flexible transformations. A key focus is the linguistic context: real-world applications place different demands on how strongly an accent is changed versus how faithfully the original speaker's voice is kept. Readers care because improved systems could aid cross-cultural communication without erasing personal vocal identity. The survey also covers available datasets, evaluation practice, open problems, and directions for future research.

Core claim

This survey establishes that accent conversion has developed through addressing fundamental problems of aligning training data, disentangling accent-related features from speaker identity, and managing limited data resources. The progression moves from rigid rule-based digital signal processing techniques, such as spectral and formant adjustments, to flexible neural architectures that support reference-free conversions. Linguistic foundations inform the process, and application contexts dictate the acceptable trade-off between accent modification strength and identity preservation. The review catalogs speech datasets and evaluation practices while pinpointing ongoing challenges and suggesting directions for future research.
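The "rigid rule-based digital signal processing techniques" end of that progression can be made concrete with a toy sketch (not taken from the paper): linearly warping the frequency axis of a short-time magnitude spectrum shifts formant-like peaks, a crude stand-in for classical formant adjustment. The frame length, sample rate, warp factor, and the two-sinusoid "vowel" are all illustrative choices.

```python
import numpy as np

def warp_spectrum(frame: np.ndarray, alpha: float) -> np.ndarray:
    """Linearly warp the frequency axis of one analysis frame by `alpha`.

    alpha > 1 pushes spectral peaks (formant-like energy) upward;
    alpha < 1 lowers them. Phase is reused from the original frame,
    so this is analysis-resynthesis in its simplest possible form.
    """
    spec = np.fft.rfft(frame)
    mag, phase = np.abs(spec), np.angle(spec)
    bins = np.arange(len(spec))
    # Sample the magnitude envelope at compressed positions: the energy
    # that sat at frequency f reappears near f * alpha.
    warped_mag = np.interp(bins / alpha, bins, mag, left=mag[0], right=0.0)
    return np.fft.irfft(warped_mag * np.exp(1j * phase), n=len(frame))

# Synthetic "vowel": two sinusoids standing in for formants at 700 and 1200 Hz.
sr, n = 16000, 1024
t = np.arange(n) / sr
frame = np.sin(2 * np.pi * 700 * t) + 0.5 * np.sin(2 * np.pi * 1200 * t)
shifted = warp_spectrum(frame, alpha=1.1)  # raise peaks by ~10%

peak_hz = np.argmax(np.abs(np.fft.rfft(shifted))) * sr / n
print(peak_hz)  # dominant peak is now near 700 * 1.1 Hz
```

This is exactly the kind of fixed, hand-tuned transformation the survey contrasts with neural systems: the warp applies uniformly to all speakers and accents, which is why such methods struggle to disentangle accent from identity.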

What carries the argument

The problem-driven linkage between sociolinguistic constraints on the accent-identity balance and technical constraints on data alignment, representation disentanglement, and resource scarcity; this linkage explains the field's methodological shifts.

Load-bearing premise

That the chosen literature sufficiently illustrates the field's response to the identified challenges and that application requirements create distinct, identifiable constraints on the accent-identity trade-off.

What would settle it

A new comprehensive review or empirical analysis revealing that technical challenges like data alignment have not primarily driven methodological changes, or that application constraints do not systematically vary as described.

Original abstract

Accent conversion has rapidly progressed alongside growing interest in improving global cross-cultural communication. This survey presents an overview of the evolution of accent conversion methodologies, analyzing how the field has developed in response to fundamental challenges related to data alignment, representation disentanglement, and resource scarcity. We trace the progression from early rule-based digital signal processing approaches such as spectral manipulation and formant-based analysis to modern neural architectures capable of flexible and reference-free accent transformation. In addition, the survey situates accent conversion within its linguistic foundations and examines how different application requirements impose varying constraints on the balance between accent modification and speaker identity preservation. Finally, it reviews commonly used speech datasets and evaluation methodologies, identifies persistent challenges, and outlines directions for future research aimed at achieving more controllable and perceptually consistent accent conversion.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript is a survey paper on accent conversion. It claims to trace the field's evolution from early rule-based digital signal processing methods (spectral manipulation, formant-based analysis) to modern neural architectures for flexible, reference-free accent transformation. The development is framed as a direct response to core challenges in data alignment, representation disentanglement, and resource scarcity. The paper situates the topic in linguistic foundations, analyzes how application requirements create varying constraints on accent modification versus speaker identity preservation, reviews speech datasets and evaluation methodologies, identifies persistent challenges, and outlines future research directions for more controllable and perceptually consistent systems.

Significance. If the literature selection proves representative and the causal links between the identified challenges and methodological shifts are substantiated by the reviewed works, the survey would provide a useful synthesis bridging technical speech processing with sociolinguistic considerations. It could help researchers understand trade-offs in identity preservation during accent conversion and support more standardized benchmarking through its dataset and evaluation review, potentially guiding targeted advances in controllable systems.

major comments (2)
  1. [Introduction] The central problem-driven claim states that accent conversion methodologies evolved explicitly in response to data alignment, representation disentanglement, and resource scarcity, with the progression from rule-based to neural methods presented as evidence of this response. However, no literature search protocol, databases, keywords, inclusion/exclusion criteria, or coverage statistics (e.g., papers per category or era) are provided. This absence is load-bearing for the narrative, as it prevents verification that the selected works form a representative sample demonstrating the claimed causal progression rather than post-hoc selection.
  2. [Technical Approaches] Evolution tracing: The assertion that application requirements impose identifiable varying constraints on accent modification versus speaker identity preservation relies on the reviewed papers illustrating these trade-offs. Without explicit justification for why certain voice conversion or sociolinguistic works were included or omitted, the analysis of constraints risks being incomplete or non-generalizable.
minor comments (1)
  1. [Abstract] The term 'reference-free' is used to describe modern neural architectures but is not defined or contrasted with reference-based methods at first mention; adding a short clarification would improve readability for a broad audience.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review of our survey on accent conversion. The comments correctly identify the need for greater transparency in how the literature was assembled to support the problem-driven narrative. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

Point-by-point responses
  1. Referee: [Introduction] The central problem-driven claim states that accent conversion methodologies evolved explicitly in response to data alignment, representation disentanglement, and resource scarcity, with the progression from rule-based to neural methods presented as evidence of this response. However, no literature search protocol, databases, keywords, inclusion/exclusion criteria, or coverage statistics (e.g., papers per category or era) are provided. This absence is load-bearing for the narrative, as it prevents verification that the selected works form a representative sample demonstrating the claimed causal progression rather than post-hoc selection.

    Authors: We agree that the absence of an explicit literature selection protocol weakens the verifiability of the claimed causal progression. Although the survey was structured as a narrative review organized around the core technical challenges rather than a formal systematic review, we recognize that documenting the search process is necessary to substantiate representativeness. In the revised manuscript we will add a dedicated subsection (tentatively titled 'Literature Selection and Scope') immediately after the abstract or at the start of the Introduction. This subsection will specify the databases consulted (Google Scholar, arXiv, IEEE Xplore, ACL Anthology), the primary keywords and Boolean combinations employed ('accent conversion' OR 'accent modification' AND ('voice conversion' OR 'prosody transfer' OR 'speaker disentanglement')), the time window (2000–2024), inclusion criteria (peer-reviewed works that directly address accent transformation rather than generic voice conversion), and exclusion criteria (non-English publications, purely theoretical linguistics papers without technical implementation). We will also report approximate coverage statistics (e.g., number of papers per decade and per methodological category) to allow readers to evaluate the sample. These additions will directly support the problem-driven framing without altering the existing analysis. revision: yes
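The promised selection protocol can be sketched as a filter over bibliographic records. The Boolean keyword rule, the 2000–2024 window, and the peer-review inclusion criterion are taken from the rebuttal above; the record format, the sample entries, and the reading of the Boolean expression (a hit on any primary or any related term) are invented for illustration only.

```python
from dataclasses import dataclass

@dataclass
class Record:
    """Hypothetical bibliographic record; fields are illustrative."""
    title: str
    year: int
    peer_reviewed: bool

# Keyword groups quoted in the rebuttal's proposed protocol.
PRIMARY = ("accent conversion", "accent modification")
RELATED = ("voice conversion", "prosody transfer", "speaker disentanglement")

def in_scope(r: Record) -> bool:
    """One plausible reading of the stated inclusion rule."""
    text = r.title.lower()
    keyword_hit = any(k in text for k in PRIMARY) or any(k in text for k in RELATED)
    return keyword_hit and 2000 <= r.year <= 2024 and r.peer_reviewed

# Invented sample corpus to exercise the filter.
corpus = [
    Record("Zero-shot accent conversion with minimum supervision", 2024, True),
    Record("Formant synthesis of vowels", 1987, True),            # outside window
    Record("A theory of sociophonetic variation", 2010, False),   # excluded: no implementation
]
selected = [r.title for r in corpus if in_scope(r)]
print(selected)  # only the 2024 accent-conversion paper survives
```

Documenting even a filter this simple, together with per-decade counts of what it admits, would let readers audit the representativeness claim the referee questions.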

  2. Referee: [Technical Approaches] Evolution tracing: The assertion that application requirements impose identifiable varying constraints on accent modification versus speaker identity preservation relies on the reviewed papers illustrating these trade-offs. Without explicit justification for why certain voice conversion or sociolinguistic works were included or omitted, the analysis of constraints risks being incomplete or non-generalizable.

    Authors: We acknowledge that the current text does not explicitly justify the inclusion or omission of particular works when discussing application-driven constraints. The papers cited were chosen because they exemplify concrete trade-offs (for instance, early DSP methods' limited disentanglement versus neural models' improved identity preservation), yet the rationale is implicit. To address this, we will expand the opening paragraphs of the Technical Approaches section with a short justification paragraph and, where space permits, a supplementary table listing representative papers alongside the specific constraint each illustrates (e.g., identity leakage in rule-based formant shifting versus controllable disentanglement in reference-free neural systems). We will also add a brief limitations paragraph noting that the reviewed literature is skewed toward English and high-resource languages, which may limit generalizability to other linguistic contexts. These revisions will make the constraint analysis more transparent and defensible while preserving the original technical narrative. revision: yes

Circularity Check

0 steps flagged

Survey narrative rests on external literature with no internal derivations or self-referential reductions

full rationale

This is a survey paper that traces the historical progression of accent conversion methods by summarizing external published works. No original equations, fitted parameters, predictions, or mathematical derivations appear in the abstract or described structure. The central framing—that methodological evolution responded to challenges like data alignment and disentanglement—is presented as an interpretive overview of cited literature rather than a self-contained claim that reduces to the paper's own selection criteria or prior self-citations. No self-definitional loops, fitted-input predictions, or load-bearing uniqueness theorems from the authors are invoked. The analysis is therefore self-contained against external benchmarks and receives the default non-circularity finding for literature reviews.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is a survey paper whose central claims rest on the completeness and representativeness of the reviewed literature rather than new axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5434 in / 1120 out tokens · 36539 ms · 2026-05-07T08:11:43.552482+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

74 extracted references · 46 canonical work pages · 1 internal anchor

  1. [1]

    Peter Trudgill. Accent. In Keith Brown, editor, Encyclopedia of Language and Linguistics, page 14. Elsevier, second edition, 2006. URL https://www.sciencedirect.com/science/article/pii/B0080448542015066. [Online]

  2. [3]

    Mumin Jin, Prashant Serai, Jilong Wu, Andros Tjandra, Vimal Manohar, and Qing He. Voice-preserving zero-shot multiple accent conversion. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. doi: 10.1109/ICASSP49357.2023.10094737

  3. [4]

    Anna Lorenzoni, Rita Faccio, and Eduardo Navarrete. Does foreign-accented speech affect credibility? Evidence from the illusory-truth paradigm. Journal of Cognition, 7(1):26, 2024. doi: 10.5334/joc.353

  4. [5]

    Vladimir Nechaev and Sergey Kosyakov. Non-autoregressive real-time accent conversion model with voice cloning, 2024. URL https://arxiv.org/abs/2405.13162

  5. [6]

    Aayam Shrestha, Seyed Ghorshi, and I. Panahi. An overview of accent conversion techniques from statistical and deep learning: Its advantages and limitations. Circuits, Systems, and Signal Processing, 2025. URL https://api.semanticscholar.org/CorpusID:282338789

  6. [7]

    Sabyasachi Chandra, Puja Bharati, Satya Prasad Gaddamedi, Debolina Pramanik, and Shyamal Mandal. Recent Advancement in Accent Conversion Using Deep Learning Techniques: A Comprehensive Review, pages 61–73. Springer, 06 2024. ISBN 978-981-97-1548-0. doi: 10.1007/978-981-97-1549-7_5

  7. [8]

    Rajend Mesthrie. Society and language: Overview. In Keith Brown, editor, Encyclopedia of Language & Linguistics (Second Edition), pages 472–484. Elsevier, Oxford, second edition, 2006. ISBN 978-0-08-044854-1. doi: 10.1016/B0-08-044854-2/01310-9. URL https://www.sciencedirect.com/science/article/pii/B0080448542013109

  8. [9]

    Ronald Wardhaugh and Janet M. Fuller. An Introduction to Sociolinguistics. John Wiley & Sons, 2021

  9. [10]

    Howard Giles and Andrew C. Billings. Assessing Language Attitudes: Speaker Evaluation Studies, chapter 7, pages 187–209. John Wiley & Sons, Ltd, 2004. ISBN 9780470757000. doi: 10.1002/9780470757000.ch7. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/9780470757000.ch7

  10. [11]

    Jairo Fuertes, William Gottdiener, Helena Martin, Tracey Gilbert, and Howard Giles. A meta-analysis of the effects of speakers' accents on interpersonal evaluations. European Journal of Social Psychology, 42:120–133, 02 2012. doi: 10.1002/ejsp.862

  11. [12]

    Elena Schoonmaker-Gates. Measuring foreign accent in Spanish: How much does VOT really matter? In Selected Proceedings of the 6th Conference on Laboratory Approaches to Romance Phonology, 2015. URL https://api.semanticscholar.org/CorpusID:55192135

  12. [13]

    Ting Yan Rachel Kan. Suprasegmental and prosodic features contributing to perceived accent in heritage Cantonese. In Proc. 10th International Conference on Speech Prosody, May 2020

  13. [14]

    John Laver. The Phonetic Description of Voice Quality. Cambridge University Press, Cambridge, England; New York, 1980. ISBN 0521231760. URL https://nla.gov.au/nla.cat-vn553115. Accessed: 21 October 2025

  14. [15]

    Jody Kreiman and Diana Van Lancker Sidtis. Foundations of Voice Studies: An Interdisciplinary Approach to Voice Production and Perception. Wiley-Blackwell, 2011. ISBN 9780631222972. doi: 10.1002/9781444395068

  15. [16]

    Nadine Lavan, A. Mike Burton, Sophie K. Scott, and Carolyn McGettigan. Flexible voices: Identity perception from variable vocal signals. Psychonomic Bulletin & Review, 26(1):90–102, 2019. ISSN 1531-5320. doi: 10.3758/s13423-018-1497-7. URL https://doi.org/10.3758/s13423-018-1497-7

  16. [17]

    Daniel Felps, Heather Bortfeld, and Ricardo Gutierrez-Osuna. Foreign accent conversion in computer assisted pronunciation training. Speech Communication, 51(10):920–932, 2009. ISSN 0167-6393. doi: 10.1016/j.specom.2008.11.004. URL https://www.sciencedirect.com/science/article/pii/S0167639308001763. Spoken Language Technology for Education

  17. [18]

    Anna Sundström. Automatic prosody modification as a means for foreign language pronunciation training. In ETRW on Speech Technology in Language Learning (STiLL), pages 49–52, 1998

  18. [19]

    Keiko Nagano and Kazunori Ozawa. English speech training using voice conversion. In Proc. First International Conference on Spoken Language Processing (ICSLP 1990), pages 1169–1172, 1990. doi: 10.21437/ICSLP.1990-309

  19. [20]

    Murray J. Munro and Tracey M. Derwing. Foreign accent, comprehensibility, and intelligibility in the speech of second language learners. Language Learning, 45(1):73–97, 1995. doi: 10.1111/j.1467-1770.1995.tb00963.x. URL https://onlinelibrary.wiley.com/doi/abs/10.1111/j.1467-1770.1995.tb00963.x

  20. [21]

    Donald L. Rubin and Kim A. Smith. Effects of accent, ethnicity, and lecture topic on undergraduates' perceptions of nonnative English-speaking teaching assistants. International Journal of Intercultural Relations, 14:337–353, 1990. URL https://api.semanticscholar.org/CorpusID:144548326

  21. [22]

    Kacper Radzikowski, Le Wang, Osamu Yoshie, and Robert Nowak. Accent modification for speech recognition of non-native speakers using neural style transfer. EURASIP Journal on Audio, Speech, and Music Processing, 2021(1):11, 2021. doi: 10.1186/s13636-021-00199-3. URL https://doi.org/10.1186/s13636-021-00199-3

  22. [23]

    Takeshi Nishida. Promoting intercultural awareness through native-to-foreign speech accent conversion. In Proceedings of the 5th ACM International Conference on Collaboration across Boundaries: Culture, Distance & Technology, CABS '14, pages 83–86, New York, NY, USA, 2014. Association for Computing Machinery. ISBN 9781450325578. doi: 10.1145/2631488.263405...

  23. [24]

    Jose Gonzalez Lopez, Antonio Rubia, Alfredo de Haro, Angel Gomez, and Antonio Peinado. Voices from the south: Study and synthesis of Andalusian accents with artificial intelligence. In Proc. IberSPEECH 2024, pages 285–288, 11 2024. doi: 10.21437/IberSPEECH.2024-59

  24. [25]

    Zhijun Jia, Huaying Xue, Xiulian Peng, and Yan Lu. Convert and speak: Zero-shot accent conversion with minimum supervision. In Proceedings of the 32nd ACM International Conference on Multimedia, pages 4446–4454. ACM, October 2024. doi: 10.1145/3664647.3681539. URL http://dx.doi.org/10.1145/3664647.3681539

  25. [26]

    Linhan Ma, Yongmao Zhang, Xinfa Zhu, Yi Lei, Ziqian Ning, Pengcheng Zhu, and Lei Xie. Accent-VITS: Accent transfer for end-to-end TTS. In Man-Machine Speech Communication – 18th National Conference, NCMMSC 2023, pages 203–214. Springer, 2024. doi: 10.1007/978-981-97-0601-3_17

  26. [27]

    Dongya Jia, Qiao Tian, Kainan Peng, Jiaxin Li, Yuan-Jui Chen, Mingbo Ma, Yuping Wang, and Yuxuan Wang. Zero-shot accent conversion using pseudo siamese disentanglement network. In Interspeech, 2022. URL https://api.semanticscholar.org/CorpusID:254564036

  27. [28]

    Jinzuomu Zhong, Korin Richmond, Zhiba Su, and Siqi Sun. AccentBox: Towards high-fidelity zero-shot accent generation, 09 2024

  28. [29]

    Hirokazu Kameoka, Takuhiro Kaneko, Kou Tanaka, and Nobukatsu Hojo. StarGAN-VC: Non-parallel many-to-many voice conversion with star generative adversarial networks. In 2018 IEEE Spoken Language Technology Workshop (SLT), pages 266–273, 2018. doi: 10.1109/SLT.2018.8639535

  29. [30]

    Kaizhi Qian, Yang Zhang, Shiyu Chang, Xuesong Yang, and Mark Hasegawa-Johnson. AutoVC: Zero-shot voice style transfer with only autoencoder loss. In Kamalika Chaudhuri and Ruslan Salakhutdinov, editors, Proceedings of the 36th International Conference on Machine Learning, volume 97 of Proceedings of Machine Learning Research, pages 5210–5219. PMLR, 09–15 Ju...

  30. [31]

    Zhenchuan Yang, Weibin Zhang, Yufei Liu, and Xiaofen Xing. Cross-lingual voice conversion with disentangled universal linguistic representations. In Proc. Interspeech 2021, pages 1604–1608, 08 2021. doi: 10.21437/Interspeech.2021-552

  31. [32]

    Wei-Ning Hsu, Yu Zhang, Ron Weiss, Yu-An Chung, Yonghui Wu, and James Glass. Disentangling correlated speaker and noise for speech synthesis via data augmentation and adversarial factorization. In Proc. IEEE ICASSP 2019, pages 5901–5905, 05 2019. doi: 10.1109/ICASSP.2019.8683561

  32. [33]

    Xiangyu An, Frank K. Soong, and Lei Xie. Improving performance of seen and unseen speech style transfer in end-to-end neural TTS. In Proc. Interspeech, pages 4688–4692, 2021. doi: 10.21437/Interspeech.2021-1407

  33. [34]

    Soumya Dutta and Sriram Ganapathy. Zero shot audio to audio emotion transfer with speaker disentanglement. In ICASSP 2024 – 2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2024. doi: 10.1109/ICASSP48485.2024.10445962

  34. [35]

    Haobin Tang, Xulong Zhang, Jianzong Wang, Ning Cheng, and Jing Xiao. QI-TTS: Questioning intonation control for emotional speech synthesis. ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. URL https://api.semanticscholar.org/CorpusID:257504874

  35. [36]

    Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, and Haizhou Li. Accented text-to-speech synthesis with limited data. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 32:1699–1711, 2024. doi: 10.1109/TASLP.2024.3363414

  36. [37]

    Tuan-Nam Nguyen, Ngoc-Quan Pham, and Alexander Waibel. Syntacc: Synthesizing multi-accent speech by weight factorization. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2023. doi: 10.1109/ICASSP49357.2023.10096431

  37. [38]

    Xuehao Zhou, Mingyang Zhang, Yi Zhou, Zhizheng Wu, and Haizhou Li. Multi-scale accent modeling and disentangling for multi-speaker multi-accent text-to-speech synthesis, 2025. URL https://arxiv.org/abs/2406.10844

  38. [39]

    Kevin Kilgour, Mauricio Zuluaga, Dominik Roblek, and Matthew Sharifi. Fréchet audio distance: A reference-free metric for evaluating music enhancement algorithms. In Proc. Interspeech 2019, pages 2350–2354, 2019. doi: 10.21437/Interspeech.2019-2219

  39. [41]

    Mark Huckvale. ACCDIST: a metric for comparing speakers' accents. In Interspeech 2004, pages 29–32, 2004. doi: 10.21437/Interspeech.2004-29

  40. [42]

    ITU-T. Methods for subjective determination of transmission quality. Technical Report Recommendation P.800, ITU-T, 1996. Introduces Mean Opinion Score (MOS) on a 5-point scale

  41. [43]

    ITU-R. Method for the subjective assessment of intermediate quality level of audio systems (MUSHRA). Technical Report Recommendation BS.1534-1, ITU-R, 2003. Introduces the MUlti Stimulus test with Hidden Reference and Anchor (MUSHRA)

  42. [44]

    Tuan Nam Nguyen, Seymanur Akti, Ngoc Quan Pham, and Alexander Waibel. Improving pronunciation and accent conversion through knowledge distillation and synthetic ground-truth from native TTS. In ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5, 2025. doi: 10.1109/ICASSP49660.2025.10890229

  43. [45]

    Jacob Kahn, Morgane Rivière, Wenming Zheng, Eugene Kharitonov, Qiantong Xu, Pierre-Emmanuel Mazaré, Julien Karadayi, Vitaliy Liptchinsky, Ronan Collobert, Christian Fügen, Tatiana Likhomanenko, Gabriel Synnaeve, Armand Joulin, Abdelrahman Mohamed, and Emmanuel Dupoux. Libri-Light: A Benchmark for ASR with Limited or No Supervision. In ICASSP 2020 - 45th IE...

  44. [46]

    Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. Librispeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5206–5210, 2015. doi: 10.1109/ICASSP.2015.7178964

  45. [47]

    Heiga Zen, Viet Dang, Rob Clark, Yu Zhang, Ron J. Weiss, Ye Jia, Zhifeng Chen, and Yonghui Wu. LibriTTS: A Corpus Derived from LibriSpeech for Text-to-Speech. In Proc. Interspeech 2019, pages 1526–1530, 2019. doi: 10.21437/Interspeech.2019-2441

  46. [48]

    Yuma Koizumi, Heiga Zen, Shigeki Karita, Yifan Ding, Kohei Yatabe, Nobuyuki Morioka, Michiel Bacchiani, Yu Zhang, Wei Han, and Ankur Bapna. LibriTTS-R: A Restored Multi-Speaker Text-to-Speech Corpus. In Proc. Interspeech 2023, pages 5496–5500, 2023. doi: 10.21437/Interspeech.2023-1584

  47. [49]

    Keith Ito and Linda Johnson. The LJ Speech Dataset. https://keithito.com/LJ-Speech-Dataset/, 2017

  48. [50]

    Junichi Yamagishi, Christophe Veaux, and Kirsten MacDonald. CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92), 2019

  49. [51]

    Rosana Ardila, Megan Branson, Kelly Davis, Michael Kohler, Josh Meyer, Michael Henretty, Reuben Morais, Lindsay Saunders, Francis Tyers, and Gregor Weber. Common Voice: A Massively-Multilingual Speech Corpus. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), pages 4218–4222, Marseille, France, 2020. European Language Reso...

  50. [52]

    Steven Weinberger. Speech Accent Archive. https://accent.gmu.edu, 2015

  51. [53]

    John Kominek and Alan W. Black. The CMU Arctic speech databases. In Proc. 5th ISCA Workshop on Speech Synthesis (SSW5), pages 223–224, Pittsburgh, PA, USA, 2004. ISCA. URL https://www.isca-archive.org/ssw_2004/kominek04b_ssw.html

  52. [54]

    Guanlong Zhao, Sinem Sonsaat, Alif Silpachai, Ivana Lucic, Evgeny Chukharev-Hudilainen, John Levis, and Ricardo Gutierrez-Osuna. L2-ARCTIC: A Non-native English Speech Corpus. In Proc. Interspeech 2018, pages 2783–2787, 2018. doi: 10.21437/Interspeech.2018-1110

  53. [55]

    Afroz Ahamad, Ankit Anand, and Pranesh Bhargava. AccentDB: A Database of Non-Native English Accents to Assist Neural Speech Recognition. In Proceedings of the 12th Language Resources and Evaluation Conference (LREC 2020), pages 5351–5358, Marseille, France,

  54. [56]

    European Language Resources Association. URL https://aclanthology.org/2020.lrec-1.659

  55. [57]

    N. Minematsu, Y. Tomiyama, K. Yoshimoto, K. Shimizu, S. Nakagawa, M. Dantsuji, and S. Makino. English speech database read by Japanese learners for CALL system development. In Manuel González Rodríguez and Carmen Paz Suarez Araujo, editors, Proceedings of the Third International Conference on Language Resources and Evaluation (LREC'02), Las Palmas, Canary Islands - Spain, Ma...

  56. [58]

    Qin Yan, Saeed Vaseghi, Dimitrios Rentzos, and Ching-Hsiang Ho. Analysis and synthesis of formant spaces of British, Australian, and American accents. IEEE Transactions on Audio, Speech, and Language Processing, 15(2):676–689, 2007. doi: 10.1109/TASL.2006.885923

  57. [59]

    Daniel Felps, Christian Geng, and Ricardo Gutierrez-Osuna. Foreign accent conversion through concatenative synthesis in the articulatory domain. IEEE Transactions on Audio, Speech, and Language Processing, 20(8):2301–2312, 2012. doi: 10.1109/TASL.2012.2201474

  58. [60]

    Foreign-language speech synthesis

    Nick Campbell. Foreign-language speech synthesis. In Proceedings of the Third ESCA/COCOSDA Workshop (ETRW) on Speech Synthesis, Jenolan Caves House, Blue Mountains, NSW, Australia, 1998. ISCA. URL http://www.isca-speech.org/archive. Pages not available.

  59. [61]

    Spoken language conversion with accent morphing

    Mark Huckvale and Kayoko Yanagisawa. Spoken language conversion with accent morphing. In Proceedings of the 6th ISCA Speech Synthesis Workshop, Bonn, Germany, 2007. ISCA.

  60. [62]

    Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication, 50(3):215–227, 2008

    Tomoki Toda, Alan W. Black, and Keiichi Tokuda. Statistical mapping between articulatory movements and acoustic spectrum using a Gaussian mixture model. Speech Communication, 50(3):215–227, 2008. doi: 10.1016/j.specom.2007.09.001. URL https://www.sciencedirect.com/science/article/pii/S0167639307001495.

  61. [63]

    Can voice conversion be used to reduce non-native accents? In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7879–7883, 2014

    Sandesh Aryal and Ricardo Gutierrez-Osuna. Can voice conversion be used to reduce non-native accents? In 2014 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7879–7883, 2014. doi: 10.1109/ICASSP.2014.6855134.

  62. [64]

    Accent conversion using phonetic posteriorgrams

    Guanlong Zhao, Sinem Sonsaat, John Levis, Evgeny Chukharev-Hudilainen, and Ricardo Gutierrez-Osuna. Accent conversion using phonetic posteriorgrams. In 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5314–5318, 2018. doi: 10.1109/ICASSP.2018.8462258.

  63. [65]

    Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning. Computer Speech & Language, 72:101302, 2022

    Shaojin Ding, Guanlong Zhao, and Ricardo Gutierrez-Osuna. Accentron: Foreign accent conversion to arbitrary non-native speakers using zero-shot learning. Computer Speech & Language, 72:101302, 2022. ISSN 0885-2308. doi: 10.1016/j.csl.2021.101302. URL https://www.sciencedirect.com/science/article/pii/S0885230821001029.

  64. [66]

    End-to-end accent conversion without using native utterances

    Songxiang Liu, Disong Wang, Yuewen Cao, Lifa Sun, Xixin Wu, Shiyin Kang, Zhiyong Wu, Xunying Liu, Dan Su, Dong Yu, and Helen Meng. End-to-end accent conversion without using native utterances. In ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6289–6293, 2020. doi: 10.1109/ICASSP40776.2020.9053797.

  65. [67]

    Accent modeling of low-resourced dialect in pitch accent language using variational autoencoder

    Kazuya Yufune, Tomoki Koriyama, Shinnosuke Takamichi, and Hiroshi Saruwatari. Accent modeling of low-resourced dialect in pitch accent language using variational autoencoder. In 11th ISCA Speech Synthesis Workshop (SSW 11), pages 189–194, 2021. doi: 10.21437/SSW.2021-33.

  66. [68]

    MacST: Multi-accent speech synthesis via text transliteration for accent conversion

    Sho Inoue, Shuai Wang, Wanxing Wang, Pengcheng Zhu, Mengxiao Bi, and Haizhou Li. MacST: Multi-accent speech synthesis via text transliteration for accent conversion. In ICASSP 2025 – 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2025. doi: 10.1109/ICASSP49357.2025.10888195.

  67. [69]

    Accent conversion using discrete units with parallel data synthesized from controllable accented TTS

    Tuan Nam Nguyen, Ngoc Quan Pham, and Alexander Waibel. Accent conversion using discrete units with parallel data synthesized from controllable accented TTS. In Proc. Workshop on Synthetic Data’s Transformative Role in Foundational Speech Models (SynData4GenAI 2024), pages 51–55, 2024. URL https://www.isca-archive.org/syndata4genai_2024/nguyen24_syndata4genai.html.

  68. [70]

    Accent conversion with articulatory representations

    Yashish M. Siriwardena, Nathan Swedlow, Audrey Howard, Evan Gitterman, Dan Darcy, Carol Espy-Wilson, and Andrea Fanelli. Accent conversion with articulatory representations. In Proc. Interspeech 2024, 2024. doi: 10.21437/Interspeech.2024-1416.

  69. [71]

    Training text-to-speech systems from synthetic data: A practical approach for accent transfer tasks

    Lev Finkelstein, Heiga Zen, Norman Casagrande, Chunan Chan, Ye Jia, Tom Kenter, Alexey Petelin, Jonathan Shen, Vincent Wan, Yu Zhang, Yonghui Wu, and Rob Clark. Training text-to-speech systems from synthetic data: A practical approach for accent transfer tasks. In Proc. Interspeech 2022, 2022. doi: 10.21437/Interspeech.2022-10115.

  70. [72]

    Cross-dialect text-to-speech in pitch-accent language incorporating multi-dialect phoneme-level BERT, September 2024

    Kazuki Yamauchi, Yuki Saito, and Hiroshi Saruwatari. Cross-dialect text-to-speech in pitch-accent language incorporating multi-dialect phoneme-level BERT, September 2024.

  71. [73]

    Remap, warp and attend: Non-parallel many-to-many accent conversion with normalizing flows

    Abdelhamid Ezzerg, Thomas Merritt, Kayoko Yanagisawa, Piotr Bilinski, Magdalena Proszewska, Kamil Pokora, Renard Korzeniowski, Roberto Barra-Chicote, and Daniel Korzekwa. Remap, warp and attend: Non-parallel many-to-many accent conversion with normalizing flows. In 2022 IEEE Spoken Language Technology Workshop (SLT), pages 984–990, 2023. doi: 10.1109/SLT54...

  72. [74]

    DART: Disentanglement of accent and speaker representation in multispeaker text-to-speech, October 2024

    Jan Melechovsky, Ambuj Mehrish, Berrak Sisman, and Dorien Herremans. DART: Disentanglement of accent and speaker representation in multispeaker text-to-speech, October 2024.

  73. [75]

    FAC-FACodec: Controllable Zero-Shot Foreign Accent Conversion with Factorized Speech Codec

    Yurii Halychanskyi, Cameron Churchwell, Yutong Wen, and Volodymyr Kindratenko. FAC-FACodec: Controllable zero-shot foreign accent conversion with factorized speech codec, 2026. URL https://arxiv.org/abs/2510.10785.

  74. [76]

    Controllable accent normalization via discrete diffusion, 2026

    Qibing Bai, Yuhan Du, Tom Ko, Shuai Wang, Yannan Wang, and Haizhou Li. Controllable accent normalization via discrete diffusion, 2026. URL https://arxiv.org/abs/2603.14275.