pith · machine review for the scientific record

arxiv: 2604.27607 · v2 · submitted 2026-04-30 · 💻 cs.CL


JaiTTS: A Thai Voice Cloning Model

Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Nithid Guntasin, Pontakorn Trakuekul, Sumana Sumanakul, Thanavin Denkavin, Vichayuth Nitayasomboon


Pith reviewed 2026-05-08 03:05 UTC · model grok-4.3

classification 💻 cs.CL
keywords Thai TTS · voice cloning · text-to-speech · autoregressive model · code-switching · continual training · Thai language · speech synthesis

The pith

A Thai voice cloning model generates short utterances with a lower character error rate than human recordings while matching them on longer ones.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents JaiTTS-v1.0, a voice cloning text-to-speech system for Thai built by continual training on a large Thai-centric speech corpus. It adapts a tokenizer-free autoregressive architecture to process numerals and Thai-English code-switching directly without explicit normalization, then evaluates performance on both short and long speech generation tasks that mirror common real-world uses. On short tasks the model achieves a character error rate of 1.94 percent, below the human ground truth of 1.98 percent; on long tasks it performs on par with humans, and it wins 283 of 400 pairwise human preference tests against commercial systems. A sympathetic reader would care because Thai TTS has historically struggled with mixed-language inputs and limited data, so a system that handles realistic usage out of the box could open practical applications in voice interfaces and content creation for Thai speakers.

Core claim

JaiTTS-v1.0 is built through continual training of a tokenizer-free autoregressive base architecture on a large Thai-centric speech corpus, and directly processes numerals and Thai-English code-switching without explicit text normalization. It achieves a state-of-the-art CER of 1.94 percent on short-duration tasks, surpassing the human ground truth of 1.98 percent, performs on par with human ground truth on long-duration tasks, and wins 283 of 400 pairwise human judgment comparisons against commercial flagships, with only 58 losses.
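For concreteness: CER is the character-level edit distance between the ASR transcript of the audio and the reference text, normalized by reference length, so 1.94 versus 1.98 percent is roughly one extra character error per 2,500 reference characters. A minimal sketch of the standard Levenshtein CER (not the paper's evaluation code):

```python
def char_error_rate(reference: str, hypothesis: str) -> float:
    """CER = Levenshtein edit distance (substitutions + deletions +
    insertions) between transcript and reference, over reference length."""
    m, n = len(reference), len(hypothesis)
    prev = list(range(n + 1))          # distances for the empty reference prefix
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            sub = 0 if reference[i - 1] == hypothesis[j - 1] else 1
            curr[j] = min(prev[j] + 1,       # deletion
                          curr[j - 1] + 1,   # insertion
                          prev[j - 1] + sub) # substitution (or match)
        prev = curr
    return prev[n] / max(m, 1)

# One substituted character in an 11-character reference -> CER ~ 0.091
print(char_error_rate("hello world", "hella world"))
```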

What carries the argument

Continual training of the VoxCPM tokenizer-free autoregressive TTS architecture on a large Thai-centric speech corpus, which enables direct handling of code-switched and numeric text inputs.
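Nothing exotic is implied by "continual training": initialize from the pretrained base and keep optimizing the same next-step prediction objective on the new corpus, typically at a reduced learning rate so base capabilities are adapted rather than overwritten. A self-contained toy sketch in PyTorch, with a tiny stand-in module and random tensors in place of the VoxCPM checkpoint and the Thai corpus (all names here are illustrative, not the authors' code):

```python
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Stand-in for a pretrained tokenizer-free autoregressive backbone; real
# continual training would load the released VoxCPM checkpoint instead.
class TinyAutoregressiveTTS(nn.Module):
    def __init__(self, vocab: int = 256, dim: int = 64):
        super().__init__()
        self.embed = nn.Embedding(vocab, dim)
        self.rnn = nn.GRU(dim, dim, batch_first=True)
        self.head = nn.Linear(dim, vocab)

    def forward(self, seq: torch.Tensor) -> torch.Tensor:
        # Same objective as pre-training: condition on seq[:, :-1],
        # predict seq[:, 1:] (next-step prediction loss).
        h, _ = self.rnn(self.embed(seq[:, :-1]))
        logits = self.head(h)
        return nn.functional.cross_entropy(
            logits.reshape(-1, logits.size(-1)), seq[:, 1:].reshape(-1))

model = TinyAutoregressiveTTS()       # stand-in for a loaded pretrained model
corpus = TensorDataset(torch.randint(0, 256, (128, 32)))  # stand-in Thai data
optim = torch.optim.AdamW(model.parameters(), lr=1e-5)    # low LR: adapt, don't erase

model.train()
for epoch in range(2):
    for (seq,) in DataLoader(corpus, batch_size=16, shuffle=True):
        loss = model(seq)
        optim.zero_grad()
        loss.backward()
        optim.step()
```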

If this is right

  • Thai applications can now use voice cloning with little or no text preprocessing for mixed-language inputs.
  • Short- and long-duration performance parity with humans suggests the model fits both brief announcements and extended conversations.
  • Winning most human comparisons against commercial systems indicates open models can compete in low-resource language settings.
  • Direct processing of numerals and code-switching reduces engineering overhead for realistic Thai deployment.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the approach scales to other mixed-language Southeast Asian languages, similar continual-training pipelines could accelerate TTS development for those languages.
  • Community access to the code and demo opens the door to fine-tuning on individual voices for personalized Thai interfaces.
  • Strong short-task results may translate to better performance in interactive voice assistants where utterances are typically brief.

Load-bearing premise

The test sets, human ground truth recordings, and pairwise listening tests are unbiased and representative of real-world Thai usage including code-switching.

What would settle it

A fresh test set drawn from diverse Thai speakers and natural code-switching contexts where the model's character error rate rises above the human baseline or where it loses more than half of new pairwise preference tests.

Figures

Figures reproduced from arXiv: 2604.27607 by Attapol T. Rutherford, Jullajak Karnjanaekarin, Narongkorn Panitsrisit, Nithid Guntasin, Pontakorn Trakuekul, Sumana Sumanakul, Thanavin Denkavin, Vichayuth Nitayasomboon.

Figure 1: Architecture of VoxCPM, the backbone of JaiTTS-v1.0. The Text-Semantic Language Model (TSLM) … (view at source ↗)
Figure 2: Head-to-head human judgment results of JaiTTS-v1.0 against commercial flagship models. (view at source ↗)
read the original abstract

We present JaiTTS-v1.0, a state-of-the-art Thai voice cloning text-to-speech model built through continual training on a large Thai-centric speech corpus. The model architecture is adapted from VoxCPM, a tokenizer-free autoregressive TTS model. JaiTTS-v1.0 directly processes numerals and Thai-English code-switching, which is very common in realistic settings, without explicit text normalization. We test the models on short- and long-duration speech generation, which reflects many real-world use cases. JaiTTS-v1.0 achieves a state-of-the-art CER of 1.94%, surpassing the human ground truth of 1.98% for short-duration tasks while performing on par with human ground truth for long-duration tasks. In human judgment evaluations, our model wins 283 of 400 pairwise comparisons against commercial flagships, with only 58 losses. Our code and demo are available at https://github.com/JTS-AI-Team/JaiTTS .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents JaiTTS-v1.0, a Thai voice cloning TTS model adapted from the VoxCPM tokenizer-free autoregressive architecture. It claims direct handling of Thai-English code-switching and numerals without explicit normalization, with evaluation on short- and long-duration speech tasks. Key results include a CER of 1.94% on short-duration tasks (surpassing human ground truth at 1.98%) and parity on long-duration tasks, plus winning 283 of 400 pairwise human judgments against commercial flagships.

Significance. If the evaluation protocols prove robust and the test sets representative of real Thai usage including code-switching, the work would constitute a meaningful empirical advance in language-specific TTS for Thai, demonstrating effective zero-shot handling of mixed-language input in an autoregressive model. The public release of code and demo supports reproducibility and further research.

major comments (3)
  1. [Abstract] The central claim that JaiTTS-v1.0 achieves a CER of 1.94% surpassing human ground truth (1.98%) on short-duration tasks is load-bearing for the state-of-the-art assertion. The manuscript must explicitly confirm that an identical ASR pipeline was used to compute CER on both model outputs and human reference recordings; without this, the 0.04-point margin could arise from ASR inconsistency rather than TTS improvement. A sketch of this control follows this list.
  2. [Abstract] The human judgment result (283 wins, 58 losses out of 400 pairwise comparisons) is load-bearing for the superiority claim over commercial systems. The paper must detail the protocol, including blinding procedures, balancing across speakers/conditions/durations, selection of the 400 pairs, and safeguards against prompt leakage or evaluator bias, as these controls are required to rule out artifacts.
  3. [Abstract] The test sets for short- and long-duration tasks are central to validating the model's handling of code-switching and numerals. The manuscript should report the size, composition, and construction method of these held-out sets (including the proportion of code-switched and numeral-containing utterances) to establish that they are unbiased and representative of real-world Thai usage.
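The control demanded in major comment 1 is mechanical to express: one shared ASR function and one shared metric, applied to both arms. A hedged sketch; the callables stand in for whatever Thai TTS, ASR, and CER implementations the authors actually used:

```python
def evaluate_both_arms(utterances, synthesize, transcribe, cer):
    """Score synthesized and human audio with ONE shared ASR pipeline
    (`transcribe`) and ONE shared metric (`cer`), so a 1.94% vs 1.98%
    margin cannot come from evaluation asymmetry between the two arms.

    `utterances` yields (reference_text, human_audio) pairs; the three
    callables are placeholders, not the paper's actual components.
    """
    model_cer, human_cer = [], []
    for text, human_audio in utterances:
        model_cer.append(cer(text, transcribe(synthesize(text))))  # same ASR...
        human_cer.append(cer(text, transcribe(human_audio)))       # ...both arms
    n = len(model_cer)
    return sum(model_cer) / n, sum(human_cer) / n
```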
minor comments (2)
  1. [Abstract] The abstract would benefit from naming the specific commercial flagships used in the 400 pairwise comparisons and from adding a table summarizing CER and human preference results broken down by duration.
  2. [Abstract] Consider including basic statistical tests (e.g., p-values or confidence intervals) for the CER difference and win rate to quantify the reliability of the reported margins; a sketch of the win-rate test follows this list.
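Minor comment 2 is cheap to satisfy for the win rate: treat each decisive pairwise comparison as a Bernoulli trial and run an exact sign test, dropping the ties. A minimal sketch with SciPy, assuming independence across the 400 pairs (which the protocol description would need to justify):

```python
from scipy.stats import binomtest

wins, losses = 283, 58                  # ties (400 - 283 - 58 = 59) excluded
res = binomtest(wins, n=wins + losses, p=0.5, alternative="two-sided")

print(f"win rate on decisive pairs: {wins / (wins + losses):.3f}")   # ~0.830
print(f"exact sign-test p-value:    {res.pvalue:.3g}")
print(f"95% CI for win rate:        {res.proportion_ci(confidence_level=0.95)}")
```

The CER margin is harder: a confidence interval there needs per-utterance error counts, which is exactly the breakdown this comment asks the authors to report.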

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the referee's insightful comments. We address each major comment below and have made revisions to the manuscript to provide the requested clarifications and details.

read point-by-point responses
  1. Referee: [Abstract] The central claim that JaiTTS-v1.0 achieves a CER of 1.94% surpassing human ground truth (1.98%) on short-duration tasks is load-bearing for the state-of-the-art assertion. The manuscript must explicitly confirm that an identical ASR pipeline was used to compute CER on both model outputs and human reference recordings; without this, the 0.04-point margin could arise from ASR inconsistency rather than TTS improvement.

    Authors: We confirm that an identical ASR pipeline was used to compute the CER scores for both the model outputs and the human reference recordings. This ensures the comparison is direct and the observed margin is attributable to the TTS quality rather than evaluation artifacts. We have added an explicit confirmation of this in the revised abstract and evaluation section. revision: yes

  2. Referee: [Abstract] The human judgment result (283 wins, 58 losses out of 400 pairwise comparisons) is load-bearing for the superiority claim over commercial systems. The paper must detail the protocol, including blinding procedures, balancing across speakers/conditions/durations, selection of the 400 pairs, and safeguards against prompt leakage or evaluator bias, as these controls are required to rule out artifacts.

    Authors: We have revised the manuscript to include a comprehensive description of the human evaluation protocol. This now details the blinding procedures (evaluators were blinded to the system identities), balancing across speakers, conditions, and durations, the selection process for the 400 pairs (randomly sampled from a larger set of generated and reference samples), and safeguards including randomized presentation and controls for prompt leakage and evaluator bias. revision: yes

  3. Referee: [Abstract] The test sets for short- and long-duration tasks are central to validating the model's handling of code-switching and numerals. The manuscript should report the size, composition, and construction method of these held-out sets (including the proportion of code-switched and numeral-containing utterances) to establish that they are unbiased and representative of real-world Thai usage.

    Authors: The revised manuscript now includes detailed information on the test sets. We report the sizes, composition (including proportions of code-switched and numeral-containing utterances), and construction methods for both the short- and long-duration held-out sets. These details demonstrate that the sets are unbiased and representative of real-world Thai usage. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical TTS evaluation with no derivation chain

full rationale

The paper describes training JaiTTS-v1.0 on a Thai speech corpus and reports direct empirical measurements (CER on short/long utterances, pairwise human judgments). No equations, first-principles derivations, or predictions are presented that could reduce to fitted inputs or self-citations by construction. The architecture is adapted from an external model (VoxCPM) without any claimed uniqueness theorem or ansatz smuggling. All load-bearing claims rest on external test data and human raters rather than internal redefinitions or renamings of known results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on empirical outcomes of neural network training and human evaluation on speech data. No free parameters, axioms, or invented entities are introduced beyond standard assumptions of machine learning model training.

pith-pipeline@v0.9.0 · 5513 in / 1325 out tokens · 81199 ms · 2026-05-08T03:05:23.398080+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 12 canonical work pages

  1. [1]

    Philip Anastassiou, Jiawei Chen, Jitong Chen, Yuanzhe Chen, Zhuo Chen, Ziyi Chen, Jian Cong, Lelai Deng, Chuang Ding, Lu Gao, and 1 others. 2024. Seed-TTS: A family of high-quality versatile speech generation models. arXiv preprint arXiv:2406.02430. https://arxiv.org/abs/2406.02430

  2. [2]

    R. Ardila, M. Branson, K. Davis, M. Henretty, M. Kohler, J. Meyer, R. Morais, L. Saunders, F. M. Tyers, and G. Weber. 2020. Common Voice: A massively-multilingual speech corpus. In Proceedings of the 12th Conference on Language Resources and Evaluation (LREC 2020), pages 4211--4215.

  3. [3]

    Oleg Atamanenko, Anna Chalova, Joseph Coombes, Nikki Cope, Phillip Dang, Zhifeng Deng, Jimmy Du, Michael Ermolenko, Feifan Fan, Yufei Feng, and 1 others. 2025. TTS-1 technical report. Technical report, Inworld AI. arXiv preprint arXiv:2507.21138. https://arxiv.org/abs/2507.21138

  4. [4]

    Thura Aung, Panyut Sriwirote, Thanachot Thavornmongkol, Knot Pipatsrisawat, Titipat Achakulvisut, and Zaw Htet Aung. 2025. ThonburianTTS: Enhancing neural flow matching models for authentic Thai text-to-speech. In 2025 20th International Joint Symposium on Artificial Intelligence and Natural Language Processing (iSAI-NLP). https://doi.org/10.1109/iSAI-NLP66160.2025.11320472

  5. [5]

    Sanyuan Chen, Chengyi Wang, Zhengyang Chen, Yu Wu, Shujie Liu, Zhuo Chen, Jinyu Li, Naoyuki Kanda, Takuya Yoshioka, Xiong Xiao, and 1 others. 2022a. WavLM: Large-scale self-supervised pre-training for full stack speech processing. IEEE Journal of Selected Topics in Signal Processing, 16(6):1505--1518.

  6. [6]

    Zhengyang Chen, Sanyuan Chen, Yu Wu, Yao Qian, Chengyi Wang, Shujie Liu, Yanmin Qian, and Michael Zeng. 2022b. Large-scale self-supervised speech representation learning for automatic speaker verification. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6147--6151. IEEE.

  7. [7]

    Fredrik Cumlin, Xinyu Liang, Victor Ungureanu, Chandan K. A. Reddy, Christian Schüldt, and Saikat Chatterjee. 2024. DNSMOS Pro: A reduced-size DNN for probabilistic MOS of speech. In Proceedings of Interspeech 2024, pages 4818--4822. https://doi.org/10.21437/Interspeech.2024-478

  8. [8]

    Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang, Zhifang Guo, Ziyue Jiang, Hongkun Hao, Zishan Guo, and 1 others. 2026. Qwen3-TTS technical report. arXiv preprint arXiv:2601.15621. https://arxiv.org/abs/2601.15621

  9. [9]

    Fabian Mentzer, David Minnen, Eirikur Agustsson, and Michael Tschannen. 2024. Finite scalar quantization: VQ-VAE made simple. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2309.15505

  10. [10]

    MiniCPM Team, Chaojun Xiao, Yuxuan Li, Xu Han, Yuzhuo Bai, Jie Cai, and 1 others. 2025. MiniCPM4: Ultra-efficient LLMs on end devices. arXiv preprint arXiv:2506.07900. https://arxiv.org/abs/2506.07900

  11. [11]

    Warit Sirichotedumrong, Adisai Na-Thalang, Potsawee Manakul, Pittawat Taveekitworachai, Sittipong Sripaisarnmongkol, and Kunat Pipatanakul. 2026. Typhoon ASR real-time: FastConformer-transducer for Thai automatic speech recognition. arXiv preprint arXiv:2601.13044. https://arxiv.org/abs/2601.13044

  12. [12]

    Yifan Yang, Zheshu Song, Jianheng Zhuo, Mingyu Cui, Jinpeng Li, Bo Yang, Yexing Du, Ziyang Ma, Xunying Liu, Ziyuan Wang, Ke Li, Shuai Fan, Kai Yu, Wei-Qiang Zhang, Guoguo Chen, and Xie Chen. 2025. GigaSpeech 2: An evolving, large-scale and multi-domain ASR corpus for low-resource languages with automated crawl... https://doi.org/10.18653/v1/2025.acl-long.135

  13. [13]

    Zhen Ye, Peiwen Sun, Jiahe Lei, Hongzhan Lin, Xu Tan, Zheqi Dai, Qiuqiang Kong, Jianyi Chen, Jiahao Pan, Qifeng Liu, Yike Guo, and Wei Xue. 2025a. Codec does matter: Exploring the semantic shortcoming of codec for audio language model. In International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2408.17175

  14. [14]

    Zhen Ye, Xinfa Zhu, Chi-Min Chan, Xinsheng Wang, Xu Tan, Jiahe Lei, Yi Peng, Haohe Liu, Yizhu Jin, Zheqi Dai, Hongzhan Lin, Jianyi Chen, Xingjian Du, Liumeng Xue, Yunlin Chen, Zhifei Li, Lei Xie, Qiuqiang Kong, Yike Guo, and Wei Xue. 2025b. Llasa: Scaling train-time and inference-time compute for Llama-based speech synthesis. https://arxiv.org/abs/2502.04128

  15. [15]

    Yixuan Zhou, Guoyang Zeng, Xin Liu, Xiang Li, Renjie Yu, Ziyang Wang, Runchuan Ye, Weiyue Sun, Jiancheng Gui, Kehan Li, Zhiyong Wu, and Zhiyuan Liu. 2025. VoxCPM: Tokenizer-free TTS for context-aware speech generation and true-to-life voice cloning. arXiv preprint arXiv:2509.24650. https://arxiv.org/abs/2509.24650