Toward Natural Emotional Text-To-Speech System with Fine-Grained Non-Verbal Expression Control

Bagus Tris Atmaja; Sakriani Sakti; Wangzixi Zhou

arxiv: 2605.25504 · v1 · pith:PJ45JLV4new · submitted 2026-05-25 · 📡 eess.AS

Wangzixi Zhou, Bagus Tris Atmaja, Sakriani Sakti

Pith reviewed 2026-06-29 20:51 UTC · model grok-4.3

classification 📡 eess.AS

keywords emotional TTS, non-verbal vocalizations, fine-grained control, EARS corpus, tag-based annotation, expressiveness, emotional recognition accuracy, speech synthesis

The pith

A tag-based annotation scheme for non-verbal sounds lets emotional TTS systems reach 78.8 percent recognition accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that fine-grained control over non-verbal vocalizations (NVs) such as laughs, sighs, and gasps can make emotional text-to-speech output more expressive and easier for listeners to recognize as conveying specific emotions. The authors address the shortage of finely annotated NV data by reprocessing female utterances from the EARS corpus and introducing a tag scheme that marks NV type, frequency, and duration before feeding the data into an emotional TTS benchmark. A sympathetic reader would care because most current emotional TTS ignores these non-verbal elements even though they are central to how humans signal feeling in speech. If the approach holds, it shows that modest additions of annotated NV data can raise emotional clarity at only a minor cost to perceived naturalness.

Core claim

The authors claim that curating and reprocessing female NV utterances from the EARS corpus, developing a tag-based annotation scheme to encode NV types, frequencies, and durations, and building an emotional TTS benchmark produces systems with an expressiveness MOS of 4.20 and an emotional recognition accuracy of 78.8 percent, with NV cues proving especially effective for high-arousal emotions and nearly perfect for sadness.

What carries the argument

The tag-based annotation scheme that encodes non-verbal vocalization types, frequencies, and durations to enable precise control inside the emotional TTS pipeline.
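To make the idea of encoding type, frequency, and duration concrete, here is a minimal Python sketch of how such tags might be separated from the spoken text before synthesis. The tag syntax `<laugh n=2 dur=0.8>`, the `NVTag` class, and the `split_text_and_tags` helper are all hypothetical illustrations; the abstract does not show the authors' actual tag format.

```python
import re
from dataclasses import dataclass

# Hypothetical tag syntax, e.g. "<laugh n=2 dur=0.8>".
# The paper's real annotation format is not given in the abstract.
TAG_PATTERN = re.compile(
    r"<(?P<kind>[a-z]+)\s+n=(?P<count>\d+)\s+dur=(?P<dur>\d+(?:\.\d+)?)>"
)


@dataclass(frozen=True)
class NVTag:
    kind: str        # NV type, e.g. "laugh", "sigh", "gasp"
    count: int       # frequency: how many times the sound repeats
    duration: float  # duration in seconds (assumed unit)


def split_text_and_tags(annotated: str) -> tuple[str, list[NVTag]]:
    """Return the plain text and the NV tags found in an annotated line."""
    tags = [
        NVTag(m["kind"], int(m["count"]), float(m["dur"]))
        for m in TAG_PATTERN.finditer(annotated)
    ]
    # Remove the tags, then collapse the whitespace they leave behind.
    plain = " ".join(TAG_PATTERN.sub(" ", annotated).split())
    return plain, tags


plain, tags = split_text_and_tags(
    "<laugh n=2 dur=0.8> I can't believe you did that! <sigh n=1 dur=1.2>"
)
print(plain)
print(tags)
```

A real pipeline would presumably keep each tag's position in the token sequence so the model knows where the sound occurs; this sketch only extracts the three attributes the scheme is said to encode.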

If this is right

Emotional recognition accuracy reaches 78.8 percent overall.
NV cues achieve 82.5 percent accuracy for happy, 82.7 percent for fear, and 98.3 percent for sadness.
Perceived naturalness experiences only minor degradation.
The benchmark demonstrates that fine-grained NV control can be added to existing emotional TTS pipelines.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same tag scheme could be applied to male voices or other languages to test whether the accuracy gains generalize.
Real-time conversational agents might incorporate adjustable NV parameters to match user arousal levels.
Combining the NV tags with visual or gesture cues could further increase multimodal emotion transmission.

Load-bearing premise

The reprocessed female NV utterances from the EARS corpus together with the new tag-based annotation scheme provide a faithful and sufficient representation of the fine-grained non-verbal expressions needed for the TTS benchmark.

What would settle it

A side-by-side listening test in which the same text prompts are synthesized once with the NV tags and once without them, and emotional recognition accuracy shows no statistically significant gain for the tagged version.
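One way such a paired comparison could be scored is an exact McNemar test on per-trial correct/incorrect judgments. This test is an assumption chosen here because the outcomes are paired and binary; the paper does not name a test. The function name and the toy data below are illustrative only.

```python
from math import comb


def mcnemar_exact(with_tags: list[bool], without_tags: list[bool]) -> float:
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    if len(with_tags) != len(without_tags):
        raise ValueError("paired lists must have equal length")
    # Discordant pairs: exactly one of the two conditions was recognized correctly.
    b = sum(w and not wo for w, wo in zip(with_tags, without_tags))
    c = sum(wo and not w for w, wo in zip(with_tags, without_tags))
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    # Under the null hypothesis each discordant pair is a fair coin flip.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy data (not from the paper): 16 paired trials, True = emotion recognized.
tagged = [True] * 8 + [False] * 1 + [True] * 5 + [False] * 2
untagged = [False] * 8 + [True] * 1 + [True] * 5 + [False] * 2
p_value = mcnemar_exact(tagged, untagged)
print(f"p = {p_value:.4f}")
```

Only the discordant pairs matter: trials where both versions were recognized, or both missed, carry no information about which version is better.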

Original abstract

While current emotional Text-to-Speech (TTS) models have successfully controlled verbal prosody, they often ignore non-verbal vocalizations (NVs), which are essential for authentic human emotion. Although some non-verbal datasets have recently emerged, they often lack high-quality, fine-grained annotations, which restricts a model's ability to precisely control NV generation. To address this limitation, we propose a novel approach for fine-grained non-verbal expression synthesis. We curate and reprocess female NV utterances from the EARS corpus, develop a new annotation scheme using tags to encode NV types, frequencies, and durations, and build an emotional TTS benchmark to demonstrate its effectiveness. Our evaluation shows that while our NV approach leads to minor trade-offs in perceived naturalness, it significantly improves expressiveness (eMOS 4.20) and emotional recognition accuracy (78.8%). Emotion-specific analysis further reveals that NV cues are highly effective for high-arousal emotions like happy (82.5%) and fear (82.7%), and almost perfectly convey sadness (98.3%).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper ships a new tag scheme for controlling NV type/frequency/duration in emotional TTS plus a benchmark, with concrete gains in expressiveness and recognition accuracy.

The core contribution is the tag-based annotation that captures NV types, frequencies, and durations on reprocessed female EARS utterances, then used to build an emotional TTS benchmark. That scheme looks new relative to the cited emotional TTS and NV dataset work, and the paper reports usable numbers: eMOS of 4.20 for expressiveness, 78.8% emotional recognition overall, with standout results for sadness at 98.3% and solid figures for happy and fear.

The evaluation also notes only minor naturalness trade-offs, which is a fair trade if the expressiveness lift holds. The emotion-specific breakdown adds practical value for high-arousal cases.

The main soft spot is that the abstract leaves the TTS architecture, training details, and any statistical tests unspecified, so it is hard to tell how much of the reported lift traces to the NV tags versus other pipeline choices. The weakest assumption flagged above (that the reprocessed EARS data plus tags give a faithful representation of fine-grained NVs) remains plausible but untested without the full methods and dataset statistics.

This is incremental but grounded work aimed at speech synthesis practitioners who already work with emotional TTS. It is worth sending to peer review because it introduces a reproducible annotation method and supplies checkable metrics rather than just claims.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes a novel method for fine-grained non-verbal expression control in emotional text-to-speech (TTS) systems. It involves curating and reprocessing female non-verbal vocalization (NV) utterances from the EARS corpus, introducing a tag-based annotation scheme to encode NV types, frequencies, and durations, and constructing an emotional TTS benchmark. The evaluation indicates minor trade-offs in naturalness but significant improvements in expressiveness (eMOS 4.20) and emotional recognition accuracy (78.8%), with particularly high performance for emotions like sadness (98.3%).

Significance. If the findings are robust, this work fills an important gap in emotional TTS by enabling precise control over non-verbal cues, which are crucial for natural emotional expression. The provision of concrete quantitative results and emotion-specific breakdowns strengthens the contribution. The approach could influence future TTS systems aiming for more human-like speech synthesis.

major comments (2)

[Methods] The reprocessing of the EARS corpus and the development of the tag-based annotation scheme require more detailed description, including dataset statistics, annotation guidelines, and inter-annotator agreement, as this is load-bearing for the claim of fine-grained control.
[Experiments] The evaluation lacks details on the TTS model architecture, training procedure, baseline comparisons, and statistical tests for the reported metrics (e.g., eMOS 4.20, 78.8% accuracy), making it challenging to attribute improvements specifically to the NV approach rather than other factors.

minor comments (1)

[Abstract] The abstract could benefit from a brief mention of the TTS model used or the number of NV types annotated to provide more context.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help strengthen the manuscript. We address each major point below and will revise accordingly to provide the requested details.

Referee: [Methods] The reprocessing of the EARS corpus and the development of the tag-based annotation scheme require more detailed description, including dataset statistics, annotation guidelines, and inter-annotator agreement, as this is load-bearing for the claim of fine-grained control.

Authors: We agree that the current description of the EARS corpus reprocessing and tag-based annotation scheme is insufficient to fully support the fine-grained control claims. In the revision we will expand this section with: (i) full dataset statistics (total utterances, speaker demographics, distribution across NV types/frequencies/durations), (ii) explicit annotation guidelines and tag definitions, and (iii) inter-annotator agreement results (e.g., Cohen’s kappa). These additions will make the methodology reproducible and directly address the load-bearing concern.

revision: yes

Referee: [Experiments] The evaluation lacks details on the TTS model architecture, training procedure, baseline comparisons, and statistical tests for the reported metrics (e.g., eMOS 4.20, 78.8% accuracy), making it challenging to attribute improvements specifically to the NV approach rather than other factors.

Authors: We acknowledge that the experimental section requires additional transparency. The revised manuscript will include: (i) the precise TTS model architecture and conditioning mechanism for the tags, (ii) training hyperparameters and data splits, (iii) explicit baseline systems (e.g., standard emotional TTS without NV tags), and (iv) statistical tests (paired t-tests or Wilcoxon signed-rank with p-values and effect sizes) for all reported metrics including eMOS and emotion recognition accuracy. This will allow clearer attribution of gains to the NV annotation scheme.

revision: yes
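The inter-annotator agreement statistic named in the first response, Cohen's kappa, can be computed from two annotators' label sequences as sketched below. The function name and the toy NV labels are illustrative assumptions and are not data from the paper.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label by both.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / n ** 2
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Toy NV-type annotations for six utterances (hypothetical).
annotator_1 = ["laugh", "laugh", "sigh", "gasp", "sigh", "laugh"]
annotator_2 = ["laugh", "sigh", "sigh", "gasp", "sigh", "laugh"]
kappa = cohens_kappa(annotator_1, annotator_2)
print(f"kappa = {kappa:.3f}")
```

Kappa discounts the agreement two annotators would reach by chance given how often each uses every label, which is why it is usually preferred to raw percent agreement for categorical tags such as NV type.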

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an engineering pipeline: reprocess EARS corpus female NV utterances, apply a new tag-based annotation scheme for NV types/frequencies/durations, then benchmark an emotional TTS system. Reported metrics (eMOS 4.20, 78.8% recognition accuracy) are evaluation outcomes on held-out or separate test data, not quantities that reduce by construction to the input annotations or fitted parameters. No equations, self-definitional loops, fitted-input-as-prediction, or load-bearing self-citations appear in the abstract or described method. The derivation chain is self-contained data-processing plus standard TTS evaluation and does not invoke uniqueness theorems or ansatzes from prior author work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, no explicit free parameters, axioms or invented entities are stated. The central claim rests on the unverified assumption that the curated EARS subset and new tags faithfully capture the required non-verbal phenomena.

Reference graph

Works this paper leans on

22 extracted references · 4 canonical work pages · 2 internal anchors

[1] INTRODUCTION. Emotional Text-to-Speech (TTS) synthesis is a growing research area driven by the increasing integration of conversational AI into daily life. Users now expect emotionally rich and engaging responses from dialogue systems, making the ability to synthesize speech with specific emotional tones crucial for creating more empathetic and human-...
[2] RELATED WORK. 2.1. Non-verbal Dataset. While several non-verbal vocalization (NV) datasets have been proposed, their limitations hinder the development of fine-grained, controllable emotional TTS models. The NVTTS dataset [4], derived from sources like VoxCeleb [5] and Expresso [6], suffers from poor acoustic quality. With only 1525 of its 3642 utteranc...
[3] CONSTRUCT FINE-GRAINED NON-VERBAL EXPRESSION DATA. Our dataset is constructed by annotating and filtering the existing EARS dataset [9]. This section provides an overview of the original data sources and presents statistics on the non-verbal sounds (NVs) and emotion tags in our final dataset. (Footnote 1: https://37integer.github.io/FINE-GRAINED-NON-VERBAL-TTS/) 3....
[4] NON-VERBAL EMOTIONAL TTS. We selected Grad-TTS [11] as our backbone model, which is recognized for its high-quality synthesis of reading-style speech. To enable emotional synthesis, we enhanced the model with an emotion encoder, allowing it to incorporate emotional embeddings. Following Russell’s circumplex model [12] of affect, we utilized arousal and val...
[5] EXPERIMENT. 5.1. Training setting. To enable the synthesis of emotional verbal speech, we constructed a comprehensive 9-hour mixed dataset of English female speakers, sampled at 22.05 kHz. It was compiled from: EXPRESSO [6], SEMAINE [13], and ESD [14] datasets. To address missing continuous arousal and valence labels in EXPRESSO and ESD, we used a pr...
[6] CONCLUSION. In this paper, we constructed a Fine-Grained Non-Verbal Expression Data and built a Fine-Grained Non-Verbal emotional TTS benchmark to enable the use of fine-grained non-verbal cues. Our comprehensive subjective evaluation yielded three key findings. First, while the inclusion of NVs resulted in a minor trade-off in perceived speech naturalne...
[7] Albert Mehrabian, Nonverbal communication, Routledge, 2017.
[8] Klaus R Scherer, “Affect bursts,” Emotions: Essays on emotion theory, vol. 161, pp. 196, 1994.
[9] Wessel Kraaij, Thomas Hain, Mike Lincoln, and Wilfried Post, “The AMI meeting corpus,” in Proc. International Conference on Methods and Techniques in Behavioral Research, 2005, pp. 1–4.
[10] Maksim Borisov, Egor Spirin, and Daria Diatlova, “NonverbalTTS: A public English corpus of text-aligned nonverbal vocalizations with emotion annotations for text-to-speech,” arXiv preprint arXiv:2507.13155, 2025.
[11] Arsha Nagrani, Joon Son Chung, Weidi Xie, and Andrew Zisserman, “VoxCeleb: Large-scale speaker verification in the wild,” Computer Speech & Language, vol. 60, pp. 101027, 2020.
[12] Tu Anh Nguyen, Wei-Ning Hsu, Antony D’Avirro, Bowen Shi, Itai Gat, Maryam Fazel-Zarandi, Tal Remez, Jade Copet, Gabriel Synnaeve, Michael Hassid, et al., “Expresso: A benchmark and analysis of discrete expressive speech resynthesis,” arXiv preprint arXiv:2308.05725, 2023.
[13] Detai Xin, Shinnosuke Takamichi, and Hiroshi Saruwatari, “JNV corpus: A corpus of Japanese nonverbal vocalizations with diverse phrases and emotions,” Speech Communication, vol. 156, pp. 103004, 2024.
[14] Zhihao Du, Yuxuan Wang, Qian Chen, Xian Shi, Xiang Lv, Tianyu Zhao, Zhifu Gao, Yexin Yang, Changfeng Gao, Hui Wang, et al., “CosyVoice 2: Scalable streaming speech synthesis with large language models,” arXiv preprint arXiv:2412.10117, 2024.
[15] Julius Richter, Yi-Chiao Wu, Steven Krenn, Simon Welker, Bunlong Lay, Shinji Watanabe, Alexander Richard, and Timo Gerkmann, “EARS: An anechoic fullband speech dataset benchmarked for speech enhancement and dereverberation,” in ISCA Interspeech, 2024, pp. 4873–4877.
[16] Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever, “Robust speech recognition via large-scale weak supervision,” in International Conference on Machine Learning. PMLR, 2023, pp. 28492–28518.
[17] Vadim Popov, Ivan Vovk, Vladimir Gogoryan, Tasnima Sadekova, and Mikhail Kudinov, “Grad-TTS: A diffusion probabilistic model for text-to-speech,” in International Conference on Machine Learning. PMLR, 2021, pp. 8599–8608.
[18] James A Russell, “A circumplex model of affect,” Journal of Personality and Social Psychology, vol. 39, no. 6, pp. 1161, 1980.
[19] Gary McKeown, Michel F Valstar, Roderick Cowie, and Maja Pantic, “The SEMAINE corpus of emotionally coloured character interactions,” in 2010 IEEE International Conference on Multimedia and Expo. IEEE, 2010, pp. 1079–1084.
[20] Kun Zhou, Berrak Sisman, Rui Liu, and Haizhou Li, “Emotional voice conversion: Theory, databases and ESD,” Speech Communication, vol. 137, pp. 1–18, 2022.
[21] Johannes Wagner, Andreas Triantafyllopoulos, Hagen Wierstorf, Maximilian Schmitt, Felix Burkhardt, Florian Eyben, and Björn W Schuller, “Dawn of the transformer era in speech emotion recognition: Closing the valence gap,” IEEE Transactions on Pattern Analysis and Machine Intelligence, pp. 1–13, 2023.
[22] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative adversarial networks for efficient and high fidelity speech synthesis,” in Proc. NeurIPS, Dec. 2020, pp. 17022–17033.
