Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning

Ding Ma; Fengji Li; Jiajun He; Jinyi Mi; Kazuhiro Kobayashi; Lester Phillip Violeta; Tomoki Toda; Wenchin Huang

arxiv: 2606.01905 · v1 · pith:FS35IYGSnew · submitted 2026-06-01 · 📡 eess.AS · cs.SD

Pith reviewed 2026-06-28 12:41 UTC · model grok-4.3

classification 📡 eess.AS cs.SD

keywords: electrolaryngeal speech enhancement · speech-text representation learning · voice conversion · sequence-to-sequence model · assistive communication · data augmentation

The pith

Integrating speech and text representations improves electrolaryngeal speech conversion without added complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework that combines pretrained speech and text modules to create integrated representations for converting distorted electrolaryngeal speech into more natural speech. This addresses mismatches that cause errors in standard voice conversion approaches. A sympathetic reader would care because it offers a practical way to enhance assistive devices for people who have lost their larynx, potentially improving intelligibility and naturalness in communication. The method uses fusion strategies and an extra loss term to transfer the representations effectively. Experiments show consistent outperformance over speech-only methods across datasets.

Core claim

The paper claims that constructing a network with pretrained modules to learn speech-text integrated representations, followed by an autoencoder-style reconstruction strategy to inherit these in the seq2seq voice conversion model, leads to better EL2SP performance. Three fusion strategies—middle-, input-, and hybrid-level—are introduced, along with an additional reconstruction loss, and when combined with data augmentations, the approach outperforms baselines relying solely on speech representations.

What carries the argument

Speech-text representation integration via middle-, input-, and hybrid-level fusion strategies in a network built from pretrained modules, transferred to the reconstruction stage through an additional loss term.
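The text reviewed here names the three fusion levels but does not define them, so what follows is only a minimal Python sketch under assumed definitions: input-level fusion mixes text information into the speech features before the encoder, middle-level fusion mixes it into the encoder output, and hybrid-level fusion does both. The cross-attention mixing, the residual addition, the toy encoder, and every function name are hypothetical illustrations, not the paper's architecture.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def cross_attend(queries, text_vectors):
    """For each query frame, return an attention-weighted mix of text vectors."""
    dim = len(text_vectors[0])
    scale = math.sqrt(dim)
    mixed = []
    for query in queries:
        scores = [sum(q * t for q, t in zip(query, token)) / scale
                  for token in text_vectors]
        weights = softmax(scores)
        mixed.append([sum(w * token[d] for w, token in zip(weights, text_vectors))
                      for d in range(dim)])
    return mixed


def add_frames(a, b):
    """Element-wise (residual) addition of two frame sequences of equal shape."""
    return [[x + y for x, y in zip(row_a, row_b)] for row_a, row_b in zip(a, b)]


def input_level_fusion(speech, text, encoder):
    # Assumed definition: text context is added BEFORE the encoder.
    return encoder(add_frames(speech, cross_attend(speech, text)))


def middle_level_fusion(speech, text, encoder):
    # Assumed definition: text context is added to the encoder OUTPUT.
    hidden = encoder(speech)
    return add_frames(hidden, cross_attend(hidden, text))


def hybrid_level_fusion(speech, text, encoder):
    # Assumed definition: both of the above, applied in sequence.
    hidden = encoder(add_frames(speech, cross_attend(speech, text)))
    return add_frames(hidden, cross_attend(hidden, text))


def toy_encoder(frames):
    """Hypothetical stand-in for a pretrained speech encoder: halves every value."""
    return [[0.5 * x for x in frame] for frame in frames]
```

All three functions return one fused vector per speech frame, so under these assumptions a downstream decoder would see the same sequence length and feature size whichever strategy is chosen.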

Load-bearing premise

Pretrained speech and text modules can be fused to produce integrated representations that transfer cleanly to the EL2SP reconstruction stage without introducing cumulative mapping errors or requiring increased model complexity.

What would settle it

A set of experiments on held-out EL2SP datasets where the speech-text methods fail to outperform or match the performance of speech-only baselines would falsify the central claim.
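One way such a comparison could be made concrete is a paired bootstrap over per-utterance scores on the same held-out test set. The sketch below is an illustration only: it assumes a lower-is-better objective score, and the function name, resample count, and confidence level are choices made here rather than anything reported in the paper.

```python
import random


def paired_bootstrap_ci(baseline, proposed, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(baseline - proposed).

    Both lists hold one score per utterance, in the same utterance order.
    With a lower-is-better score, a positive difference favours `proposed`.
    Returns (mean_difference, lower_bound, upper_bound).
    """
    if len(baseline) != len(proposed) or not baseline:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [b - p for b, p in zip(baseline, proposed)]
    n = len(diffs)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = sorted(sum(rng.choices(diffs, k=n)) / n for _ in range(n_resamples))
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lower, upper
```

If the interval for the speech-text system against the speech-only baseline includes zero or lies below it on a held-out set, that is the kind of result that would count against the central claim.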

Original abstract

Objective: laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn speech-text-based integrated representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model to inherit these representations without increasing model complexity. We introduce three fusion strategies including middle-, input-, and hybrid-level fusion strategies that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: experiments under different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: the proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper adds text fusion at multiple levels plus an auxiliary reconstruction loss to seq2seq EL2SP and claims better results than speech-only baselines, but the abstract gives no numbers to judge the size of the gains.

This paper takes the usual seq2seq voice conversion setup for electrolaryngeal to normal speech and layers in text representations through pretrained modules. It tests middle, input, and hybrid fusion, then adds a reconstruction loss on the combined representations so the final model can inherit them without extra complexity. The stated goal is to reduce mapping errors from the EL-normal domain gap.

The method description is clear and the progressive fusion choices make sense as a way to control how much text information enters at different stages. The two-stage training (first learn integrated reps, then train reconstruction) is a practical way to avoid increasing model size. The abstract says the approach plus data augmentation beats speech-only baselines on multiple EL2SP datasets and that deeper designs give further gains.

The soft spot is the lack of any quantitative results, dataset sizes, error bars, or ablation tables in the abstract. Without those, it is hard to tell whether the fusion actually delivers meaningful improvement or just small edges that might not survive statistical checks. The weakest assumption—that the fused representations transfer cleanly without adding new errors—remains untested in the provided text.

This is for researchers working on assistive speech technology or domain-mismatch problems in voice conversion. A reader focused on practical EL speech enhancement would get the fusion strategies and loss design from it. The work is internally consistent and builds directly on prior VC methods, so it deserves a serious referee to examine the experiments and numbers.

Referee Report

1 major / 1 minor

Summary. The paper proposes a two-stage representation learning framework for electrolaryngeal speech enhancement in seq2seq voice conversion (EL2SP). Stage 1 constructs a network with pretrained speech and text modules to learn integrated representations via three fusion strategies (middle-, input-, and hybrid-level). Stage 2 applies autoencoder-style reconstruction training with an auxiliary reconstruction loss on the integrated representation to enable transfer without added model complexity. The central claim is that the resulting systems, when combined with data augmentations, consistently outperform speech-only baselines across multiple EL2SP datasets, with progressive gains validating the design choices.
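To make the Stage 2 objective described in this summary easier to picture, here is a minimal sketch of a total loss that adds an auxiliary reconstruction term on the integrated representation to the usual conversion loss. The L1 distance, the single weighting factor `rec_weight`, and all names are assumptions for illustration; the text reviewed here does not specify the actual loss terms or their weights.

```python
def l1_distance(predicted, target):
    """Mean absolute error over all frames and feature dimensions."""
    count = sum(len(row) for row in target)
    total = sum(abs(p - t)
                for pred_row, tgt_row in zip(predicted, target)
                for p, t in zip(pred_row, tgt_row))
    return total / count


def total_loss(mel_pred, mel_target, rep_pred, rep_target, rec_weight=1.0):
    """Conversion loss plus a weighted auxiliary reconstruction loss.

    mel_pred / mel_target: predicted and reference acoustic frames
        (a stand-in for the standard seq2seq VC objectives).
    rep_pred / rep_target: representation produced by the EL2SP model and the
        speech-text integrated representation learned in Stage 1.
    rec_weight: hypothetical weight on the auxiliary term.
    """
    vc_loss = l1_distance(mel_pred, mel_target)
    rec_loss = l1_distance(rep_pred, rep_target)
    return vc_loss + rec_weight * rec_loss
```

In this sketch the Stage 1 representation serves only as a training target, so nothing from the text branch would need to run at inference time, which is consistent with the claim that model complexity does not increase.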

Significance. If the empirical outperformance holds under detailed scrutiny, the work offers a practical, extensible approach to mitigating domain mismatch and cumulative mapping errors in EL2SP without increasing model complexity. This could meaningfully advance assistive communication technologies for laryngectomees by leveraging readily available text information alongside speech representations.

major comments (1)

[Abstract] Abstract (Results paragraph): the claim that experiments under different EL2SP datasets consistently demonstrate outperformance is unsupported by any quantitative metrics, error bars, dataset sizes, ablation studies, or statistical tests in the provided text, rendering the central empirical claim unverifiable.

minor comments (1)

[Abstract] Abstract (Methods paragraph): the three fusion strategies and the auxiliary reconstruction loss are described at a high level only; without equations, architecture diagrams, or pseudocode it is difficult to assess how representation transfer avoids cumulative errors.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address the major comment point by point below.

Referee: [Abstract] Abstract (Results paragraph): the claim that experiments under different EL2SP datasets consistently demonstrate outperformance is unsupported by any quantitative metrics, error bars, dataset sizes, ablation studies, or statistical tests in the provided text, rendering the central empirical claim unverifiable.

Authors: We thank the referee for this observation. The full manuscript contains the requested details (quantitative metrics, error bars, dataset sizes, ablation studies, and statistical tests) in the Experiments and Results sections. However, the abstract's Results paragraph summarizes these findings at a high level without specific numbers. To address the concern and make the abstract self-contained, we will revise it to include key quantitative results supporting the outperformance claim across datasets.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity

The paper describes an empirical two-stage pipeline (pretrained speech-text fusion followed by autoencoder-style reconstruction with auxiliary loss) whose central claim is outperformance on EL2SP datasets versus speech-only baselines. No equations, fitted parameters, or derivations are presented that reduce by construction to inputs, self-citations, or ansatzes. The methodology is self-contained, relying on standard pretrained modules and fusion strategies without load-bearing uniqueness theorems or self-referential definitions. The reader's assessment of score 2 is consistent with the absence of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that text auxiliaries improve mapping without new error sources.

[1] [1]

Electrolaryngeal speech enhancement with statistical voice conversion based on CLDNN,

K. Kobayashi and T. Toda, “Electrolaryngeal speech enhancement with statistical voice conversion based on CLDNN,” in2018 26th European Signal Processing Conference (EUSIPCO). IEEE, 2018, pp. 2115– 2119

2018

[2] [2]

A comprehensive review of speech emotion recognition systems,

T. M. Wani, T. S. Gunawan, S. A. A. Qadri, M. Kartiwi, and E. Am- bikairajah, “A comprehensive review of speech emotion recognition systems,”IEEE access, vol. 9, pp. 47 795–47 814, 2021

2021

[3] [3]

Psychosocial quality of life in patients after total laryngectomy

E. Babin, D. Beynier, D. Le Gall, and M. Hitier, “Psychosocial quality of life in patients after total laryngectomy.”Revue de laryngologie-otologie- rhinologie, vol. 130, no. 1, pp. 29–34, 2009

2009

[4] [4]

The effects of indwelling voice prosthesis on the quality of life, depressive symptoms, and self-esteem in patients with total laryngectomy,

B. Polat, K. S. Orhan, M. C. Kesimli, Y . Gorgulu, M. Ulusan, and K. Deger, “The effects of indwelling voice prosthesis on the quality of life, depressive symptoms, and self-esteem in patients with total laryngectomy,”European Archives of Oto-rhino-laryngology, vol. 272, no. 11, pp. 3431–3437, 2015

2015

[5] [5]

V oice restoration after total laryngec- tomy,

C. G. Tang and C. F. Sinclair, “V oice restoration after total laryngec- tomy,”Otolaryngologic Clinics of North America, vol. 48, no. 4, pp. 687–702, 2015

2015

[6] [6]

End-to-end mandarin speech reconstruction based on ultrasound tongue images using deep learning,

F. Li, F. Shen, D. Ma, J. Zhou, S. Zhang, L. Wang, F. Fan, T. Liu, X. Chen, T. Todaet al., “End-to-end mandarin speech reconstruction based on ultrasound tongue images using deep learning,”IEEE Trans- actions on Neural Systems and Rehabilitation Engineering, vol. 33, pp. 130–149, 2024

2024

[7] [7]

An endoscopic technique for restoration of voice after laryngectomy,

M. I. Singer and E. D. Blom, “An endoscopic technique for restoration of voice after laryngectomy,”Annals of Otology, Rhinology & Laryngology, vol. 89, no. 6, pp. 529–533, 1980

1980

[8] [8]

Industrialization of the electrolarynx with a pitch control function and its evaluation,

M. Hashiba, “Industrialization of the electrolarynx with a pitch control function and its evaluation,” IEICE Trans. Inf. & Syst. (Japanese Edition), D-II, vol. 94, no. 6, pp. 1240–1247, 2001

2001

[9] [9]

Differences in speaking proficiencies in three laryngectomee groups,

S. E. Williams and J. B. Watson, “Differences in speaking proficiencies in three laryngectomee groups,” Archives of Otolaryngology, vol. 111, no. 4, pp. 216–219, 1985

1985

[10] [10]

Improvement of electrolaryngeal speech by introducing normal excitation information,

K. Ma, P. Demirel, C. Y. Espy-Wilson, and J. MacAuslan, “Improvement of electrolaryngeal speech by introducing normal excitation information,” in EUROSPEECH, 1999, pp. 323–326

1999

[11] [11]

Recognition of the electrolaryngeal speech: comparison between human and machine,

P. Stanislav, J. V. Psutka, and J. Psutka, “Recognition of the electrolaryngeal speech: comparison between human and machine,” in Text, Speech, and Dialogue: 20th International Conference, TSD 2017, Prague, Czech Republic, August 27-31, 2017, Proceedings 20. Springer, 2017, pp. 509–517

2017

[12] [12]

Voice conversion: Factors responsible for quality,

D. Childers, B. Yegnanarayana, and K. Wu, “Voice conversion: Factors responsible for quality,” in ICASSP’85. IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 10. IEEE, 1985, pp. 748–751

1985

[13] [13]

Continuous probabilistic transform for voice conversion,

Y. Stylianou, O. Cappé, and E. Moulines, “Continuous probabilistic transform for voice conversion,” IEEE Transactions on Speech and Audio Processing, vol. 6, no. 2, pp. 131–142, 1998

1998

[14] [14]

Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,

T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 15, no. 8, pp. 2222–2235, 2007

2007

[15] [15]

Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,

K. Nakamura, T. Toda, H. Saruwatari, and K. Shikano, “Speaking-aid systems using GMM-based voice conversion for electrolaryngeal speech,” Speech Communication, vol. 54, no. 1, pp. 134–146, 2012

2012

[16] [16]

A hybrid approach to electrolaryngeal speech enhancement based on noise reduction and statistical excitation generation,

K. Tanaka, T. Toda, G. Neubig, S. Sakti, and S. Nakamura, “A hybrid approach to electrolaryngeal speech enhancement based on noise reduction and statistical excitation generation,” IEICE Transactions on Information and Systems, vol. 97, no. 6, pp. 1429–1437, 2014

2014

[17] [17]

Alaryngeal speech enhancement based on one-to-many eigenvoice conversion,

H. Doi, T. Toda, K. Nakamura, H. Saruwatari, and K. Shikano, “Alaryngeal speech enhancement based on one-to-many eigenvoice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 1, pp. 172–183, 2013

2013

[18] [18]

Mandarin electro-laryngeal speech enhancement based on statistical voice conversion and manual tone control,

Z. Qian, H. Niu, L. Wang, K. Kobayashi, S. Zhang, and T. Toda, “Mandarin electro-laryngeal speech enhancement based on statistical voice conversion and manual tone control,” in 2021 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE, 2021, pp. 546–552

2021

[19] [19]

Sequence to sequence learning with neural networks,

I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning with neural networks,” in Advances in Neural Information Processing Systems (NeurIPS), 2014, pp. 3104–3112

2014

[20] [20]

Voice conversion using sequence-to-sequence learning of context posterior probabilities,

H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, “Voice conversion using sequence-to-sequence learning of context posterior probabilities,” arXiv preprint arXiv:1704.02360, 2017

2017

[21] [21]

AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,

K. Tanaka, H. Kameoka, T. Kaneko, and N. Hojo, “AttS2S-VC: Sequence-to-sequence voice conversion with attention and context preservation mechanisms,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 6805–6809

2019

[22] [22]

Pretraining and fine-tuning techniques for electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion,

D. Ma, L. P. Violeta, K. Kobayashi, and T. Toda, “Pretraining and fine-tuning techniques for electrolaryngeal speech enhancement based on sequence-to-sequence voice conversion,” IEEE Transactions on Audio, Speech and Language Processing, vol. 33, pp. 3189–3201, 2025

2025

[23] [23]

Pretraining and adaptation techniques for electrolaryngeal speech recognition,

L. P. Violeta, D. Ma, W.-C. Huang, and T. Toda, “Pretraining and adaptation techniques for electrolaryngeal speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 32, pp. 2777–2789, 2024

2024

[24] [24]

Mandarin electrolaryngeal speech voice conversion with sequence-to-sequence modeling,

M.-C. Yen, W.-C. Huang, K. Kobayashi, Y.-H. Peng, S.-W. Tsai, Y. Tsao, T. Toda, J.-S. R. Jang, and H.-M. Wang, “Mandarin electrolaryngeal speech voice conversion with sequence-to-sequence modeling,” in 2021 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE, 2021, pp. 650–657

2021

[25] [25]

Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,

W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, “Voice transformer network: Sequence-to-sequence voice conversion using transformer with text-to-speech pretraining,” in Interspeech, 2021, pp. 4676–4680

2021

[26] [26]

Pretraining techniques for sequence-to-sequence voice conversion,

W.-C. Huang, T. Hayashi, Y.-C. Wu, H. Kameoka, and T. Toda, “Pretraining techniques for sequence-to-sequence voice conversion,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 29, pp. 745–755, 2021

2021

[27] [27]

Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data,

S.-w. Park, D.-y. Kim, and M.-c. Joe, “Cotatron: Transcription-guided speech encoder for any-to-many voice conversion without parallel data,” in Interspeech, 2020, pp. 4696–4700

2020

[28] [28]

Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet,

M. Zhang, X. Wang, F. Fang, H. Li, and J. Yamagishi, “Joint training framework for text-to-speech and voice conversion using multi-source Tacotron and WaveNet,” in Interspeech, 2019, pp. 15–19

2019

[29] [29]

StyleFusion TTS: Multimodal style-control and enhanced feature fusion for zero-shot text-to-speech synthesis,

Z. Chen, X. Li, Z. Ai, and S. Xu, “StyleFusion TTS: Multimodal style-control and enhanced feature fusion for zero-shot text-to-speech synthesis,” in Chinese Conference on Pattern Recognition and Computer Vision (PRCV). Springer, 2024, pp. 263–277

2024

[30] [30]

MM-TTS: Multi-modal prompt based style transfer for expressive text-to-speech synthesis,

W. Guan, Y. Li, T. Li, H. Huang, F. Wang, J. Lin, L. Huang, L. Li, and Q. Hong, “MM-TTS: Multi-modal prompt based style transfer for expressive text-to-speech synthesis,” in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 38, no. 16, 2024, pp. 18 117–18 125

2024

[31] [31]

Crossmodal voice conversion,

H. Kameoka, K. Tanaka, A. V. Puche, Y. Ohishi, and T. Kaneko, “Crossmodal voice conversion,” arXiv preprint arXiv:1904.04540, 2019

2019

[32] [32]

HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts,

X. Niu, J. Zhang, and C. P. Martin, “HybridVC: Efficient Voice Style Conversion with Text and Audio Prompts,” in Interspeech 2024, 2024, pp. 4368–4372

2024

[33] [33]

AlignSTS: Speech-to-singing conversion via cross-modal alignment,

R. Li, R. Huang, L. Zhang, J. Liu, and Z. Zhao, “AlignSTS: Speech-to-singing conversion via cross-modal alignment,” in Findings of the Association for Computational Linguistics: ACL 2023. Association for Computational Linguistics, 2023, pp. 7074–7088

2023

[34] [34]

WavFusion: Towards wav2vec 2.0 multimodal speech emotion recognition,

F. Li, J. Luo, and W. Xia, “WavFusion: Towards wav2vec 2.0 multimodal speech emotion recognition,” in International Conference on Multimedia Modeling. Springer, 2025, pp. 325–336

2025

[35] [35]

A robustly optimized BERT pre-training approach with post-training,

L. Zhuang, L. Wayne, S. Ya, and Z. Jun, “A robustly optimized BERT pre-training approach with post-training,” in Proceedings of the 20th Chinese National Conference on Computational Linguistics, S. Li, M. Sun, Y. Liu, H. Wu, K. Liu, W. Che, S. He, and G. Rao, Eds., 2021, pp. 1218–1227

2021

[36] [36]

Wav2vec 2.0: A framework for self-supervised learning of speech representations,

A. Baevski, Y. Zhou, A. Mohamed, and M. Auli, “Wav2vec 2.0: A framework for self-supervised learning of speech representations,” Advances in Neural Information Processing Systems, vol. 33, pp. 12 449–12 460, 2020

2020

[37] [37]

Enhanced multimodal speech processing for healthcare applications: A deep fusion approach,

J. Lv, W. Boulila, S. Rani, and H. Jiang, “Enhanced multimodal speech processing for healthcare applications: A deep fusion approach,” IEEE Journal of Selected Topics in Signal Processing, vol. 19, no. 4, pp. 600–612, 2025

2025

[38] [38]

Large language models are strong audio-visual speech recognition learners,

U. Cappellazzo, M. Kim, H. Chen, P. Ma, S. Petridis, D. Falavigna, A. Brutti, and M. Pantic, “Large language models are strong audio-visual speech recognition learners,” in ICASSP 2025 - 2025 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2025, pp. 1–5

2025

[39] [39]

LoRA: Low-Rank Adaptation of large language models,

E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, “LoRA: Low-Rank Adaptation of large language models,” ICLR, vol. 1, no. 2, pp. 3–23, 2022

2022

[40] [40]

Improving electrolaryngeal speech enhancement via a representation learning method based on integrated text and speech representations,

D. Ma, J. Mi, F. Li, L. P. Violeta, K. Kobayashi, and T. Toda, “Improving electrolaryngeal speech enhancement via a representation learning method based on integrated text and speech representations,” in 2025 47th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), 2025, pp. 1–6

2025

[41] [41]

Attention is all you need,

A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” Advances in Neural Information Processing Systems (NeurIPS), vol. 30, pp. 5998–6008, 2017

2017

[42] [42]

Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,

H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text-to-speech system based on deep convolutional networks with guided attention,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4784–4788

2018

[43] [43]

JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis,

R. Sonobe, S. Takamichi, and H. Saruwatari, “JSUT corpus: free large-scale Japanese speech corpus for end-to-end speech synthesis,” arXiv preprint arXiv:1711.00354, 2017

2017

[44] [44]

ESPnet: End-to-end speech processing toolkit,

S. Watanabe, T. Hori, S. Karita, T. Hayashi, J. Nishitoba, Y. Unno, N. E. Y. Soplin, J. Heymann, M. Wiesner, N. Chen, A. Renduchintala, and T. Ochiai, “ESPnet: End-to-end speech processing toolkit,” in Interspeech, 2018, pp. 2207–2211

2018

[45] [45]

Large batch optimization for deep learning: Training BERT in 76 minutes,

Y. You, J. Li, S. Reddi, J. Hseu, S. Kumar, S. Bhojanapalli, X. Song, J. Demmel, K. Keutzer, and C.-J. Hsieh, “Large batch optimization for deep learning: Training BERT in 76 minutes,” in Proceedings of International Conference on Learning Representations, 2020, p. 36 pages

2020

[46] [46]

Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,

R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2020), 2020, pp. 6199–6203

2020

[47] [47]

Mel-cepstral distance measure for objective speech quality assessment,

R. Kubichek, “Mel-cepstral distance measure for objective speech quality assessment,” in Proceedings of IEEE Pacific Rim Conference on Communications Computers and Signal Processing, vol. 1, pp. 125–128

[48] [48]

Investigating self-supervised pretraining frameworks for pathological speech recognition,

L. P. Violeta, W.-C. Huang, and T. Toda, “Investigating self-supervised pretraining frameworks for pathological speech recognition,” in Interspeech, 2022, pp. 41–45

2022

[49] [49]

I. T. U. T. S. Sector, Methods for Subjective Determination of Transmission Quality. International Telecommunication Union, 1996

1996

[50] [50]

A voice conversion system from electrolarynx speech to preoperative patient’s speech for total laryngectomy,

N. Nishio, K. Kobayashi, D. Ma, S. Mitani, M. Sone, and T. Toda, “A voice conversion system from electrolarynx speech to preoperative patient’s speech for total laryngectomy,” OTO Open, vol. 10, no. 1, p. e70207, 2026

2026

[51] [51]

CycleGAN-based prosody and spectrum modeling for Mandarin touch-controlled electrolaryngeal speech enhancement,

J. Zhou, L. Wang, F. Li, S. Zhang, F. Shen, F. Fan, T. Liu, X. Chen, and H. Niu, “CycleGAN-based prosody and spectrum modeling for Mandarin touch-controlled electrolaryngeal speech enhancement,” Biomedical Signal Processing and Control, vol. 118, p. 109746, 2026

2026

[52] [52]

Electrolaryngeal speech intelligibility enhancement through robust linguistic encoders,

L. P. Violeta, W.-C. Huang, D. Ma, R. Yamamoto, K. Kobayashi, and T. Toda, “Electrolaryngeal speech intelligibility enhancement through robust linguistic encoders,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP 2024), pp. 10 961–10 965

2024

[53] [53]

Conan: A chunkwise online network for zero-shot adaptive voice conversion,

Y. Zhang, B. Tian, and Z. Duan, “Conan: A chunkwise online network for zero-shot adaptive voice conversion,” in 2025 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU), 2025, pp. 1–8

2025

[54] [54]

Study of lightweight Transformer architectures for single-channel speech enhancement,

H. Zhao and N. Madhu, “Study of lightweight Transformer architectures for single-channel speech enhancement,” in 2025 33rd European Signal Processing Conference (EUSIPCO). IEEE, 2025, pp. 101–105

2025

[55] [55]

BinauralFlow: A causal and streamable approach for high-quality binaural speech synthesis with flow matching models,

S. Liang, D. Markovic, I. D. Gebru, S. Krenn, T. Keebler, J. Sandakly, F. Yu, S. Hassel, C. Xu, and A. Richard, “BinauralFlow: A causal and streamable approach for high-quality binaural speech synthesis with flow matching models,” in Forty-second International Conference on Machine Learning, 2025, pp. 1–18

2025

[56] [56]

FastS2S-VC: Streaming non-autoregressive Sequence-to-Sequence Voice Conversion,

H. Kameoka, K. Tanaka, and T. Kaneko, “FastS2S-VC: Streaming non-autoregressive Sequence-to-Sequence Voice Conversion,” arXiv preprint arXiv:2104.06900, 2021

2021

[57] [57]

An investigation of streaming non-autoregressive sequence-to-sequence voice conversion,

T. Hayashi, K. Kobayashi, and T. Toda, “An investigation of streaming non-autoregressive sequence-to-sequence voice conversion,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2022, 2022, pp. 6802–6806

2022

[58] [58]

Low-latency electrolaryngeal speech enhancement based on FastSpeech2-based voice conversion and self-supervised speech representation,

K. Kobayashi, T. Hayashi, and T. Toda, “Low-latency electrolaryngeal speech enhancement based on FastSpeech2-based voice conversion and self-supervised speech representation,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) 2023, 2023, p. 5 pages

2023