Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning
Pith reviewed 2026-06-28 12:41 UTC · model grok-4.3
The pith
Integrating speech and text representations improves electrolaryngeal speech conversion without added complexity.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that EL2SP performance improves when a network built from pretrained modules first learns speech-text integrated representations, and an autoencoder-style reconstruction stage then transfers those representations into the seq2seq voice conversion model. It introduces three fusion strategies (middle-, input-, and hybrid-level) and an additional reconstruction loss, and reports that, combined with data augmentations, the approach outperforms baselines that rely solely on speech representations.
What carries the argument
Speech-text representation integration via middle-, input-, and hybrid-level fusion strategies in a network built from pretrained modules, transferred to the reconstruction stage through an additional loss term.
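The text available here does not define the three fusion levels, so the following is only a minimal sketch of one plausible reading: the level names where text information is injected relative to the speech encoder. Everything in it is a hypothetical stand-in (`toy_encoder`, mean pooling of the text sequence, per-frame concatenation); the paper's actual modules are pretrained networks and may fuse by attention or some other mechanism.

```python
from typing import Callable, List

Frame = List[float]
Seq = List[Frame]


def mean_pool(frames: Seq) -> Frame:
    """Average a sequence of frames into a single vector."""
    n = len(frames)
    return [sum(col) / n for col in zip(*frames)]


def concat_per_frame(frames: Seq, vec: Frame) -> Seq:
    """Append the same utterance-level vector to every frame."""
    return [frame + vec for frame in frames]


def toy_encoder(frames: Seq) -> Seq:
    """Hypothetical stand-in for a pretrained speech encoder: doubles each value."""
    return [[2.0 * x for x in frame] for frame in frames]


def input_level(speech: Seq, text: Seq,
                encoder: Callable[[Seq], Seq] = toy_encoder) -> Seq:
    # Assumed reading: text joins the speech features before encoding.
    return encoder(concat_per_frame(speech, mean_pool(text)))


def middle_level(speech: Seq, text: Seq,
                 encoder: Callable[[Seq], Seq] = toy_encoder) -> Seq:
    # Assumed reading: text joins the encoder's hidden states.
    return concat_per_frame(encoder(speech), mean_pool(text))


def hybrid_level(speech: Seq, text: Seq,
                 encoder: Callable[[Seq], Seq] = toy_encoder) -> Seq:
    # Assumed reading: text is injected at both points.
    text_vec = mean_pool(text)
    hidden = encoder(concat_per_frame(speech, text_vec))
    return concat_per_frame(hidden, text_vec)


# Toy data: 2 speech frames of width 2, 3 text embeddings of width 2.
speech = [[1.0, 2.0], [3.0, 4.0]]
text = [[0.0, 1.0], [1.0, 0.0], [0.5, 0.5]]

fused_input = input_level(speech, text)
fused_middle = middle_level(speech, text)
fused_hybrid = hybrid_level(speech, text)
```

Mean pooling is used here only to sidestep the length mismatch between speech frames and text tokens; a real system would need some form of alignment between the two sequences.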
Load-bearing premise
Pretrained speech and text modules can be fused to produce integrated representations that transfer cleanly to the EL2SP reconstruction stage without introducing cumulative mapping errors or requiring increased model complexity.
What would settle it
Experiments on held-out EL2SP datasets in which the speech-text methods merely match or fall below speech-only baselines would falsify the central claim.
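One way such a comparison could be run is a paired bootstrap over per-utterance scores. The sketch below is an illustration, not a procedure taken from the paper: the function name, the lower-is-better convention, and the example scores are all assumptions.

```python
import random
from typing import List, Tuple


def paired_bootstrap(baseline: List[float], proposed: List[float],
                     n_resamples: int = 2000, seed: int = 0) -> Tuple[float, float]:
    """Return (mean improvement, fraction of resamples in which proposed wins).

    Scores are assumed lower-is-better (e.g. a distortion or error measure)
    and paired per utterance: baseline[i] and proposed[i] score the same item.
    """
    if len(baseline) != len(proposed) or not baseline:
        raise ValueError("need two non-empty score lists of equal length")
    # A positive difference means the proposed system scored better.
    diffs = [b - p for b, p in zip(baseline, proposed)]
    rng = random.Random(seed)
    n = len(diffs)
    wins = 0
    for _ in range(n_resamples):
        # Resample utterances with replacement, keeping pairs intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        if sum(sample) > 0:
            wins += 1
    return sum(diffs) / n, wins / n_resamples


# Hypothetical per-utterance scores, NOT results from the paper.
baseline_scores = [7.1, 6.8, 7.4, 6.9]
proposed_scores = [6.5, 6.6, 7.0, 6.7]
mean_gain, win_rate = paired_bootstrap(baseline_scores, proposed_scores)
```

A win rate near 0.5, or a mean gain near zero on held-out data, would be the kind of outcome that counts against the central claim.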
Original abstract
Objective: Laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, which degrade naturalness and intelligibility. Although EL-speech-to-normal-speech conversion (EL2SP) based on sequence-to-sequence (seq2seq) voice conversion (VC) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework that integrates speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model.
Methods: Our methodology comprises two main stages: (1) representation integration and learning, and (2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn integrated speech-text representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. We introduce three fusion strategies (middle-, input-, and hybrid-level) that progressively enhance learning. Moreover, in addition to the standard seq2seq VC objectives, a reconstruction loss on the integrated representation is introduced to refine representation transfer.
Results: Experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods.
Significance: The proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a two-stage representation learning framework for electrolaryngeal speech enhancement in seq2seq voice conversion (EL2SP). Stage 1 constructs a network with pretrained speech and text modules to learn integrated representations via three fusion strategies (middle-, input-, and hybrid-level). Stage 2 applies autoencoder-style reconstruction training with an auxiliary reconstruction loss on the integrated representation to enable transfer without added model complexity. The central claim is that the resulting systems, when combined with data augmentations, consistently outperform speech-only baselines across multiple EL2SP datasets, with progressive gains validating the design choices.
Significance. If the empirical outperformance holds under detailed scrutiny, the work offers a practical, extensible approach to mitigating domain mismatch and cumulative mapping errors in EL2SP without increasing model complexity. This could meaningfully advance assistive communication technologies for laryngectomees by leveraging readily available text information alongside speech representations.
major comments (1)
- [Abstract] Abstract (Results paragraph): the claim that experiments under different EL2SP datasets consistently demonstrate outperformance is unsupported by any quantitative metrics, error bars, dataset sizes, ablation studies, or statistical tests in the provided text, rendering the central empirical claim unverifiable.
minor comments (1)
- [Abstract] Abstract (Methods paragraph): the three fusion strategies and the auxiliary reconstruction loss are described at a high level only; without equations, architecture diagrams, or pseudocode it is difficult to assess how representation transfer avoids cumulative errors.
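For orientation only, the kind of objective the abstract appears to describe could be written as follows. This is a hedged guess at the general form, not the paper's equation: the symbols $\mathcal{L}_{\text{VC}}$, $\lambda$, $z_t$, $\hat{z}_t$ and the choice of an L1 distance are all assumptions introduced here.

```latex
% Assumed notation (not from the paper):
%   z_t        integrated speech-text representation from stage 1, frame t
%   \hat{z}_t  representation produced by the EL2SP model from EL speech alone
%   \lambda    weight of the auxiliary term
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{VC}} + \lambda \, \mathcal{L}_{\text{rec}},
\qquad
\mathcal{L}_{\text{rec}}
  = \frac{1}{T} \sum_{t=1}^{T} \bigl\lVert \hat{z}_t - z_t \bigr\rVert_1
```

Under this reading, text is needed only to produce the targets $z_t$ during training, which would be consistent with the claim that the final model carries no added complexity. The manuscript's own equations are what would confirm or refute it.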
Simulated Author's Rebuttal
Thank you for the opportunity to respond to the referee's comments. We address the major comment point by point below.
- Referee: [Abstract] Abstract (Results paragraph): the claim that experiments under different EL2SP datasets consistently demonstrate outperformance is unsupported by any quantitative metrics, error bars, dataset sizes, ablation studies, or statistical tests in the provided text, rendering the central empirical claim unverifiable.
  Authors: We thank the referee for this observation. The full manuscript contains the requested details (quantitative metrics, error bars, dataset sizes, ablation studies, and statistical tests) in the Experiments and Results sections. However, the abstract's Results paragraph summarizes these findings at a high level without specific numbers. To address the concern and make the abstract self-contained, we will revise it to include key quantitative results supporting the outperformance claim across datasets.
  Revision: yes
Circularity Check
No significant circularity
The paper describes an empirical two-stage pipeline (pretrained speech-text fusion followed by autoencoder-style reconstruction with auxiliary loss) whose central claim is outperformance on EL2SP datasets versus speech-only baselines. No equations, fitted parameters, or derivations are presented that reduce by construction to inputs, self-citations, or ansatzes. The methodology is self-contained, relying on standard pretrained modules and fusion strategies without load-bearing uniqueness theorems or self-referential definitions. The reader's assessment of score 2 is consistent with the absence of any circular reduction.