
arxiv: 2606.01905 · v1 · pith:FS35IYGSnew · submitted 2026-06-01 · 📡 eess.AS · cs.SD

Advancing Electrolaryngeal Speech Enhancement Through Speech-Text Representation Learning

Pith reviewed 2026-06-28 12:41 UTC · model grok-4.3

classification 📡 eess.AS cs.SD
keywords electrolaryngeal speech enhancement · speech-text representation learning · voice conversion · sequence-to-sequence model · assistive communication · data augmentation

The pith

Integrating speech and text representations improves electrolaryngeal speech conversion without added complexity.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes a framework that combines pretrained speech and text modules to create integrated representations for converting distorted electrolaryngeal speech into more natural speech. This addresses mismatches that cause errors in standard voice conversion approaches. A sympathetic reader would care because it offers a practical way to enhance assistive devices for people who have lost their larynx, potentially improving intelligibility and naturalness in communication. The method uses fusion strategies and an extra loss term to transfer the representations effectively. Experiments show consistent outperformance over speech-only methods across datasets.

Core claim

The paper claims that EL2SP performance improves when a network built from pretrained modules first learns speech-text integrated representations, and an autoencoder-style reconstruction strategy then lets the seq2seq voice conversion model inherit them. Three fusion strategies (middle-, input-, and hybrid-level) are introduced, along with an additional reconstruction loss. Combined with data augmentations, the approach outperforms baselines that rely solely on speech representations.

What carries the argument

Speech-text representation integration via middle-, input-, and hybrid-level fusion strategies in a network built from pretrained modules, transferred to the reconstruction stage through an additional loss term.
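The abstract does not say where each fusion level attaches, so the following is only a minimal sketch of one plausible reading: input-level fusion joins text information to the speech frames before the encoder, middle-level fusion joins it to the encoder output, and hybrid-level fusion does both. Everything here is an assumption made for illustration, including the names `concat_fuse` and `toy_encoder`, the use of plain concatenation, and the single utterance-level text vector. None of it is taken from the paper.

```python
from typing import List

Vector = List[float]
Frames = List[Vector]


def concat_fuse(frames: Frames, text_vec: Vector) -> Frames:
    """Append one utterance-level text vector to every frame (assumed fusion op)."""
    return [frame + text_vec for frame in frames]


def toy_encoder(frames: Frames) -> Frames:
    """Hypothetical stand-in for a pretrained encoder: frame -> [mean, max]."""
    return [[sum(f) / len(f), max(f)] for f in frames]


def input_level(speech: Frames, text_vec: Vector) -> Frames:
    # Fuse text with the raw speech frames, then encode.
    return toy_encoder(concat_fuse(speech, text_vec))


def middle_level(speech: Frames, text_vec: Vector) -> Frames:
    # Encode speech first, then fuse text with the hidden representation.
    return concat_fuse(toy_encoder(speech), text_vec)


def hybrid_level(speech: Frames, text_vec: Vector) -> Frames:
    # Fuse at both points: before the encoder and after it.
    return concat_fuse(input_level(speech, text_vec), text_vec)


speech = [[1.0, 3.0], [2.0, 4.0]]  # two toy frames, two features each
text_vec = [0.5]                   # toy text representation
print(input_level(speech, text_vec)[0])
print(middle_level(speech, text_vec)[0])
print(hybrid_level(speech, text_vec)[0])
```

A real system would replace `toy_encoder` with pretrained speech and text modules and would likely use a learned fusion layer instead of concatenation; the sketch only shows where the text signal could enter in each variant.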

Load-bearing premise

Pretrained speech and text modules can be fused to produce integrated representations that transfer cleanly to the EL2SP reconstruction stage without introducing cumulative mapping errors or requiring increased model complexity.

What would settle it

Experiments on held-out EL2SP datasets in which the speech-text methods only match, or fall short of, the speech-only baselines would falsify the central claim.
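One way to make such a comparison concrete is a paired significance test on per-utterance scores. The sketch below uses an exact one-sided sign-flip permutation test, which is a choice made here for illustration and not a procedure described in the paper. It assumes an error-style metric where lower is better, and the function name `paired_sign_flip_p_value` and the toy scores are hypothetical placeholders, not reported results.

```python
from itertools import product
from typing import List


def paired_sign_flip_p_value(baseline: List[float], proposed: List[float]) -> float:
    """Exact one-sided sign-flip test on paired error scores (lower is better).

    Returns the fraction of sign assignments whose summed difference is at
    least as large as the observed one. Small values suggest the proposed
    system really has lower error than the baseline.
    """
    diffs = [b - p for b, p in zip(baseline, proposed)]
    observed = sum(diffs)
    hits = 0
    total = 0
    # Enumerate every way of flipping the sign of each paired difference.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        if sum(s * d for s, d in zip(signs, diffs)) >= observed - 1e-12:
            hits += 1
    return hits / total


# Toy placeholder scores for four utterances (not from the paper).
baseline_errors = [5.0, 6.0, 7.0, 8.0]
proposed_errors = [4.0, 5.0, 6.0, 7.0]
print(paired_sign_flip_p_value(baseline_errors, proposed_errors))
```

Because the test enumerates all sign assignments, it is only practical for small sample sizes; a sampled permutation test or a bootstrap would be the usual substitute for larger evaluation sets.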

Original abstract

Objective: laryngectomees depend on an electromechanical device to generate electrolaryngeal (EL) speech. Compared with normal speech, EL speech suffers from severe distortion, limited phonetic variation, unnatural prosody, and temporal shifts, degrading naturalness and intelligibility. Although sequence-to-sequence (seq2seq) voice conversion (VC) based EL-speech-to-normal-speech conversion (EL2SP) is promising, substantial mismatches between EL and normal speech inevitably cause cumulative mapping errors that limit performance. To address this, we describe a novel representation learning framework integrating speech and text representations to improve mapping and reconstruction quality within a seq2seq VC model. Methods: our methodology comprises two main stages: 1) representation integration and learning, and 2) reconstruction training. A network capable of incorporating auxiliary text information is first constructed with pretrained modules to learn speech-text-based integrated representations. Then, an autoencoder-style reconstruction strategy finalizes the EL2SP model so that it inherits these representations without increasing model complexity. We introduce three fusion strategies (middle-, input-, and hybrid-level) that progressively enhance learning. Moreover, besides standard seq2seq VC objectives, an additional reconstruction loss on the integrated representation is introduced to refine representation transfer. Results: experiments on different EL2SP datasets consistently demonstrate that our methods, combined with data augmentations, outperform baselines relying solely on speech representations. Furthermore, progressive improvements with system design depth validate the effectiveness of our methods. Significance: the proposed methods provide an extensible and practical methodology for EL speech enhancement and assistive communication technologies.
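The abstract states that an additional reconstruction loss on the integrated representation is added to the standard seq2seq VC objectives, but gives no formula. A minimal sketch of how such a combined objective could look follows. The mean-squared-error form, the weighting factor `weight`, and the names `mse` and `total_loss` are assumptions introduced here, not details confirmed by the paper.

```python
from typing import List

Frames = List[List[float]]


def mse(predicted: Frames, target: Frames) -> float:
    """Mean squared error over all frames and feature dimensions."""
    squared = [
        (p - t) ** 2
        for p_frame, t_frame in zip(predicted, target)
        for p, t in zip(p_frame, t_frame)
    ]
    return sum(squared) / len(squared)


def total_loss(
    seq2seq_loss: float,
    predicted_repr: Frames,
    integrated_repr: Frames,
    weight: float = 1.0,  # hypothetical trade-off hyperparameter
) -> float:
    """Standard seq2seq VC loss plus an auxiliary representation loss.

    predicted_repr:  what the speech-only EL2SP model produces in stage 2.
    integrated_repr: the speech-text representation learned in stage 1,
                     treated here as a fixed target.
    """
    return seq2seq_loss + weight * mse(predicted_repr, integrated_repr)


# Toy values for illustration only.
print(total_loss(2.0, [[1.0, 2.0]], [[1.0, 4.0]], weight=0.5))
```

The point of the auxiliary term in this reading is that the stage 2 model needs no text input at inference time: text only shapes the target representation, which is consistent with the abstract's claim of no added model complexity.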

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper proposes a two-stage representation learning framework for electrolaryngeal speech enhancement in seq2seq voice conversion (EL2SP). Stage 1 constructs a network with pretrained speech and text modules to learn integrated representations via three fusion strategies (middle-, input-, and hybrid-level). Stage 2 applies autoencoder-style reconstruction training with an auxiliary reconstruction loss on the integrated representation to enable transfer without added model complexity. The central claim is that the resulting systems, when combined with data augmentations, consistently outperform speech-only baselines across multiple EL2SP datasets, with progressive gains validating the design choices.

Significance. If the empirical outperformance holds under detailed scrutiny, the work offers a practical, extensible approach to mitigating domain mismatch and cumulative mapping errors in EL2SP without increasing model complexity. This could meaningfully advance assistive communication technologies for laryngectomees by leveraging readily available text information alongside speech representations.

major comments (1)
  1. [Abstract] Abstract (Results paragraph): the claim that experiments under different EL2SP datasets consistently demonstrate outperformance is unsupported by any quantitative metrics, error bars, dataset sizes, ablation studies, or statistical tests in the provided text, rendering the central empirical claim unverifiable.
minor comments (1)
  1. [Abstract] Abstract (Methods paragraph): the three fusion strategies and the auxiliary reconstruction loss are described at a high level only; without equations, architecture diagrams, or pseudocode it is difficult to assess how representation transfer avoids cumulative errors.

Simulated Author's Rebuttal

1 response · 0 unresolved

Thank you for the opportunity to respond to the referee's comments. We address the major comment point by point below.

  1. Referee: [Abstract] Abstract (Results paragraph): the claim that experiments under different EL2SP datasets consistently demonstrate outperformance is unsupported by any quantitative metrics, error bars, dataset sizes, ablation studies, or statistical tests in the provided text, rendering the central empirical claim unverifiable.

    Authors: We thank the referee for this observation. The full manuscript contains the requested details (quantitative metrics, error bars, dataset sizes, ablation studies, and statistical tests) in the Experiments and Results sections. However, the abstract's Results paragraph summarizes these findings at a high level without specific numbers. To address the concern and make the abstract self-contained, we will revise it to include key quantitative results supporting the outperformance claim across datasets.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity


The paper describes an empirical two-stage pipeline (pretrained speech-text fusion followed by autoencoder-style reconstruction with auxiliary loss) whose central claim is outperformance on EL2SP datasets versus speech-only baselines. No equations, fitted parameters, or derivations are presented that reduce by construction to inputs, self-citations, or ansatzes. The methodology is self-contained, relying on standard pretrained modules and fusion strategies without load-bearing uniqueness theorems or self-referential definitions. The reader's assessment of score 2 is consistent with the absence of any circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated premise that text auxiliaries improve mapping without new error sources.

pith-pipeline@v0.9.1-grok · 5835 in / 1047 out tokens · 28957 ms · 2026-06-28T12:41:38.069383+00:00 · methodology

