HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
4 Pith papers cite this work.
Citing papers by year: 2026 (4)
Citing papers:
- Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
  Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.
- Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing
  Few-shot TTS adaptation combined with LLM-guided phoneme editing produces synthetic accented speech that improves ASR word error rates on real accented audio even in cross-speaker and ultra-low-data settings.
- Woosh: A Sound Effects Foundation Model
  Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
- LTX-2: Efficient Joint Audio-Visual Foundation Model
  LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.