HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis
6 Pith papers cite this work. Polarity classification is still indexing.
roles: method (2)
polarities: use method (2)
citing papers explorer
- Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation
  Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.
- Step-Audio 2 Technical Report
  Step-Audio 2 integrates a latent audio encoder, reasoning-centric reinforcement learning, and discrete audio token generation into language modeling to deliver state-of-the-art performance on audio understanding and conversational benchmarks.
- Audio Deepfake Detection with Half-Truth Localisation Using Cross-Attentive Feature Fusion
  CAFNet performs joint ternary classification and temporal boundary regression for half-truth audio deepfakes via cross-attentive fusion of MFCC, LFCC, and Chroma-STFT features, reporting 92.71% accuracy and 0.075s MAE on MLADDC T2+T3.
- Woosh: A Sound Effects Foundation Model
  Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.
- LTX-2: Efficient Joint Audio-Visual Foundation Model
  LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.
- Few-Shot Synthetic Accented Speech for ASR Fine-Tuning: What Helps and When?