pith. machine review for the scientific record.

HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

4 Pith papers cite this work. Polarity classification is still indexing.

fields

cs.SD: 3 · cs.CV: 1

years

2026: 4

representative citing papers

Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing

cs.SD · 2026-04-30 · unverdicted · novelty 5.0

Few-shot TTS adaptation combined with LLM-guided phoneme editing produces synthetic accented speech that improves ASR word error rates on real accented audio even in cross-speaker and ultra-low-data settings.

Woosh: A Sound Effects Foundation Model

cs.SD · 2026-04-02 · accept · novelty 5.0

Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.

LTX-2: Efficient Joint Audio-Visual Foundation Model

cs.CV · 2026-01-06 · conditional · novelty 5.0

LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.

citing papers explorer

Showing 4 of 4 citing papers.

  • Tibetan-TTS: Low-Resource Tibetan Speech Synthesis with Large Model Adaptation cs.SD · 2026-05-04 · unverdicted · none · ref 16

    Large-model adaptation with Tibetan text handling produces natural speech from limited data, outperforming commercial systems.

  • Few-Shot Accent Synthesis for ASR with LLM-Guided Phoneme Editing cs.SD · 2026-04-30 · unverdicted · none · ref 39

    Few-shot TTS adaptation combined with LLM-guided phoneme editing produces synthetic accented speech that improves ASR word error rates on real accented audio even in cross-speaker and ultra-low-data settings.

  • Woosh: A Sound Effects Foundation Model cs.SD · 2026-04-02 · accept · none · ref 20

    Woosh is a new publicly released foundation model optimized for high-quality sound effect generation from text or video, showing competitive or better results than open alternatives like Stable Audio Open.

  • LTX-2: Efficient Joint Audio-Visual Foundation Model cs.CV · 2026-01-06 · conditional · none · ref 13

    LTX-2 generates high-quality synchronized audiovisual content from text prompts via an asymmetric 14B-video / 5B-audio dual-stream transformer with cross-attention and modality-aware guidance.