Zero-shot voice conversion with diffusion transformers
6 representative citing papers
- Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling
Poly-SVC converts singing voices from polyphonic recordings while keeping melody, lyrics, and harmonies by combining CQT-based pitch extraction with a conditional flow matching diffusion decoder.
- X-VC: Zero-shot Streaming Voice Conversion in Codec Space
X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion, combining a dual-conditioning acoustic converter with role-assignment training on generated paired data.
- RTCFake: Speech Deepfake Detection in Real-Time Communication
RTCFake is the first large-scale dataset of real-time communication speech deepfakes with matched offline versions, accompanied by a phoneme-guided consistency learning method that improves cross-platform and noise-robust detection.
- How Far Are Video Models from True Multimodal Reasoning?
Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.
- Kimi-Audio Technical Report
Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.
- AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan
AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.