hub Mixed citations

Liu, “Zero-shot Voice Conversion with Diffusion Transformers,”

· 2024 · arXiv 2411.09943

Mixed citation behavior. Most common role is background (60%).

18 Pith papers citing it

Background 60% of classified citations



citation-role summary

background 3 · dataset 1 · method 1

citation-polarity summary

background 3 · use dataset 1 · use method 1

representative citing papers

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing

eess.AS · 2026-06-01 · unverdicted · novelty 7.0

SpeechEditBench provides seven atomic editing tasks, compositional multi-operation instructions, and an anchor-based protocol yielding target success, preservation success, and joint success metrics; evaluations show that no model excels across all dimensions and that compositional editing is especially difficult.

Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling

cs.SD · 2026-05-12 · unverdicted · novelty 7.0

Poly-SVC converts singing voices from polyphonic recordings while keeping melody, lyrics, and harmonies by combining CQT-based pitch extraction with a conditional flow matching diffusion decoder.

X-VC: Zero-shot Streaming Voice Conversion in Codec Space

eess.AS · 2026-04-14 · unverdicted · novelty 7.0

X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.

From Seeing it to Experiencing it: Interactive Evaluation of Intersectional Voice Bias in Human-AI Speech Interaction

cs.HC · 2026-03-19 · unverdicted · novelty 7.0

Voice conversion in interactive studies boosts user trust in SpeechLLM responses while automated metrics detect accent-by-gender disparities in alignment and verbosity.

ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion

eess.AS · 2026-06-20 · unverdicted · novelty 6.0

ProsoCodec models prosody as a conditional residual in a speech codec via text and speaker prefix conditioning, yielding improved prosody preservation and less timbre leakage in voice conversion experiments.

Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach

eess.AS · 2026-06-19 · unverdicted · novelty 6.0

A MoE-enhanced model with conditional distillation reduces speech-NVV EER from 38.93% to 22.66% and speech EER from 13.17% to 9.24% across 10 NVV types.

RTCFake: Speech Deepfake Detection in Real-Time Communication

cs.SD · 2026-04-26 · unverdicted · novelty 6.0

RTCFake is the first large-scale dataset of real-time communication speech deepfakes paired with offline versions, accompanied by a phoneme-guided consistency learning method that improves cross-platform and noise-robust detection.

How Far Are Video Models from True Multimodal Reasoning?

cs.CV · 2026-04-21 · unverdicted · novelty 6.0

Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

AugCodec: A Low-Bitrate Disentangled Neural Speech Codec via Data Augmentation

cs.SD · 2026-06-20 · unverdicted · novelty 5.0

AugCodec disentangles speech into semantic, speaker, and prosody tokens via tailored data augmentations, achieving 12.5 Hz operation with three streams and outperforming prior codecs on LibriSpeech reconstruction and disentanglement metrics.

Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization

cs.SD · 2026-06-18 · unverdicted · novelty 5.0

Zero-VC applies speaker anonymization as a perturbation to achieve strictly causal zero-lookahead streaming voice conversion by balancing timbre leakage against prosodic utility.

Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control

cs.SD · 2026-06-15 · unverdicted · novelty 5.0

VibE-SVC2 extends prior singing voice conversion work with new modules for independent pitch-style and timbre-style control, claiming better performance and finer controllability than existing methods.

From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data

cs.SD · 2026-06-07 · unverdicted · novelty 5.0 · 2 refs

KNN retrieval over WavLM representations creates synthetic source-target pairs from non-parallel data for supervised voice conversion training with a speaker loss, achieving strong results on multilingual test sets despite English-only training.

Kimi-Audio Technical Report

eess.AS · 2025-04-25 · unverdicted · novelty 5.0

Kimi-Audio is an open-source audio foundation model that achieves state-of-the-art results on speech recognition, audio understanding, question answering, and conversation after pre-training on more than 13 million hours of speech, sound, and music data.

Beyond Words: Towards Effective Modeling of Non-Verbal Vocalizations in ASR

eess.AS · 2026-07-02 · unverdicted · novelty 4.0

Three data-centric strategies are studied to improve rare non-verbal vocalization recognition in ASR while preserving lexical accuracy.

Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis

cs.SD · 2026-07-01 · unverdicted · novelty 4.0

A unified guidance framework for Flow Matching speech synthesis achieves nearly 3x faster inference and improved speaker similarity by combining heterogeneous data augmentation with intrinsic model guidance to eliminate CFG overhead.

MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion

eess.AS · 2026-06-08 · unverdicted · novelty 4.0

MeanVC 2 introduces future-receptive chunking and a universal timbre token encoder to achieve lower-latency and more robust streaming zero-shot voice conversion than the original MeanVC.

AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan

cs.SD · 2026-04-09 · unverdicted · novelty 3.0

AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.

Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects

cs.HC · 2025-10-11 · unverdicted · novelty 2.0

A holistic survey of affective computing for intelligent agents covering emotion understanding via multimodal data, affective cognition, emotional expression synthesis, key challenges, and future directions emphasizing generative technologies.

citing papers explorer

Showing 18 of 18 citing papers after filters.

SpeechEditBench: A Bilingual Multi-Attribute Benchmark for Instruction-Guided Speech Editing · eess.AS · 2026-06-01 · unverdicted · none · ref 45
Poly-SVC: Polyphony-Aware Singing Voice Conversion with Harmonic Modeling · cs.SD · 2026-05-12 · unverdicted · none · ref 7
X-VC: Zero-shot Streaming Voice Conversion in Codec Space · eess.AS · 2026-04-14 · unverdicted · none · ref 24
From Seeing it to Experiencing it: Interactive Evaluation of Intersectional Voice Bias in Human-AI Speech Interaction · cs.HC · 2026-03-19 · unverdicted · none · ref 18
ProsoCodec: Prosody-Oriented Speech Codec for Voice Conversion · eess.AS · 2026-06-20 · unverdicted · none · ref 40
Speaker Identity in Non-Verbal Vocalizations: Conditional Distillation and Mixture of Experts Approach · eess.AS · 2026-06-19 · unverdicted · none · ref 13
RTCFake: Speech Deepfake Detection in Real-Time Communication · cs.SD · 2026-04-26 · unverdicted · none · ref 16
How Far Are Video Models from True Multimodal Reasoning? · cs.CV · 2026-04-21 · unverdicted · none · ref 44
AugCodec: A Low-Bitrate Disentangled Neural Speech Codec via Data Augmentation · cs.SD · 2026-06-20 · unverdicted · none · ref 22
Zero-VC: Zero-Lookahead Streaming Voice Conversion via Speaker Anonymization · cs.SD · 2026-06-18 · unverdicted · none · ref 13
Vibrato Expression Control for Singing Voice Conversion with Improving Independent Control · cs.SD · 2026-06-15 · unverdicted · none · ref 47
From A to B to A: Palindromic Zero-Shot Voice Conversion with Non-Parallel Data · cs.SD · 2026-06-07 · unverdicted · none · ref 21 · 2 links
Kimi-Audio Technical Report · eess.AS · 2025-04-25 · unverdicted · none · ref 46
Beyond Words: Towards Effective Modeling of Non-Verbal Vocalizations in ASR · eess.AS · 2026-07-02 · unverdicted · none · ref 5
Enhancing Flow Matching with A Unified Guidance Framework for Efficient and Robust Speech Synthesis · cs.SD · 2026-07-01 · unverdicted · none · ref 20
MeanVC 2: Robust Low-Latency Streaming Zero-Shot Voice Conversion · eess.AS · 2026-06-08 · unverdicted · none · ref 22
AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan · cs.SD · 2026-04-09 · unverdicted · none · ref 38
Intelligent Agents with Emotional Intelligence: Current Trends, Challenges, and Future Prospects · cs.HC · 2025-10-11 · unverdicted · none · ref 95
