Sarashina2.2-TTS achieves SOTA kanji reading accuracy via data scaling and Joyo-kanji-targeted synthesis, introduces the Joyo Kanji Yomi Benchmark and Kana-CER metric, and shows stable cross-lingual performance.
Title resolution pending
10 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.
A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
BareWave develops a waveform-native flow-matching framework for direct text-to-waveform TTS using representation alignment, staged noise scheduling, and velocity-aware perceptual alignment to achieve strong zero-shot voice cloning results.
Shared-score quaternion self-attention reduces score multiplications by 75% and softmax operations from four to one while proving equivalence to component-wise attention under quaternion linear projections.
Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.
A local pipeline integrating diarization, ASR, SER and adversarial LLM reasoning for privacy-preserving conversational analysis, reporting 48.8% SER accuracy on RAVDESS four-class subset with honest cross-corpus limitations.
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.
citing papers explorer
-
Sarashina2.2-TTS: Tackling Kanji Polyphony in Japanese Speech Generation via Data Scaling and Targeted Data Synthesis
Sarashina2.2-TTS achieves SOTA kanji reading accuracy via data scaling and Joyo-kanji-targeted synthesis, introduces the Joyo Kanji Yomi Benchmark and Kana-CER metric, and shows stable cross-lingual performance.
-
X-VC: Zero-shot Streaming Voice Conversion in Codec Space
X-VC achieves zero-shot streaming voice conversion via one-step codec-space conversion with dual-conditioning acoustic converter and role-assignment training on generated paired data.
-
Multi-Faceted Interactivity Alignment in Full-Duplex Speech Models
A multi-axis RL alignment technique improves pause handling, turn-taking, backchanneling, and interruption response in full-duplex spoken dialogue models by optimizing axis-specific rewards derived from human audio segments.
-
BareWave: Waveform-Native Flow-Matching Text-to-Speech
BareWave develops a waveform-native flow-matching framework for direct text-to-waveform TTS using representation alignment, staged noise scheduling, and velocity-aware perceptual alignment to achieve strong zero-shot voice cloning results.
-
Quaternion Self-Attention with Shared Scores
Shared-score quaternion self-attention reduces score multiplications by 75% and softmax operations from four to one while proving equivalence to component-wise attention under quaternion linear projections.
-
Voxtral TTS
Voxtral TTS produces expressive multilingual speech from 3-second reference audio with a hybrid autoregressive-plus-flow-matching architecture and a new VQ-FSQ tokenizer, achieving 68.4% win rate over ElevenLabs in human evaluations.
-
EmotionAI: A Privacy-Preserving Computational Intelligence Pipeline for Speech-Emotion-Grounded Conversational Analysis
A local pipeline integrating diarization, ASR, SER and adversarial LLM reasoning for privacy-preserving conversational analysis, reporting 48.8% SER accuracy on RAVDESS four-class subset with honest cross-corpus limitations.
-
Afrispeech Semantics: Evaluating Audio Semantic Reasoning in Spoken Language Models Across Domains and Accents
Audio language models are benchmarked on five semantic and paralinguistic reasoning tasks to reveal limitations in handling spoken audio evidence, accent variation, and domain shifts.
-
AT-ADD: All-Type Audio Deepfake Detection Challenge Evaluation Plan
AT-ADD introduces standardized tracks and datasets for evaluating audio deepfake detectors on speech under real-world conditions and on diverse unknown audio types to promote generalization beyond speech-centric methods.
- Balalaika: Data-Centric, Prosody-Aware Annotation Pipeline for Russian Speech