AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.

arXiv preprint arXiv:2312.05187
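The abstract does not spell out the quantization mechanism, so the following is only a loose illustration of the general idea of emotion-guided quantization: a toy nearest-neighbor vector quantizer whose assignment distance mixes an acoustic term with an emotion-embedding term. Every name, dimension, and the weighting factor `alpha` are assumptions made for this sketch, not AffectCodec's actual design.

```python
import numpy as np

def quantize(frames, codebook, emo_frames, emo_codebook, alpha=0.5):
    """Assign each frame the codebook index minimizing a joint distance
    (acoustic squared distance + alpha * emotion-embedding squared distance)."""
    # acoustic squared distances between every frame and every code: (T, K)
    d_ac = ((frames[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    # emotion-embedding squared distances: (T, K)
    d_em = ((emo_frames[:, None, :] - emo_codebook[None, :, :]) ** 2).sum(-1)
    idx = (d_ac + alpha * d_em).argmin(axis=1)
    return idx, codebook[idx]

rng = np.random.default_rng(0)
frames = rng.normal(size=(10, 8))        # 10 frames of 8-dim acoustic latents
codebook = rng.normal(size=(16, 8))      # 16 acoustic codes
emo_frames = rng.normal(size=(10, 4))    # per-frame emotion embeddings
emo_codebook = rng.normal(size=(16, 4))  # per-code emotion embeddings
idx, quantized = quantize(frames, codebook, emo_frames, emo_codebook)
print(idx.shape, quantized.shape)  # (10,) (10, 8)
```

Raising `alpha` makes code assignment track the emotion embeddings more closely at the cost of acoustic fit; a trained codec would learn both codebooks jointly rather than fix them as here.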
9 papers cite this work. Representative citing papers from 2026:
- AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
  AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
- NaijaS2ST: A Multi-Accent Benchmark for Speech-to-Speech Translation in Low-Resource Nigerian Languages
  NaijaS2ST introduces a 50-hour multi-accent speech translation dataset for four Nigerian languages and shows audio LLMs excel at speech-to-text but leave substantial room for improvement in speech-to-speech translation.
- Benchmarking Multilingual Speech Models on Pashto: Zero-Shot ASR, Script Failure, and Cross-Domain Evaluation
  Multilingual ASR models show 39.7-297% zero-shot WER on public Pashto data, Whisper models output the correct script in under 0.8% of cases, and fine-tuned models degrade to 32.5-59% WER on out-of-domain sets.
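A WER above 100%, as in the 297% figure above, is possible because WER divides the total count of substitutions, deletions, and insertions by the reference length, and insertions are unbounded. A minimal self-contained sketch:

```python
def wer(ref, hyp):
    """Word error rate: (S + D + I) / len(ref), via Levenshtein distance on words."""
    r, h = ref.split(), hyp.split()
    # dp[i][j] = edit distance between r[:i] and h[:j]
    dp = [[0] * (len(h) + 1) for _ in range(len(r) + 1)]
    for i in range(len(r) + 1):
        dp[i][0] = i
    for j in range(len(h) + 1):
        dp[0][j] = j
    for i in range(1, len(r) + 1):
        for j in range(1, len(h) + 1):
            cost = 0 if r[i - 1] == h[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,        # deletion
                           dp[i][j - 1] + 1,        # insertion
                           dp[i - 1][j - 1] + cost) # substitution or match
    return dp[len(r)][len(h)] / len(r)

# A hypothesis much longer than the reference drives WER past 100%.
print(wer("one two", "a b c d e f"))  # 6 edits / 2 reference words = 3.0 (300%)
```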
- PoDAR: Power-Disentangled Audio Representation for Generative Modeling
  PoDAR disentangles audio signal power from semantic content in latents using power augmentation and consistency objectives, yielding 2x faster convergence and gains of 0.055 speaker similarity and 0.22 UTMOS when applied to Stable Audio VAE with F5-TTS.
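PoDAR's actual objectives are not reproduced here; as a rough sketch of the general power-augmentation idea, the toy encoder below normalizes its "content" latent by signal power, so a gain-rescaled copy of the waveform yields the same content latent and the consistency penalty goes to zero. The encoder, the 4-dim latent, and the loss form are all illustrative assumptions:

```python
import numpy as np

def encode(wav):
    """Toy encoder: the content part is gain-invariant (power-normalized),
    while the power part carries the signal energy."""
    power = np.sqrt((wav ** 2).mean())
    content = wav[:4] / (power + 1e-8)  # pretend 4-dim content latent
    return content, power

def consistency_loss(wav, gain=0.25):
    """Penalize any content-latent difference between a waveform and
    a gain-rescaled copy of it (the power-augmentation consistency idea)."""
    c1, _ = encode(wav)
    c2, _ = encode(gain * wav)
    return float(((c1 - c2) ** 2).mean())

wav = np.sin(np.linspace(0, 8 * np.pi, 1000))
print(consistency_loss(wav))  # ~0: the content latent is unchanged by gain
```

An encoder whose content latent leaked signal power would incur a nonzero penalty here, which is the pressure that pushes power out of the semantic part of the representation.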
- The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation
  Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.
- DM-ASR: Diarization-aware Multi-speaker ASR with Large Language Models
  DM-ASR reformulates multi-speaker ASR as multi-turn dialogue generation conditioned on diarization results, achieving competitive benchmark performance with relatively small models and limited data.
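DM-ASR's exact prompt format is not given above; purely as an illustration of conditioning a multi-turn generation task on diarization output, one could serialize diarizer segments into speaker-tagged turns. The tag syntax below is a made-up assumption, not DM-ASR's format:

```python
def build_turns(segments):
    """Serialize diarizer output into one tagged line per turn.

    segments: list of (speaker_id, start_s, end_s) tuples from a diarizer.
    """
    turns = []
    for spk, start, end in sorted(segments, key=lambda s: s[1]):
        turns.append(f"<{spk}|{start:.1f}-{end:.1f}> transcribe this turn:")
    return "\n".join(turns)

print(build_turns([("spk1", 0.0, 2.3), ("spk2", 1.9, 4.0)]))
```

The point of such a reformulation is that each turn becomes an ordinary conditioned-generation step for the LLM, so speaker attribution is handled by the diarizer rather than learned from scratch.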
- MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
  MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and emotional fidelity.
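MoVE's architecture details are not reproduced here; the sketch below only illustrates the generic pattern of a soft router mixing LoRA expert adapters, where each expert is a low-rank update A_i B_i and softmax router scores weight their sum. All dimensions, initializations, and the routing input are assumptions:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

rng = np.random.default_rng(0)
d, r, n_experts = 16, 4, 3
W = rng.normal(size=(d, d))                    # frozen base weight
A = rng.normal(size=(n_experts, d, r)) * 0.01  # LoRA down-projections
B = rng.normal(size=(n_experts, r, d)) * 0.01  # LoRA up-projections
router = rng.normal(size=(d, n_experts))       # router weights

def forward(x):
    """Apply the base weight plus a softly routed mixture of LoRA updates."""
    gates = softmax(x @ router)                # soft expert weights, sum to 1
    delta = sum(g * (A[i] @ B[i]) for i, g in enumerate(gates))
    return x @ (W + delta)

x = rng.normal(size=(d,))
print(forward(x).shape)  # (16,)
```

A soft router keeps every expert differentiable on every example, unlike hard top-1 routing, which is one plausible reason to prefer it when experts specialize in rare events such as laughter or crying.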
- DialogueSidon: Recovering Full-Duplex Dialogue Tracks from In-the-Wild Dialogue Audio
  DialogueSidon recovers separate speaker tracks from mixed in-the-wild dialogue audio by compressing SSL features with a VAE and predicting clean latents via diffusion.
- "OK Aura, Be Fair With Me": Demographics-Agnostic Training for Bias Mitigation in Wake-up Word Detection
  Demographics-agnostic training with augmentation and distillation reduces predictive disparity in wake-up word detection by 40-84% across demographic groups.