Qwen3-TTS Technical Report

Hangrui Hu, Xinfa Zhu, Ting He, Dake Guo, Bin Zhang, Xiong Wang · 2026 · cs.SD · arXiv 2601.15621

Mixed citation behavior. Most common role is background (50%).

45 Pith papers citing it

Background 50% of classified citations

abstract

In this report, we present the Qwen3-TTS series, a family of advanced multilingual, controllable, robust, and streaming text-to-speech models. Qwen3-TTS supports state-of-the-art 3-second voice cloning and description-based control, allowing both the creation of entirely novel voices and fine-grained manipulation of the output speech. Trained on over 5 million hours of speech data spanning 10 languages, Qwen3-TTS adopts a dual-track LM architecture for real-time synthesis, coupled with two speech tokenizers: 1) Qwen-TTS-Tokenizer-25Hz is a single-codebook codec emphasizing semantic content, which offers seamless integration with Qwen-Audio and enables streaming waveform reconstruction via a block-wise DiT. 2) Qwen-TTS-Tokenizer-12Hz achieves extreme bitrate reduction and ultra-low-latency streaming, enabling immediate first-packet emission ($97\,\mathrm{ms}$) through its 12.5 Hz, 16-layer multi-codebook design and a lightweight causal ConvNet. Extensive experiments indicate state-of-the-art performance across diverse objective and subjective benchmarks (e.g., TTS multilingual test set, InstructTTSEval, and our long speech test set). To facilitate community research and development, we release both tokenizers and models under the Apache 2.0 license.
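The tokenizer figures quoted in the abstract can be related with some simple arithmetic. The sketch below is only illustrative: it assumes the 25Hz tokenizer runs at 25 frames per second (inferred from its name) with one codebook, and it takes the 12.5 Hz, 16-codebook figures from the abstract. The abstract does not state a codebook size, so the `codebook_size` value used for the bitrate estimate is a hypothetical placeholder, not a number from the report.

```python
import math


def tokens_per_second(frame_rate_hz: float, num_codebooks: int) -> float:
    """Discrete tokens emitted per second of audio."""
    return frame_rate_hz * num_codebooks


def frame_duration_ms(frame_rate_hz: float) -> float:
    """Duration of audio covered by one token frame, in milliseconds."""
    return 1000.0 / frame_rate_hz


def bitrate_bps(frame_rate_hz: float, num_codebooks: int, codebook_size: int) -> float:
    """Nominal bitrate if every codebook entry costs log2(codebook_size) bits."""
    return frame_rate_hz * num_codebooks * math.log2(codebook_size)


# 12.5 Hz, 16 codebooks: figures stated in the abstract.
rate_12hz = tokens_per_second(12.5, 16)
# 25 Hz, 1 codebook: frame rate assumed from the tokenizer's name.
rate_25hz = tokens_per_second(25.0, 1)
# One 12.5 Hz frame spans 80 ms of audio.
frame_ms_12hz = frame_duration_ms(12.5)

# HYPOTHETICAL codebook size, chosen only to make the formula concrete.
example_codebook_size = 2048
example_bitrate = bitrate_bps(12.5, 16, example_codebook_size)

print(f"12.5 Hz x 16 codebooks: {rate_12hz} tokens/s")
print(f"25 Hz x 1 codebook:     {rate_25hz} tokens/s")
print(f"12.5 Hz frame duration: {frame_ms_12hz} ms")
print(f"Bitrate with a {example_codebook_size}-entry codebook: {example_bitrate} bit/s")
```

Under these assumptions a single 12.5 Hz frame covers 80 ms of audio, which is the same order of magnitude as the 97 ms first-packet figure the abstract reports. How the report itself accounts for that latency is not described here.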

citation-role summary

background: 3 · baseline: 2 · method: 1

citation-polarity summary

background: 3 · baseline: 2 · use method: 1

representative citing papers

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling

eess.AS · 2026-06-02 · unverdicted · novelty 8.0

WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.

FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model

cs.SD · 2026-06-30 · unverdicted · novelty 7.0

FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.

JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation

cs.CV · 2026-06-02 · unverdicted · novelty 7.0

JAVEdit-100k is the first large-scale dataset for instruction-guided joint audio-visual video editing, accompanied by JAVEditBench and the JAVEdit model that outperforms baselines on five of six metrics.

Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts

cs.SD · 2026-05-27 · unverdicted · novelty 7.0

PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.

Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech

eess.AS · 2026-05-10 · unverdicted · novelty 7.0

GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.

VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing

cs.CL · 2026-05-07 · unverdicted · novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.

Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling

cs.CV · 2026-04-26 · unverdicted · novelty 7.0

Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.

MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech

eess.AS · 2026-04-20 · unverdicted · novelty 7.0

MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.

NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations

cs.SD · 2026-04-17 · unverdicted · novelty 7.0

NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.

HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

eess.AS · 2026-04-13 · unverdicted · novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.

Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark

cs.CL · 2026-04-12 · unverdicted · novelty 7.0

CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.

CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

cs.SD · 2026-04-09 · unverdicted · novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling

cs.SD · 2026-06-30 · unverdicted · novelty 6.0

UniSAE unifies speaker, emotion, and multi-granularity content editing in speech via a new discrete phonetic posteriorgram representation and diffusion-based rendering.

HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech

eess.AS · 2026-06-26 · unverdicted · novelty 6.0

HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.

What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study

cs.SD · 2026-06-08 · unverdicted · novelty 6.0

Controllable neural TTS reveals that loudness drives human sarcasm perception while a model prioritizes speech rate.

EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis

cs.CL · 2026-06-08 · unverdicted · novelty 6.0

EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-based TTS pipeline.

dots.tts Technical Report

cs.SD · 2026-06-05 · unverdicted · novelty 6.0

dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.

CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding

cs.SD · 2026-06-03 · unverdicted · novelty 6.0

CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.

LaSR: Context-Aware Speech Recognition via Latent Reasoning

cs.CL · 2026-05-30 · unverdicted · novelty 6.0

LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.

MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors

cs.SD · 2026-05-29 · unverdicted · novelty 6.0

MindVoice disentangles neural-to-speech reconstruction into semantic and acoustic pathways using pretrained priors, then fuses them with speech generation models to produce intelligible output from non-invasive recordings.

Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text

cs.SD · 2026-05-27 · unverdicted · novelty 6.0

Dasheng AudioGen uses multi-view captions and a unified semantic-acoustic representation to enable end-to-end generation of mixed audio scenes from text descriptions.

Learning When to Think While Listening in Large Audio-Language Models

cs.CL · 2026-05-26 · unverdicted · novelty 6.0

A wait-think-answer controller for LALMs is trained via SFT followed by six-reward DAPO, raising row-weighted accuracy from 67.6% to 70.3% and cutting post-endpoint thinking length by 14% on synthetic spoken QA while remaining functional on real recorded audio.

The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation

eess.AS · 2026-04-29 · unverdicted · novelty 6.0

Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.

TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis

cs.CL · 2026-04-24 · unverdicted · novelty 6.0

TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment on a 1,600-sample Mandarin test set while profiling six TTS paradigms.

citing papers explorer

Showing 45 of 45 citing papers.

WavTTS: Towards High-Quality Zero-Shot TTS via Direct Raw Waveform Modeling eess.AS · 2026-06-02 · unverdicted · none · ref 31 · internal anchor
WavTTS is the first raw-waveform diffusion TTS model using DiT flow matching and multi-scale mel supervision that approaches SOTA latent zero-shot performance while beating prior end-to-end models.
FlexiSLM: A Dynamic and Controllable Frame Rate Spoken Language Model cs.SD · 2026-06-30 · unverdicted · none · ref 269 · internal anchor
FlexiSLM is the first spoken language model supporting dynamic and controllable frame rates on speech input and output, outperforming fixed-rate 7B models at high quality and enabling faster inference at lower rates like 6.25 Hz.
JAVEDIT: Joint Audio-Visual Instruction-Guided Video Editing with Agentic Data Curation cs.CV · 2026-06-02 · unverdicted · none · ref 26 · internal anchor
JAVEdit-100k is the first large-scale dataset for instruction-guided joint audio-visual video editing, accompanied by JAVEditBench and the JAVEdit model that outperforms baselines on five of six metrics.
Unified Synthesis of Compositional Speech and Sound from Free-Form Text Prompts cs.SD · 2026-05-27 · unverdicted · none · ref 4 · internal anchor
PlanAudio introduces a unified autoregressive LLM framework with semantic latent chain-of-thought for generating composite speech and sound audio from free-form text, plus a new benchmark.
Kinetic-Optimal Scheduling with Moment Correction for Metric-Induced Discrete Flow Matching in Zero-Shot Text-to-Speech eess.AS · 2026-05-10 · unverdicted · none · ref 13 · internal anchor
GibbsTTS combines a training-free kinetic-optimal scheduler with finite-step moment correction in MI-DFM to deliver top naturalness and strong speaker similarity in zero-shot TTS.
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing cs.CL · 2026-05-07 · unverdicted · none · ref 113 · internal anchor
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
Talker-T2AV: Joint Talking Audio-Video Generation with Autoregressive Diffusion Modeling cs.CV · 2026-04-26 · unverdicted · none · ref 7 · internal anchor
Talker-T2AV achieves better lip-sync accuracy, video quality, and audio quality than dual-branch baselines by separating high-level shared autoregressive modeling from modality-specific low-level diffusion refinement in a joint audio-video generation framework.
MINT-Bench: A Comprehensive Multilingual Benchmark for Instruction-Following Text-to-Speech eess.AS · 2026-04-20 · unverdicted · none · ref 30 · internal anchor
MINT-Bench is a new benchmark using hierarchical taxonomy, multi-stage data pipeline, and hybrid evaluation to assess instruction-following TTS systems, revealing major gaps in compositional and paralinguistic controls.
NVBench: A Benchmark for Speech Synthesis with Non-Verbal Vocalizations cs.SD · 2026-04-17 · unverdicted · none · ref 44 · internal anchor
NVBench provides a standardized bilingual benchmark and evaluation protocol for assessing non-verbal vocalization generation, placement, and salience in text-to-speech systems.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models eess.AS · 2026-04-13 · unverdicted · none · ref 32 · internal anchor
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semantic conflict resolution.
Knowing What to Stress: A Discourse-Conditioned Text-to-Speech Benchmark cs.CL · 2026-04-12 · unverdicted · none · ref 15 · internal anchor
CAST benchmark shows language models infer correct word stress from discourse context but TTS systems frequently fail to produce it in speech.
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation cs.SD · 2026-04-09 · unverdicted · none · ref 15 · internal anchor
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
UniSAE: Unified Speech Attribute Editing on Speaker, Emotion and Low-Level Content via Discrete Phonetic Posteriorgram Modelling cs.SD · 2026-06-30 · unverdicted · none · ref 8 · internal anchor
UniSAE unifies speaker, emotion, and multi-granularity content editing in speech via a new discrete phonetic posteriorgram representation and diffusion-based rendering.
HPRO: Hierarchical Progressive Reward Optimization via Preference Extraction for Emotional Text-to-Speech eess.AS · 2026-06-26 · unverdicted · none · ref 3 · internal anchor
HPRO uses a differentiable HD-Emo codec to extract separate content and style tokens and progressively aligns frame-, word-, and sentence-level rewards to improve emotional expressiveness in TTS while preserving intelligibility.
What Makes Synthetic Speech Sound Sarcastic? A Prosody-Controlled Perception Study cs.SD · 2026-06-08 · unverdicted · none · ref 22 · internal anchor
Controllable neural TTS reveals that loudness drives human sarcasm perception while a model prioritizes speech rate.
EmoInstruct-TTS: Dual-Path Instruction-Guided Emotional Speech Synthesis cs.CL · 2026-06-08 · unverdicted · none · ref 28 · internal anchor
EmoInstruct-TTS uses Emotion2embed and an Instruction-Conditioned Emotion Flow Model (ICE-Flow) to generate acoustically grounded emotion representations from free-form instructions and integrate them into an LLM-based TTS pipeline.
dots.tts Technical Report cs.SD · 2026-06-05 · unverdicted · none · ref 4 · internal anchor
dots.tts reports SOTA benchmark results on Seed-TTS-Eval and other tests via continuous latent-space autoregressive modeling with three listed innovations and code release.
CleanCodec: Efficient and Robust Speech Tokenization via Perceptually Guided Encoding cs.SD · 2026-06-03 · unverdicted · none · ref 28 · internal anchor
CleanCodec reframes audio tokenization as a selective information bottleneck to encode only perceptually important features at 12.5 tokens per second, outperforming prior codecs in efficiency, speaker similarity, and intelligibility.
LaSR: Context-Aware Speech Recognition via Latent Reasoning cs.CL · 2026-05-30 · unverdicted · none · ref 33 · internal anchor
LaSR improves context-aware terminology recognition in speech LLMs by aligning latent CoT supervision on acoustic regions and introducing latent reasoning periods, shown on a new academic corpus to outperform standard fine-tuning without added latency.
MindVoice: Reconstructing Intelligible Speech from Non-invasive Neural Signals with Pretrained Priors cs.SD · 2026-05-29 · unverdicted · none · ref 34 · internal anchor
MindVoice disentangles neural-to-speech reconstruction into semantic and acoustic pathways using pretrained priors, then fuses them with speech generation models to produce intelligible output from non-invasive recordings.
Dasheng AudioGen: A Unified Model for Generating Coherent Audio Scenes from Text cs.SD · 2026-05-27 · unverdicted · none · ref 1 · internal anchor
Dasheng AudioGen uses multi-view captions and a unified semantic-acoustic representation to enable end-to-end generation of mixed audio scenes from text descriptions.
Learning When to Think While Listening in Large Audio-Language Models cs.CL · 2026-05-26 · unverdicted · none · ref 18 · internal anchor
A wait-think-answer controller for LALMs is trained via SFT followed by six-reward DAPO, raising row-weighted accuracy from 67.6% to 70.3% and cutting post-endpoint thinking length by 14% on synthetic spoken QA while remaining functional on real recorded audio.
The False Resonance: A Critical Examination of Emotion Embedding Similarity for Speech Generation Evaluation eess.AS · 2026-04-29 · unverdicted · none · ref 62 · internal anchor
Emotion embedding similarities are unsuitable for zero-shot evaluation of emotional expressiveness in speech generation due to confounding by non-emotional acoustic features.
TTS-PRISM: A Perceptual Reasoning and Interpretable Speech Model for Fine-Grained Diagnosis cs.CL · 2026-04-24 · unverdicted · none · ref 10 · internal anchor
TTS-PRISM defines a 12-dimensional perceptual schema, builds a targeted diagnostic dataset via adversarial synthesis and expert labels, and tunes an end-to-end model that outperforms generalist LLMs in human alignment on a 1,600-sample Mandarin test set while profiling six TTS paradigms.
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use cs.SD · 2026-04-17 · unverdicted · none · ref 29 · internal anchor
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models cs.CL · 2026-04-11 · unverdicted · none · ref 27 · internal anchor
ASPIRin decouples speaking timing from token content via binary action space projection and applies GRPO with rule-based rewards to optimize interactivity in SLMs without semantic collapse or repetition.
OmniVoice: Towards Omnilingual Zero-Shot Text-to-Speech with Diffusion Language Models cs.CL · 2026-04-01 · unverdicted · none · ref 42 · internal anchor
OmniVoice introduces a diffusion language model-style non-autoregressive TTS system that directly maps text to multi-codebook acoustic tokens, scaling zero-shot synthesis to over 600 languages with SOTA results on multilingual benchmarks using 581k hours of open data.
How to Leverage Synthetic Speech for LLM-Based ASR Systems? cs.CL · 2026-06-27 · unverdicted · none · ref 33 · internal anchor
Layer selection plus RIR augmentation on synthetic speech matches full real-data ASR performance using 25% real speech in SLAM-ASR.
ContextCodec: Content-Focused Context Guidance for Ultra-Low Bitrate Speech Coding cs.SD · 2026-06-09 · unverdicted · none · ref 12 · internal anchor
ContextCodec uses a dual-branch encoder with CLIP-style contrastive training on phoneme-aligned context features plus autoregressive refinement to improve quality-intelligibility at bitrates down to 500 bps.
End-to-End Training for Discrete Token LLM based TTS System cs.SD · 2026-06-08 · unverdicted · none · ref 3 · internal anchor
An end-to-end optimization framework jointly trains the speech tokenizer, LLM, FM model, and reward model for discrete-token TTS, reporting new SOTA WER of 0.78% and 1.56% on Seed-TTS-Eval with 0.6B LLM and 0.5B FM.
FlashTTS: Fast Streaming TTS with MTP Acceleration and X-pred Mean Flow Distillation eess.AS · 2026-06-08 · unverdicted · none · ref 15 · internal anchor
FlashTTS delivers a streaming TTS system using multi-track input processing and X-pred mean flow matching to reach 325 ms latency in two function evaluations while retaining zero-shot voice cloning.
VoxCPM2 Technical Report cs.SD · 2026-06-05 · unverdicted · none · ref 15 · internal anchor
VoxCPM2 scales hierarchical continuous-latent speech modeling to 2B parameters and over 2M hours of multilingual data, unifying voice cloning, style control, and continuation in one backbone with open release.
Do speech foundation models perceive speaker similarity as humans do? cs.SD · 2026-06-04 · unverdicted · none · ref 25 · internal anchor
The study compares speaker embeddings from more than 40 speech foundation models with human subjective similarity scores and identifies model factors that better align with human perception.
Sympatheia: Emotionally Adaptive Voice Assistant with Continuous Affect Conditioning cs.SD · 2026-05-30 · unverdicted · none · ref 27 · internal anchor
Sympatheia introduces a continuous affect-conditioned speech dialogue model and the Sympatheia-18k synthetic dataset, showing improved emotional appropriateness over baselines when speech cues are limited.
DUET: Unified Dual-Space Emotion Control for Diffusion and Flow-Matching Driven Text-to-Speech cs.SD · 2026-05-20 · unverdicted · none · ref 17 · internal anchor
DUET enables fine-grained emotion control in pretrained diffusion and flow-matching TTS models via unified hidden-space steering and mel-space guidance, outperforming supervised baselines on multiple backbones.
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech eess.AS · 2026-05-20 · unverdicted · none · ref 12 · internal anchor
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
Omni-Customizer: End-to-End MultiModal Customization for Joint Audio-Video Generation cs.CV · 2026-05-17 · unverdicted · none · ref 24 · internal anchor
Omni-Customizer proposes an end-to-end framework using Omni-Context Fusion, Masked TTS Cross-Attention, Semantic-Anchored Multimodal RoPE, and specialized training curricula to achieve precise multimodal identity binding in joint audio-video generation.
JaiTTS: A Thai Voice Cloning Model cs.CL · 2026-04-30 · unverdicted · none · ref 10 · 2 links · internal anchor
JaiTTS-v1.0 achieves 1.94% CER on short Thai speech, beating human ground truth of 1.98%, matches humans on long speech, and wins 283 of 400 human comparisons against commercial systems.
MAVIN: Multi-Shot Audio-Visual Generation with Narrative Control cs.CV · 2026-06-28 · unverdicted · none · ref 26 · internal anchor
MAVIN proposes boundary-aware attention, ID-aware propagation, a multi-agent scripting pipeline, and the MAVINSet dataset as the first framework for multi-shot audio-visual generation with narrative control, claiming SOTA results.
HydraQE: OSU's Submission for the IWSLT 2026 Speech Translation Metrics Shared Task cs.CL · 2026-06-07 · unverdicted · none · ref 12 · internal anchor
HydraQE is a new end-to-end speech translation QE system using Qwen3-ASR backbone, sparsemax layer mixing, bidirectional Transformer, and multi-task curriculum training on human and pseudo labels that outperforms cascaded baselines.
PilotTTS: A Disciplined Modular Recipe for Competitive Speech Synthesis cs.SD · 2026-05-26 · unverdicted · none · ref 10 · internal anchor
PilotTTS achieves lowest WER 1.50% (en) and CER 0.87% (zh) plus highest speaker similarity on Seed-TTS Eval using a Q-Former conditioned autoregressive architecture and a released multi-stage open data pipeline.
RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations eess.AS · 2026-05-10 · unverdicted · none · ref 24 · 3 links · internal anchor
RADAR Challenge 2026 organizes a multilingual audio deepfake detection benchmark with media transformations, reporting participation from 33 development and 22 evaluation teams using EER metric.
EdgeFM: Efficient Edge Inference for Vision-Language Models cs.CV · 2026-04-30 · unverdicted · none · ref 3 · 2 links · internal anchor
EdgeFM is an agent-driven VLM inference framework achieving up to 1.49x speedup over TensorRT-Edge-LLM on NVIDIA Orin and first end-to-end deployment on Horizon Journey platform.
One Voice, Many Tongues: Cross-Lingual Voice Cloning for Scientific Speech eess.AS · 2026-04-28 · unverdicted · none · ref 3 · internal anchor
Authors submit a cross-lingual voice cloning system to IWSLT 2026 using OmniVoice fine-tuned on ensemble-distilled synthetic data, reporting gains in WER, CER, and speaker similarity for scientific texts in three languages.
KIT's Submission to Cross-Lingual Voice Cloning in IWSLT 2026 cs.CL · 2026-06-05 · unverdicted · none · ref 11 · internal anchor
KIT's IWSLT 2026 submission adapts a multilingual TTS model with language prompting, RL fine-tuning, and reference-conditioned lexical matching, reporting largest gains from prompting.
