Interleaved SLMs implicitly transcribe spoken words to text tokens in middle layers (top candidate for 77% of data) before predicting in text space and returning to speech.
Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Gold- berg
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
SEAM achieves 0.971 ROC-AUC on external interview data for real-time scripted speech detection by combining shortcut-prevention data techniques with a compact audio backbone.
A hybrid semi-supervised framework fusing Whisper embeddings with acoustic and prosodic features achieves 0.751 Macro-F1 for speaker confidence detection and outperforms baselines including WavLM, HuBERT, and Wav2Vec 2.0.
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
RADAR Challenge 2026 organizes a multilingual audio deepfake detection benchmark with media transformations, reporting participation from 33 development and 22 evaluation teams using EER metric.
citing papers explorer
-
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.