Interleaved SLMs implicitly transcribe spoken words to text tokens in middle layers (top candidate for 77% of data) before predicting in text space and returning to speech.
Mor Geva, Avi Caciularu, Kevin Wang, and Yoav Gold- berg
8 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
representative citing papers
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
SEAM achieves 0.971 ROC-AUC on external interview data for real-time scripted speech detection by combining shortcut-prevention data techniques with a compact audio backbone.
A hybrid semi-supervised framework fusing Whisper embeddings with acoustic and prosodic features achieves 0.751 Macro-F1 for speaker confidence detection and outperforms baselines including WavLM, HuBERT, and Wav2Vec 2.0.
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
RADAR Challenge 2026 organizes a multilingual audio deepfake detection benchmark with media transformations, reporting participation from 33 development and 22 evaluation teams using EER metric.
citing papers explorer
-
Interleaved Speech Language Models Latently Work In Text
Interleaved SLMs implicitly transcribe spoken words to text tokens in middle layers (top candidate for 77% of data) before predicting in text space and returning to speech.
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conversational benchmarks.
-
SEAM: Shortcut-Aware Real-Time Detection of Scripted vs. Spontaneous Speech for Interview Guardrails
SEAM achieves 0.971 ROC-AUC on external interview data for real-time scripted speech detection by combining shortcut-prevention data techniques with a compact audio backbone.
-
A Semi-Supervised Framework for Speech Confidence Detection using Whisper
A hybrid semi-supervised framework fusing Whisper embeddings with acoustic and prosodic features achieves 0.751 Macro-F1 for speaker confidence detection and outperforms baselines including WavLM, HuBERT, and Wav2Vec 2.0.
-
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
-
StableToken: A Noise-Robust Semantic Speech Tokenizer for Resilient SpeechLLMs
StableToken introduces a multi-branch architecture with bit-wise voting to create noise-robust semantic speech tokens, achieving lower Unit Edit Distance and better SpeechLLM robustness than prior single-path tokenizers.
-
Raon-OpenTTS: Open Models and Data for Robust Text-to-Speech
Raon-OpenTTS provides an open 510K-hour curated speech dataset and DiT-based TTS models up to 1B parameters that achieve competitive WER and speaker similarity on benchmarks versus closed models trained on millions of hours.
-
RADAR Challenge 2026: Robust Audio Deepfake Recognition under Media Transformations
RADAR Challenge 2026 organizes a multilingual audio deepfake detection benchmark with media transformations, reporting participation from 33 development and 22 evaluation teams using EER metric.