Step-Audio 2 Technical Report
Pith reviewed 2026-05-16 05:55 UTC · model grok-4.3
The pith
Step-Audio 2 integrates latent audio encoding and discrete token generation to deliver state-of-the-art audio understanding and expressive end-to-end speech conversation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Step-Audio 2 demonstrates that an integrated architecture using latent audio encoding, reasoning-centric reinforcement learning, discrete audio token generation within language modeling, and retrieval-augmented generation produces stronger automatic speech recognition, audio understanding, and responsive conversational output than prior separate-component systems.
What carries the argument
Discrete audio token generation embedded in the language modeling process, which enables direct responsiveness to paralinguistic cues such as emotion and style.
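A minimal sketch of what "discrete audio tokens embedded in language modeling" can look like mechanically: one shared vocabulary in which codec tokens are offset past the text range, so a single next-token head can emit either stream and a decoder can route the output. All sizes, IDs, and the routing scheme below are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch: text tokens and discrete audio-codec tokens share one
# vocabulary; audio IDs are shifted by an offset so the two ranges never collide.
TEXT_VOCAB = 50_000          # assumed text vocabulary size
AUDIO_VOCAB = 6_561          # assumed codec codebook size (not from the paper)
AUDIO_OFFSET = TEXT_VOCAB    # audio IDs occupy [AUDIO_OFFSET, AUDIO_OFFSET + AUDIO_VOCAB)

def is_audio_token(tok: int) -> bool:
    """A token is 'audio' if it falls in the offset-shifted codec range."""
    return AUDIO_OFFSET <= tok < AUDIO_OFFSET + AUDIO_VOCAB

def split_streams(sequence):
    """Route a mixed token sequence back into its text and audio streams:
    text IDs go to a detokenizer, codec IDs go to a vocoder for synthesis."""
    text, audio = [], []
    for tok in sequence:
        if is_audio_token(tok):
            audio.append(tok - AUDIO_OFFSET)  # undo the offset before codec decoding
        else:
            text.append(tok)
    return text, audio
```

Because both modalities live in one autoregressive stream, paralinguistic cues in the audio context can directly condition the next audio token, which is the mechanism the carrier claim rests on.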
If this is right
- Direct modeling of paralinguistic information reduces the need for separate emotion or style modules in conversational agents
- Tool calling for web search and audio retrieval measurably lowers hallucination rates in spoken responses
- End-to-end discrete token output supports lower-latency turn-taking in multi-turn dialogue
- Scaling to millions of hours of training data yields consistent gains across diverse conversational domains
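The tool-calling prediction above can be pictured as a small dispatch loop: the model emits a structured call, the runtime executes the matching tool, and the result is fed back as context. The tool names, stub bodies, and JSON call shape here are assumptions for illustration, not the model's actual interface.

```python
import json

def web_search(query: str) -> str:
    # Stub: a real system would query a search backend to ground the answer.
    return f"results for: {query}"

def audio_search(timbre: str) -> str:
    # Stub: a real system would retrieve a reference voice clip for timbre switching.
    return f"voice sample tagged: {timbre}"

# Registry mapping tool names (assumed, not from the paper) to implementations.
TOOLS = {"web_search": web_search, "audio_search": audio_search}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching tool."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])
```

Measuring the hallucination-rate claim would then amount to comparing spoken answers generated with and without the dispatch step on knowledge-heavy queries.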
Where Pith is reading between the lines
- The same token-generation approach could be applied to video or sensor streams to create unified multi-modal conversational systems
- External tool integration may allow future models to maintain up-to-date knowledge without full retraining
- If the RL component generalizes well, similar reasoning-centric training could improve robustness in low-resource languages or noisy environments
Load-bearing premise
The combination of latent encoding, RL reasoning, discrete tokens, and RAG produces robust performance on real-world conversational audio beyond the evaluated benchmarks.
What would settle it
A new test set of long-form conversational audio with varied emotions and accents where Step-Audio 2 shows no accuracy or naturalness advantage over strong baseline models.
Original abstract
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Step-Audio 2, an end-to-end multi-modal LLM for audio understanding and conversational speech. It combines a latent audio encoder, reasoning-centric reinforcement learning, discrete audio token generation to capture paralinguistic cues, and RAG with external tool calling (web search, audio search) to reduce hallucinations. Trained on millions of hours of speech and audio data, the model claims state-of-the-art results on various audio understanding and conversational benchmarks relative to open-source and commercial baselines.
Significance. If the performance claims are substantiated with complete benchmark tables, baselines, and ablations, the work would advance practical end-to-end audio LLMs by showing how RL-driven reasoning and RAG can be integrated with discrete token modeling for expressive, low-hallucination conversation. The industry-oriented framing and emphasis on real-world tool use are strengths.
Major comments (2)
- [Evaluation] The SOTA claim is presented without named benchmarks (e.g., LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. This omission makes the central empirical result, which is load-bearing for the paper's contribution, unverifiable.
- [§3 (Architecture) and §4 (Training)] The claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.
Minor comments (2)
- [Abstract] Phrases such as 'promising performance' and 'state-of-the-art' are used without accompanying metrics or qualifiers.
- [Conclusion] The GitHub link is provided but no details on released code, checkpoints, or evaluation scripts are given in the text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and verifiability of the results.
Point-by-point responses
- Referee: [Evaluation] The SOTA claim is presented without named benchmarks (e.g., LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. This omission makes the central empirical result, which is load-bearing for the paper's contribution, unverifiable.
Authors: We acknowledge that the current evaluation section provides only high-level SOTA claims without the requested specifics. In the revised manuscript we will expand this section to explicitly name the benchmarks (including LibriSpeech, CommonVoice, and conversational test sets), list exact baseline versions, define all metrics, specify data splits, report error bars where available, and include ablation results isolating the RL and RAG components. Revision: yes
- Referee: [§3 (Architecture) and §4 (Training)] The claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.
Authors: We agree that the claim would be stronger with direct quantitative support. The revised version will add comparisons to continuous-token and non-RL baselines, reporting relevant metrics that demonstrate the contribution of discrete token generation to paralinguistic responsiveness. Revision: yes
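The error bars promised in the rebuttal could be produced with a percentile bootstrap over utterances, which is a standard way to attach a confidence interval to corpus-level WER. The metric implementation and resampling scheme below are my assumptions about how such reporting might be done, not the authors' evaluation code.

```python
import random

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def corpus_wer(pairs):
    """Corpus-level WER over (reference, hypothesis) string pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words

def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus WER, resampling utterances."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Reporting the interval alongside the point estimate would let readers judge whether claimed SOTA margins exceed sampling noise.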
Circularity Check
No significant circularity detected
Full rationale
The paper is an empirical technical report on model architecture, training data, and benchmark results with no mathematical derivations, equations, or self-referential definitions present. Performance claims reference external benchmarks and datasets rather than quantities defined or fitted inside the paper. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claims reduce to standard empirical evaluation and are therefore self-contained.
Forward citations
Cited by 20 Pith papers
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
  Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
  SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
  HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
  CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
  Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
- TiCo: Time-Controllable Spoken Dialogue Model
  TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
  A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
  VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
- VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
  VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
- Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
  Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
  A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
  A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
  A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
- Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
  Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
- Step-Audio-R1.5 Technical Report
  Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.
- NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
  NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.
- OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
  OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.