Step-Audio 2 Technical Report
Pith reviewed 2026-05-16 05:55 UTC · model grok-4.3
The pith
Step-Audio 2 integrates latent audio encoding and discrete token generation to deliver state-of-the-art audio understanding and expressive end-to-end speech conversation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Step-Audio 2 demonstrates that an integrated architecture using latent audio encoding, reasoning-centric reinforcement learning, discrete audio token generation within language modeling, and retrieval-augmented generation produces stronger automatic speech recognition, audio understanding, and responsive conversational output than prior separate-component systems.
What carries the argument
Discrete audio token generation embedded in the language modeling process, which enables direct responsiveness to paralinguistic cues such as emotion and style.
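A minimal sketch of what "discrete audio tokens embedded in language modeling" can look like mechanically: one shared vocabulary in which codec tokens are offset past the text range, so a single next-token head can emit either stream and a decoder can route the output. All sizes, IDs, and the routing scheme below are illustrative assumptions, not the paper's actual configuration.

```python
# Hypothetical sketch: text tokens and discrete audio-codec tokens share one
# vocabulary; audio IDs are shifted by an offset so the two ranges never collide.
TEXT_VOCAB = 50_000          # assumed text vocabulary size
AUDIO_VOCAB = 6_561          # assumed codec codebook size (not from the paper)
AUDIO_OFFSET = TEXT_VOCAB    # audio IDs occupy [AUDIO_OFFSET, AUDIO_OFFSET + AUDIO_VOCAB)

def is_audio_token(tok: int) -> bool:
    """A token is 'audio' if it falls in the offset-shifted codec range."""
    return AUDIO_OFFSET <= tok < AUDIO_OFFSET + AUDIO_VOCAB

def split_streams(sequence):
    """Route a mixed token sequence back into its text and audio streams:
    text IDs go to a detokenizer, codec IDs go to a vocoder for synthesis."""
    text, audio = [], []
    for tok in sequence:
        if is_audio_token(tok):
            audio.append(tok - AUDIO_OFFSET)  # undo the offset before codec decoding
        else:
            text.append(tok)
    return text, audio
```

Because both modalities live in one autoregressive stream, paralinguistic cues in the audio context can directly condition the next audio token, which is the mechanism the carrier claim rests on.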
If this is right
- Direct modeling of paralinguistic information reduces the need for separate emotion or style modules in conversational agents
- Tool calling for web search and audio retrieval measurably lowers hallucination rates in spoken responses
- End-to-end discrete token output supports lower-latency turn-taking in multi-turn dialogue
- Scaling to millions of hours of training data yields consistent gains across diverse conversational domains
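The tool-calling prediction above can be pictured as a small dispatch loop: the model emits a structured call, the runtime executes the matching tool, and the result is fed back as context. The tool names, stub bodies, and JSON call shape here are assumptions for illustration, not the model's actual interface.

```python
import json

def web_search(query: str) -> str:
    # Stub: a real system would query a search backend to ground the answer.
    return f"results for: {query}"

def audio_search(timbre: str) -> str:
    # Stub: a real system would retrieve a reference voice clip for timbre switching.
    return f"voice sample tagged: {timbre}"

# Registry mapping tool names (assumed, not from the paper) to implementations.
TOOLS = {"web_search": web_search, "audio_search": audio_search}

def dispatch(tool_call_json: str) -> str:
    """Parse a model-emitted tool call and run the matching tool."""
    call = json.loads(tool_call_json)
    fn = TOOLS.get(call["name"])
    if fn is None:
        raise ValueError(f"unknown tool: {call['name']}")
    return fn(**call["arguments"])
```

Measuring the hallucination-rate claim would then amount to comparing spoken answers generated with and without the dispatch step on knowledge-heavy queries.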
Where Pith is reading between the lines
- The same token-generation approach could be applied to video or sensor streams to create unified multi-modal conversational systems
- External tool integration may allow future models to maintain up-to-date knowledge without full retraining
- If the RL component generalizes well, similar reasoning-centric training could improve robustness in low-resource languages or noisy environments
Load-bearing premise
The combination of latent encoding, RL reasoning, discrete tokens, and RAG produces robust performance on real-world conversational audio beyond the evaluated benchmarks.
What would settle it
A new test set of long-form conversational audio with varied emotions and accents where Step-Audio 2 shows no accuracy or naturalness advantage over strong baseline models.
Original abstract
This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Step-Audio 2, an end-to-end multi-modal LLM for audio understanding and conversational speech. It combines a latent audio encoder, reasoning-centric reinforcement learning, discrete audio token generation to capture paralinguistic cues, and RAG with external tool calling (web search, audio search) to reduce hallucinations. Trained on millions of hours of speech and audio data, the model claims state-of-the-art results on various audio understanding and conversational benchmarks relative to open-source and commercial baselines.
Significance. If the performance claims are substantiated with complete benchmark tables, baselines, and ablations, the work would advance practical end-to-end audio LLMs by showing how RL-driven reasoning and RAG can be integrated with discrete token modeling for expressive, low-hallucination conversation. The industry-oriented framing and emphasis on real-world tool use are strengths.
Major comments (2)
- [Evaluation] The SOTA claim is presented without named benchmarks (e.g., LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. This omission makes the central empirical result, which is load-bearing for the paper's contribution, unverifiable.
- [§3 (Architecture) and §4 (Training)] The claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.
Minor comments (2)
- [Abstract] Phrases such as 'promising performance' and 'state-of-the-art' are used without accompanying metrics or qualifiers.
- [Conclusion] The GitHub link is provided but no details on released code, checkpoints, or evaluation scripts are given in the text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and verifiability of the results.
Point-by-point responses
- Referee: [Evaluation] The SOTA claim is presented without named benchmarks (e.g., LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. This omission makes the central empirical result, which is load-bearing for the paper's contribution, unverifiable.
Authors: We acknowledge that the current evaluation section provides only high-level SOTA claims without the requested specifics. In the revised manuscript we will expand this section to explicitly name the benchmarks (including LibriSpeech, CommonVoice, and conversational test sets), list exact baseline versions, define all metrics, specify data splits, report error bars where available, and include ablation results isolating the RL and RAG components. Revision: yes
- Referee: [§3 (Architecture) and §4 (Training)] The claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.
Authors: We agree that the claim would be stronger with direct quantitative support. The revised version will add comparisons to continuous-token and non-RL baselines, reporting relevant metrics that demonstrate the contribution of discrete token generation to paralinguistic responsiveness. Revision: yes
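The error bars promised in the rebuttal could be produced with a percentile bootstrap over utterances, which is a standard way to attach a confidence interval to corpus-level WER. The metric implementation and resampling scheme below are my assumptions about how such reporting might be done, not the authors' evaluation code.

```python
import random

def edit_distance(ref, hyp):
    """Word-level Levenshtein distance between two token lists."""
    m, n = len(ref), len(hyp)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i
    for j in range(n + 1):
        d[0][j] = j
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,      # deletion
                          d[i][j - 1] + 1,      # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def corpus_wer(pairs):
    """Corpus-level WER over (reference, hypothesis) string pairs."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words

def bootstrap_ci(pairs, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for corpus WER, resampling utterances."""
    rng = random.Random(seed)
    stats = sorted(
        corpus_wer([rng.choice(pairs) for _ in pairs])
        for _ in range(n_resamples)
    )
    lo = stats[int(alpha / 2 * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

Reporting the interval alongside the point estimate would let readers judge whether claimed SOTA margins exceed sampling noise.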
Circularity Check
No significant circularity detected
Full rationale
The paper is an empirical technical report on model architecture, training data, and benchmark results with no mathematical derivations, equations, or self-referential definitions present. Performance claims reference external benchmarks and datasets rather than quantities defined or fitted inside the paper. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claims reduce to standard empirical evaluation and are therefore self-contained.
Forward citations
Cited by 20 Pith papers
- HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
  HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.
- How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
  Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
- SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
  SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
- HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
  HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
- CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
  CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
- Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
  Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
- TiCo: Time-Controllable Spoken Dialogue Model
  TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.
- Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
  A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
- Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
  Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.
- VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
  VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
- VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech
  VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.
- Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
  Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
- Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization
  A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...
- Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
  A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
- Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
  A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
- Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
  Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
- Step-Audio-R1.5 Technical Report
  Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.
- NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR
  NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.
- OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization
  OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.