StepAudio 2.5 Technical Report

Bin Lin; Boyong Wu; Bo Zhao; Brian Li; Changlin Zhang; Chang Zeng; Chao Yan; Chen Geng; Chenghao Dong; Chengli Feng

arxiv: 2605.23463 · v1 · pith:52L7GHSYnew · submitted 2026-05-22 · 📡 eess.AS

Bin Lin, Bo Zhao, Boyong Wu, Chao Yan, Chen Wu, Cheng Yi, Chengyuan Yao, Daijiao Liu, Fei Tian, Feng Tian, Haiyang Sun, Haoyang Zhang, Jiangjie Zhen, Jinglan Gong, Jun Chen, Li Xie, Peilin Li, Peng Yang, Pengfei Tan, Qingjian Lin, Runze Li, Shenghua Hu, Siyi Zhou, Wenwen Qu, Xiangyu Li, Xiangyu Tony Zhang, Xuerui Yang, Yang Yang, Yechang Huang, Yu Fu, Yuchu Luo, Yuxin Li, Yuxin Zhang, Zhengyan Sheng, Brian Li, Chang Zeng, Changlin Zhang, Chen Geng, Chenghao Dong, Chengli Feng, Dan Zhou, Danni Wan, Di Chen, Die Zhang, Dongqing Pang, Guanglong Yang, Guoqiang Hu, Huangxi Zhu, Jianzheng Gao, Jinghua Liang, Jinmei Wan, Junjie Yuan, Kang An, Lei Lei, Limin Zhong, Lun Cai, Mengqiang Ren, Min Xu, Mingliang Li, Mingxiao Li, Na Wang, Qiang Tong, Qiaoling Huang, Qingfu Du, Rui Wang, Shengchen Zhou, Shi Qiu, Shihao Peng, Shiliang Yang, Siqi Tu, Tianjiao Deng, Ting Xu, Tong Wang, WeiMing Niu, Wuxun Xie, Xianwei Zhang, Xianyu Feng, Xiaojia Liu, Xing Chen, Xiongbin Wu, Yan Wu, Yang Li, Yi Liu, Yifan Zhang, Yile Liu, Yongshen Long, Yu Luo, Yuanhao Ding, Yuhao Wang, Yuhe Yin, Yunfang Xu, Yuxiang Yang, Zhiguo Huang, Zhiyue Wu, Zichao Li, Zichao Zhou, Daxin Jiang, Future Li, Gang Yu, Xiangyu Zhang, Yibo Zhu

Pith reviewed 2026-05-25 02:50 UTC · model grok-4.3

keywords: unified audio-language model · automatic speech recognition · text-to-speech synthesis · realtime spoken interaction · reinforcement learning from human feedback · multimodal foundation model

The pith

A single audio-language model matches specialized systems at speech recognition, synthesis, and realtime dialogue.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces StepAudio 2.5 as a unified foundation that reaches state-of-the-art results in automatic speech recognition, text-to-speech synthesis, and realtime spoken interaction. It claims that once text and audio occupy a shared multimodal space, the three tasks no longer require separate architectures but can instead be handled through different choices of training data, optimization targets, and decoding rules. The work shifts from ordinary supervised training to task-specific reinforcement learning from human feedback as the main way to set those targets. A sympathetic reader would care because the result points toward fewer models needed to cover the full range of audio capabilities in applications such as voice assistants and live conversation systems.

Core claim

StepAudio 2.5 shows that a shared audio-language backbone can internalize the distinct deployment objectives of speech understanding, generation, and live interaction by advancing post-training to task-tailored RLHF together with specialized decoding, thereby matching or exceeding the performance of systems built separately for ASR, TTS, and realtime dialogue.

What carries the argument

Task-tailored Reinforcement Learning from Human Feedback applied after text and audio share a multimodal representational space, used to set distinct optimization targets and decoding constraints for each operational mode.

If this is right

ASR mode improves transcription efficiency through verifiable multi-token decoding.
TTS mode produces controllable and expressive output via preference-based RLHF and context-rich supervision.
Realtime mode delivers low-latency, persona-consistent dialogue through generative reward modeling inside the RLHF framework.
The single backbone achieves state-of-the-art numbers across all three tasks on standard benchmarks.
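
The TTS bullet above refers to preference-based RLHF without giving the objective. As a minimal sketch, and assuming a standard Bradley-Terry pairwise formulation (an assumption for illustration, not something this page confirms about StepAudio 2.5), the loss on one preference pair could look like this. The function name and the scalar reward inputs are hypothetical.

```python
import math


def preference_loss(reward_chosen: float, reward_rejected: float) -> float:
    """Bradley-Terry negative log-likelihood for one preference pair.

    Computes -log(sigmoid(reward_chosen - reward_rejected)).
    Assumption: rewards are scalar scores from some reward model.
    """
    margin = reward_chosen - reward_rejected
    # -log(sigmoid(m)) == log(1 + exp(-m)); branch to avoid overflow in exp().
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # The loss shrinks as the preferred sample is scored further above the other.
    for chosen, rejected in [(0.0, 0.0), (1.0, 0.0), (3.0, 0.0), (0.0, 3.0)]:
        print(chosen, rejected, round(preference_loss(chosen, rejected), 4))
```

The loss is symmetric in the sense that swapping the two rewards flips which sample the model is penalized for preferring; equal rewards give a loss of log 2.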

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the premise holds, developers could maintain one model instead of three separate pipelines for audio tasks.
The same operational-regime approach might allow additional audio capabilities to be added without redesigning the core architecture.
Consistent persona across understanding and generation modes could simplify building reliable conversational agents.

Load-bearing premise

Once text and audio share a multimodal representational space, task specialization reduces to choices in data construction, optimization targets, and decoding constraints.

What would settle it

Head-to-head evaluation on a standard benchmark in which StepAudio 2.5 fails to match or exceed the best specialized system in at least one of ASR, TTS, or realtime interaction.

Figures

Figures reproduced from arXiv: 2605.23463 by Bin Lin, Boyong Wu, Bo Zhao, Brian Li, Changlin Zhang, Chang Zeng, Chao Yan, Chen Geng, Chenghao Dong, Chengli Feng, Cheng Yi, Chengyuan Yao, Chen Wu, Daijiao Liu, Danni Wan, Dan Zhou, Daxin Jiang, Di Chen, Die Zhang, Dongqing Pang, Fei Tian, Feng Tian, Future Li, Gang Yu, Guanglong Yang, Guoqiang Hu, Haiyang Sun, Haoyang Zhang, Huangxi Zhu, Jiangjie Zhen, Jianzheng Gao, Jinghua Liang, Jinglan Gong, Jinmei Wan, Jun Chen, Junjie Yuan, Kang An, Lei Lei, Limin Zhong, Li Xie, Lun Cai, Mengqiang Ren, Mingliang Li, Mingxiao Li, Min Xu, Na Wang, Peilin Li, Pengfei Tan, Peng Yang, Qiang Tong, Qiaoling Huang, Qingfu Du, Qingjian Lin, Rui Wang, Runze Li, Shengchen Zhou, Shenghua Hu, Shihao Peng, Shiliang Yang, Shi Qiu, Siqi Tu, Siyi Zhou, Tianjiao Deng, Ting Xu, Tong Wang, WeiMing Niu, Wenwen Qu, Wuxun Xie, Xiangyu Li, Xiangyu Tony Zhang, Xiangyu Zhang, Xianwei Zhang, Xianyu Feng, Xiaojia Liu, Xing Chen, Xiongbin Wu, Xuerui Yang, Yang Li, Yang Yang, Yan Wu, Yechang Huang, Yibo Zhu, Yifan Zhang, Yile Liu, Yi Liu, Yongshen Long, Yuanhao Ding, Yuchu Luo, Yu Fu, Yuhao Wang, Yuhe Yin, Yu Luo, Yunfang Xu, Yuxiang Yang, Yuxin Li, Yuxin Zhang, Zhengyan Sheng, Zhiguo Huang, Zhiyue Wu, Zichao Li, Zichao Zhou.

**Figure 1.** A unified view of the StepAudio 2.5 model family. The shared audio-language stack provides the common architectural basis used to organize ASR, TTS, and Realtime, while the three systems serve different deployment goals.

… prior plus a mechanism to route supervision through different output spaces and deployment regimes. Recognition, synthesis, and realtime dialogue then become three ways of querying the sam…

**Figure 2.** ASR architecture in StepAudio 2.5. The shared encoder-adaptor-decoder backbone is augmented with parallel future-token branches, making decoding substantially more efficient while preserving autoregressive verification.

… processed by a decoder-style Transformer block. All branches share the same embedding layer and vocabulary output head as the main decoder. 4.1 Training Pipeline. ASR SFT: Supervised fine-tun…
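
The caption above describes future-token branches whose guesses are checked by autoregressive verification. The following is a minimal sketch of that draft-and-verify idea under stated assumptions: `next_token` stands in for the main decoder and `draft` for the future-token branches, and both are hypothetical callables rather than StepAudio 2.5 APIs. Verification is done here with sequential calls for clarity; a real system would presumably check all drafted positions in one parallel pass.

```python
from typing import Callable, List, Sequence, Tuple


def verified_decode(
    next_token: Callable[[Sequence[int]], int],
    draft: Callable[[Sequence[int]], List[int]],
    prompt: Sequence[int],
    eos: int,
    max_new: int,
) -> Tuple[List[int], int]:
    """Greedy decoding with drafted tokens that are kept only if verified.

    Returns (generated tokens, number of draft-and-verify rounds).
    The output always equals plain greedy decoding with `next_token`.
    """
    out = list(prompt)
    produced = 0
    rounds = 0
    while produced < max_new:
        rounds += 1
        proposal = draft(out)  # hypothetical guesses for the next few tokens
        # Up to len(proposal) verified tokens plus one extra token per round.
        for i in range(len(proposal) + 1):
            tok = next_token(out)  # the verifier's own choice is what gets emitted
            out.append(tok)
            produced += 1
            if tok == eos or produced >= max_new:
                return out[len(prompt):], rounds
            if i == len(proposal) or proposal[i] != tok:
                break  # guess rejected or proposal used up: start a new round
    return out[len(prompt):], rounds


if __name__ == "__main__":
    count_up = lambda seq: seq[-1] + 1                       # toy "main decoder"
    good_draft = lambda seq: [seq[-1] + 1, seq[-1] + 2]      # always right
    bad_draft = lambda seq: [0, 0]                           # always wrong
    print(verified_decode(count_up, good_draft, [0], eos=6, max_new=20))
    print(verified_decode(count_up, bad_draft, [0], eos=6, max_new=20))
```

Both toy runs produce the same token sequence; only the number of rounds differs, which is the quantity a multi-token scheme tries to reduce.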

**Figure 3.** Long-form ASR data construction pipeline. The process transitions from individual clip transcription to global session-level refinement to ensure both accuracy and consistency.

Both stages inherit the 32K sequence budget, a global batch size of 32, and a 10K-step training horizon. During training, the main branch predicts the next token $x_{t+1}$ at position $t$, while the $h$-th MTP branch targets the future token $x_{t+1+h}$…
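
The training excerpt above states which token each branch targets. A small sketch of that indexing follows, assuming branches are numbered from h = 1 and the main branch is treated as h = 0; the numbering, the function name, and the toy token ids are assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def mtp_targets(tokens: List[int], num_branches: int) -> Dict[int, List[Tuple[int, int]]]:
    """Map branch index h to (position t, target token) pairs.

    h = 0 is the main branch and targets x_{t+1};
    branch h >= 1 targets x_{t+1+h}. Positions without a target are skipped.
    """
    targets: Dict[int, List[Tuple[int, int]]] = {}
    for h in range(num_branches + 1):
        offset = 1 + h
        targets[h] = [(t, tokens[t + offset]) for t in range(len(tokens) - offset)]
    return targets


if __name__ == "__main__":
    toy_tokens = [10, 11, 12, 13, 14]  # made-up token ids
    for h, pairs in mtp_targets(toy_tokens, num_branches=2).items():
        print(f"branch {h}: {pairs}")
```

Each additional branch loses one trailing position, since the sequence ends before its target exists.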

**Figure 4.** Arena win rates of StepAudio-2.5-TTS.

Finally, we select three leading models with controllable generation capabilities: MiniMax-2.8-HD, Elevenlabs-v3, and Gemini-3.1-Flash-TTS. For each model, we adopt its officially recommended optimal voice preset and conduct arena-based evaluation using 774 prompts. The results in …
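
An arena evaluation over a fixed prompt set yields a win rate whose uncertainty depends on the number of comparisons. As a rough sketch, a Wilson score interval can be attached to such a win rate. The tally below is hypothetical and is not taken from the paper; the sketch also assumes every comparison is decisive, so ties would need a separate policy.

```python
import math
from typing import Tuple


def wilson_interval(wins: int, trials: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 is about 95%)."""
    if trials <= 0 or not 0 <= wins <= trials:
        raise ValueError("need trials > 0 and 0 <= wins <= trials")
    p = wins / trials
    denom = 1.0 + z * z / trials
    center = (p + z * z / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z * z / (4 * trials * trials)) / denom
    return center - half, center + half


if __name__ == "__main__":
    # Hypothetical tally over 774 prompts (NOT the paper's results).
    wins, trials = 450, 774
    low, high = wilson_interval(wins, trials)
    print(f"win rate {wins / trials:.3f}, interval [{low:.3f}, {high:.3f}]")
```

With a few hundred prompts the interval spans several percentage points, which is worth keeping in mind when two systems have close win rates.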

**Figure 5.** Realtime interaction evaluation. Higher is better. Best results are in bold.

Results Analysis: As shown in …

Original abstract

Unified audio-language modeling has emerged as a prominent trend in modern speech systems, promising to bring the reasoning capabilities of large language models to auditory tasks. However, existing unified foundations often struggle to match the depth of specialized systems across automatic speech recognition (ASR), text-to-speech synthesis (TTS), and realtime spoken interaction. Bridging this gap remains an open challenge. This report presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across all three capabilities. Rather than treating these tasks as architecturally distinct, we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes: data construction, optimization targets, and decoding constraints. Guided by this insight, we advance the post-training paradigm from standard supervised learning to task-tailored Reinforcement Learning from Human Feedback (RLHF), using it as the primary mechanism to define complex optimization targets. We leverage this RLHF-centric alignment, alongside specialized decoding, to shape a shared backbone into three distinct operational modes. Concretely, the ASR branch advances transcription efficiency via verifiable multi-token decoding; the TTS branch achieves controllable, expressive synthesis through preference-based RLHF and context-rich supervision; and the Realtime branch realizes low-latency, persona-consistent dialogue via generative reward modeling within an RLHF framework. On standard benchmarks, StepAudio 2.5 achieves state-of-the-art results across ASR, TTS, and Realtime, demonstrating that a singular audio-language foundation can successfully internalize the distinct deployment objectives of speech understanding, generation, and live interaction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

StepAudio 2.5 report describes a unified model using task-specific RLHF for ASR, TTS, and realtime but gives no benchmark numbers to support the SOTA claims.

The punchline on StepAudio 2.5 is that it's a technical report showing a unified audio-language model can hit SOTA on ASR, TTS, and realtime interaction through task-specific RLHF rather than separate models. What comes across as new is the concrete RLHF setups for each branch: multi-token verifiable decoding for ASR, preference-based RLHF with rich context for TTS, and generative reward modeling for low-latency dialogue. The core idea that specialization happens via data, targets, and decoding once the representation space is shared is laid out plainly. The report does well at describing how they moved from supervised learning to this RLHF-centric post-training. It gives a coherent picture of how one backbone can be shaped into three modes without architectural splits.

The soft spots are that the abstract announces SOTA results but supplies no actual scores, comparisons, or dataset info, so the performance claims can't be checked from what's here. There's also no discussion of how much the gains depend on the particular human feedback data. If the full paper has those, they'd need close scrutiny for independence and reproducibility.

This is the kind of paper that would interest groups working on consolidating speech pipelines or applying RLHF to audio tasks. A reader looking for practical examples of unified models would get value from the approach description. It deserves a serious referee to go over the benchmark results and the RLHF implementations in detail. Recommendation: Yes, send it for peer review.

Referee Report

1 major / 1 minor

Summary. The manuscript presents StepAudio 2.5, a unified audio-language foundation model that matches or exceeds specialized systems across ASR, TTS, and realtime spoken interaction. It operates on the premise that a shared multimodal representational space allows task specialization via operational regimes (data construction, optimization targets, and decoding constraints), with RLHF as the primary post-training mechanism: verifiable multi-token decoding for ASR, preference-based RLHF for TTS, and generative reward modeling for realtime.

Significance. If the SOTA claims are substantiated with detailed, reproducible benchmarks including error bars, dataset specifications, and direct comparisons to specialized baselines, the work would be significant for showing that a single backbone can internalize distinct deployment objectives through RLHF-centric alignment rather than separate architectures.

major comments (1)

[Abstract] Abstract: the central claim that StepAudio 2.5 'achieves state-of-the-art results across ASR, TTS, and Realtime' is presented without any quantitative metrics (e.g., WER, MOS, latency figures), error bars, dataset details, or comparison tables. This directly undermines verification of the performance claim that is load-bearing for the entire contribution.
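
For reference, the WER metric named in this comment is the word-level edit distance between hypothesis and reference, divided by the reference length. A minimal sketch follows; it assumes whitespace tokenization and no text normalization, both of which real benchmarks usually specify separately, and the example strings are made up.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # Row-by-row Levenshtein distance over words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(
                prev[j] + 1,                           # deletion
                cur[j - 1] + 1,                        # insertion
                prev[j - 1] + (ref_word != hyp_word),  # substitution or match
            ))
        prev = cur
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(wer("the cat sat on the mat", "the cat sat on a mat"))
```

Because insertions count against the hypothesis, WER can exceed 1.0 when the hypothesis is much longer than the reference.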

minor comments (1)

[Abstract] Abstract, paragraph 3: the phrase 'standard benchmarks' is used without naming the specific datasets or metrics, reducing clarity on how the SOTA comparisons were performed.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on the abstract. We address the single major comment below and will revise accordingly to strengthen verifiability of the claims.

Referee: [Abstract] Abstract: the central claim that StepAudio 2.5 'achieves state-of-the-art results across ASR, TTS, and Realtime' is presented without any quantitative metrics (e.g., WER, MOS, latency figures), error bars, dataset details, or comparison tables. This directly undermines verification of the performance claim that is load-bearing for the entire contribution.

Authors: We agree that the abstract would benefit from explicit quantitative support to allow immediate assessment of the SOTA claims. The full manuscript contains detailed benchmark tables, dataset specifications, and direct comparisons in the experimental sections, but the abstract relies on a summary statement. In the revised version we will update the abstract to include representative metrics (e.g., WER on LibriSpeech, MOS on standard TTS test sets, and end-to-end latency for realtime), along with brief references to baselines and error bars where reported. This change improves transparency without altering the technical narrative or results.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper states its central premise explicitly as an operating assumption ('we operate on the premise that once text and audio share a multimodal representational space, task specialization becomes a matter of operational regimes') and then describes the application of RLHF and specialized decoding to produce three modes. No equations, fitted parameters renamed as predictions, or self-citation chains are present in the provided text that would reduce the claimed results to the inputs by construction. The SOTA claims rest on benchmark outcomes rather than any definitional equivalence or load-bearing self-reference. This is the normal case of a self-contained empirical report.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The report relies on the untested premise that a shared multimodal space plus task-specific RLHF is sufficient to match specialized systems; no independent evidence for this premise is supplied in the abstract.

axioms (1)

Domain assumption: Once text and audio share a multimodal representational space, task specialization reduces to data construction, optimization targets, and decoding constraints.
Stated explicitly in the abstract as the guiding insight.

pith-pipeline@v0.9.0 · 6189 in / 1212 out tokens · 13214 ms · 2026-05-25T02:50:13.420355+00:00 · methodology


[1] [1]

Connectionist temporal classification

Alex Graves. Connectionist temporal classification. InSupervised sequence labelling with recurrent neural networks, pages 61–93. Springer, 2012

work page 2012

[2] [2]

Sequence Transduction with Recurrent Neural Networks

Alex Graves. Sequence transduction with recurrent neural networks.arXiv preprint arXiv:1211.3711, 2012

work page internal anchor Pith review Pith/arXiv arXiv 2012

[3] [3]

Listen, Attend and Spell

William Chan, Navdeep Jaitly, Quoc V Le, and Oriol Vinyals. Listen, attend and spell.arXiv preprint arXiv:1508.01211, 2015. 16 StepFun-Audio Team

work page internal anchor Pith review Pith/arXiv arXiv 2015

[4] [4]

Robust speech recognition via large-scale weak supervision

Alec Radford, Jong Wook Kim, Tao Xu, Greg Brockman, Christine McLeavey, and Ilya Sutskever. Robust speech recognition via large-scale weak supervision. pages 28492–28518, 2023

work page 2023

[5] [5]

VIBEVOICE-ASR technical report.arXiv preprint arXiv:2601.18184, 2026

Zhiliang Peng, Jianwei Yu, Yaoyao Chang, Zilong Wang, Li Dong, Yingbo Hao, et al. VIBEVOICE-ASR technical report.arXiv preprint arXiv:2601.18184, 2026

work page arXiv 2026

[6] [6]

Fun-ASR technical report.arXiv preprint arXiv:2509.12508, 2025

Keyu An, Yanni Chen, Zhigao Chen, Chong Deng, Zhihao Du, Changfeng Gao, et al. Fun-ASR technical report.arXiv preprint arXiv:2509.12508, 2025

work page arXiv 2025

[7] [7]

Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition.arXiv preprint arXiv:2407.04675, 2024

Ye Bai, Jingping Chen, Jitong Chen, Wei Chen, Zhuo Chen, Chuang Ding, Linhao Dong, Qianqian Dong, Yujiao Du, Kepan Gao, et al. Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition.arXiv preprint arXiv:2407.04675, 2024

work page arXiv 2024

[8] [8]

Qwen3-ASR Technical Report

Xian Shi, Xiong Wang, Zhifang Guo, Yongqi Wang, Pei Zhang, Xinyu Zhang, et al. Qwen3- ASR technical report.arXiv preprint arXiv:2601.21337, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[9] [9]

Step-Audio 2 Technical Report

Boyong Wu, Chao Yan, Chen Hu, Cheng Yi, Chengli Feng, Fei Tian, Feiyu Shen, Gang Yu, Haoyang Zhang, Jingbei Li, et al. StepAudio 2 technical report.arXiv preprint arXiv:2507.16632, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[10] Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-Omni technical report. arXiv preprint arXiv:2509.17765, 2025

[11] Che Liu, Lichao Ma, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Xuerui Yang, and Fei Tian. Boosting omni-modal language models: Staged post-training with visually debiased evaluation, 2026. URL https://arxiv.org/abs/2605.12034

[12] Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. In International Conference on Learning Representations, volume 2024, pages 16607–16629, 2024

[13] Zalán Borsos, Raphaël Marinier, Damien Vincent, Eugene Kharitonov, Olivier Pietquin, Matt Sharifi, Dominik Roblek, Olivier Teboul, David Grangier, Marco Tagliasacchi, et al. Audiolm: a language modeling approach to audio generation. IEEE/ACM transactions on audio, speech, and language processing, 31:2523–2533, 2023

[14] Wenqian Cui, Dianzhi Yu, Xiaoqi Jiao, Ziqiao Meng, Guangyan Zhang, Qichao Wang, Steven Y Guo, and Irwin King. Recent advances in speech language models: A survey. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13943–13970, 2025

[15] Heeseung Kim, Soonshin Seo, Kyeongseok Jeong, Ohsung Kwon, Soyoon Kim, Jungwhan Kim, Jaehong Lee, Eunwoo Song, Myungwoo Oh, Jung-Woo Ha, et al. Paralinguistics-aware speech-empowered large language models for natural conversation. Advances in Neural Information Processing Systems, 37:131072–131103, 2024

[16] Xiong Wang, Yangze Li, Chaoyou Fu, Yike Zhang, Yunhang Shen, Lei Xie, Ke Li, Xing Sun, and Long MA. Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen LLM. In Forty-second International Conference on Machine Learning, 2025. URL https://openreview.net/forum?id=s1EImzs5Id

[17] Yuxin Li, Xiangyu Zhang, Yifei Li, Zhiwei Guo, Haoyang Zhang, Eng Siong Chng, and Cuntai Guan. Depflow: Disentangled speech generation to mitigate semantic bias in depression detection. arXiv preprint arXiv:2601.00303, 2026

[18] Yu Xuan, Xiangyu Zhang, Shuyue Stella Li, Zihan Shen, Xin Xie, Leibny Paola Garcia, and Roberto Togneri. A new approach to extract fetal electrocardiogram using affine combination of adaptive filters. In ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023

[19] Yayue Deng, Guoqiang Hu, Haiyang Sun, Xiangyu Zhang, Haoyang Zhang, Fei Tian, Xuerui Yang, Gang Yu, and Eng Siong Chng. Multi-bench: A multi-turn interactive benchmark for assessing emotional intelligence ability of spoken dialogue models. arXiv preprint arXiv:2511.00850, 2025

[20] Gemini Team, Rohan Anil, Sebastian Borgeaud, Jean-Baptiste Alayrac, Jiahui Yu, Radu Soricut, Johan Schalkwyk, Andrew M Dai, Anja Hauth, Katie Millican, et al. Gemini: a family of highly capable multimodal models. arXiv preprint arXiv:2312.11805, 2023

[22] Donghang Wu, Haoyang Zhang, Jun Chen, Hexin Liu, Eng Siong Chng, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, Gang Yu, et al. Mind-paced speaking: A dual-brain approach to real-time reasoning in spoken language models. arXiv preprint arXiv:2510.09592, 2025

[23] Donghang Wu, Haoyang Zhang, Chen Chen, Tianyu Zhang, Fei Tian, Xuerui Yang, Gang Yu, Hexin Liu, Nana Hou, Yuchen Hu, et al. Chronological thinking in full-duplex spoken dialogue language models. arXiv preprint arXiv:2510.05150, 2025

[24] Haoyang Zhang, Jun Chen, Donghang Wu, Yuxin Li, Yuxin Zhang, Xiangyu Tony Zhang, Che Liu, Qingjian Lin, Yizhou Peng, Hexin Liu, Eng Siong Chng, Chao Yan, Boyong Wu, Yechang Huang, Xuerui Yang, and Fei Tian. Duplexsla: A full-duplex spoken language model with synchronized speech, language, and action, 2026. URL https://arxiv.org/abs/2605.20755

[25] Xiangyu Zhang, Qiquan Zhang, Hexin Liu, Tianyi Xiao, Xinyuan Qian, Beena Ahmed, Eliathamby Ambikairajah, Haizhou Li, and Julien Epps. Mamba in speech: Towards an alternative to self-attention. IEEE Transactions on Audio, Speech and Language Processing, 2025

[26] Hexin Liu, Haoyang Zhang, Qiquan Zhang, Xiangyu Zhang, Dongyuan Shi, Eng Siong Chng, and Haizhou Li. Code-switching speech recognition under the lens: Model-and data-centric perspectives. IEEE Transactions on Audio, Speech and Language Processing, 2026

[27] Fei Tian, Xiangyu Tony Zhang, Yuxin Zhang, Haoyang Zhang, Yuxin Li, Daijiao Liu, Yayue Deng, Donghang Wu, Jun Chen, Liang Zhao, et al. Step-audio-r1 technical report. arXiv preprint arXiv:2511.15848, 2025

[28] Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu, Fei Tian, Yayue Deng, Jun Chen, Qingjian Lin, Haoyang Zhang, Yuxin Li, Jinglan Gong, et al. Step-audio-r1.5 technical report. arXiv preprint arXiv:2604.25719, 2026

[29] Daniel S. Park, William Chan, Yu Zhang, et al. SpecAugment: A simple data augmentation method for automatic speech recognition. In Interspeech 2019, pages 2613–2617, 2019

[30] J. G. Fiscus. A post-processing system to yield reduced word error rates: Recognizer output voting error reduction (ROVER). In 1997 IEEE Workshop on Automatic Speech Recognition and Understanding Proceedings, pages 347–354, 1997

[31] Hui Bu, Jiatong Du, Xingyu Na, Bengu Wu, and Hao Zheng. AIShell-1: An open-source mandarin speech corpus and a speech recognition baseline. In 20th Conference of the Oriental Chapter of the International Coordinating Committee on Speech Databases and Speech I/O Systems and Assessment, pages 1–5, 2017

[32] Jiatong Du, Xingyu Na, Xuechen Liu, and Hui Bu. AISHELL-2: Transforming mandarin ASR research into industrial scale. arXiv preprint arXiv:1808.10583, 2018

[33] Binbin Zhang, Hang Lv, Pengcheng Guo, Qijie Shao, Chao Yang, Lei Xie, Xin Xu, Hui Bu, Xiaoyu Chen, Chenchen Zeng, et al. WenetSpeech: A 10000+ hours multi-domain mandarin corpus for speech recognition. In ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 6182–6186, 2022

[34] Alexis Conneau, Min Ma, Simran Khanuja, et al. FLEURS: Few-shot learning evaluation of universal representations of speech. arXiv preprint arXiv:2205.12446, 2022

[35] Vassil Panayotov, Guoguo Chen, Daniel Povey, and Sanjeev Khudanpur. LibriSpeech: An ASR corpus based on public domain audio books. In 2015 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 5206–5210, 2015

[36] Rosana Ardila, Megan Branson, Kelly Davis, et al. Common voice: A massively-multilingual speech corpus. In Proceedings of the Twelfth Language Resources and Evaluation Conference, pages 4218–4222, 2020

[37] Artificial Analysis. Voxpopuli-cleaned-aa: Cleaned ground truth transcripts for voxpopuli english test set, 2026. URL https://artificialanalysis.ai/articles/aa-wer-v2

[38] Artificial Analysis. Earnings22-cleaned-aa: Cleaned ground truth transcripts for earnings22 english test set, 2026. URL https://artificialanalysis.ai/articles/aa-wer-v2

[39] Chao Yan, Boyong Wu, Peng Yang, Pengfei Tan, Guoqiang Hu, Yuxin Zhang, Xiangyu, Zhang, Fei Tian, Xuerui Yang, Xiangyu Zhang, Daxin Jiang, and Gang Yu. Step-audio-editx technical report, 2025. URL https://arxiv.org/abs/2511.03601

[40] John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms. arXiv preprint arXiv:1707.06347, 2017