Step-Audio-R1.5 Technical Report

Chengyuan Yao; Daijiao Liu; Daxin Jiang; Eng Siong Chng; Fei Tian; Gang Yu; Haoyang Zhang; Hexin Liu; Jinglan Gong; Jun Chen

arxiv: 2604.25719 · v2 · pith:PHKD6RXN · submitted 2026-04-28 · 📡 eess.AS

Yuxin Zhang, Xiangyu Tony Zhang, Daijiao Liu, Fei Tian, Yayue Deng, Jun Chen, Qingjian Lin, Haoyang Zhang, Yuxin Li, Jinglan Gong, Yechang Huang, Liang Zhao, Chengyuan Yao, Hexin Liu, Eng Siong Chng, Xuerui Yang, Gang Yu, Xiangyu Zhang, Daxin Jiang

Pith reviewed 2026-05-07 14:06 UTC · model grok-4.3

classification 📡 eess.AS

keywords audio language models · reinforcement learning with human feedback · reinforcement learning with verified rewards · spoken dialogue systems · prosodic naturalness · chain-of-thought reasoning · immersive interaction

The pith

Reinforcement learning from human feedback keeps audio reasoning strong while restoring natural spoken dialogue qualities lost under verified rewards.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies a verifiable reward trap in training audio language models for chain-of-thought reasoning. Using reinforcement learning with verified rewards turns continuous audio into discrete answers that score well on tests but sound mechanical and reduce immersion in long conversations. Step-Audio-R1.5 applies reinforcement learning from human feedback instead. This preserves analytical reasoning while improving prosodic naturalness, emotional continuity, and user engagement. Sympathetic readers would care because practical voice AI depends on feeling conversational rather than like an answering machine.

Core claim

The paper's central claim is that optimizing audio models with RLVR for verifiable text labels creates a verifiable reward trap, which systematically degrades prosodic naturalness, emotional continuity, and immersion in long-turn dialogues. Step-Audio-R1.5, trained with RLHF instead, is reported to maintain robust analytical reasoning while making spoken dialogue markedly more immersive.

What carries the argument

The verifiable reward trap: the process by which RLVR reduces continuous auditory contexts to isolated verifiable text labels. Step-Audio-R1.5 mitigates it by using RLHF to align the model with human perceptions of natural audio interaction.

If this is right

Analytical reasoning capabilities remain robust on objective benchmarks.
Prosodic naturalness and emotional continuity are restored in extended spoken interactions.
User immersion increases in long-turn dialogues without mechanical responses.
Audio models can handle complex tasks while maintaining a natural conversational feel.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Evaluations of future audio models should include subjective measures of immersion alongside benchmark scores.
This approach may apply to other continuous media where discrete rewards risk losing nuance.
Real-world deployment of voice AI could benefit from integrating human feedback to sustain engagement over multiple turns.

Load-bearing premise

That training with verified rewards systematically degrades the natural qualities of audio output and that human feedback training can recover them without compromising reasoning performance.

What would settle it

A controlled user study comparing immersion and naturalness ratings for long dialogues generated by Step-Audio-R1.5 against an equivalent RLVR model, where equivalent or lower ratings for the RLHF model would falsify the improvement claim.
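One way such a study could be scored is sketched below. This is an illustrative analysis, not a protocol from the paper. It assumes each long dialogue yields one listener rating for the RLHF model and one for a matched RLVR model, and it applies a one-sided paired sign-flip permutation test. The function names, the resample count, and the 0.05 threshold are all hypothetical choices.

```python
import random
from statistics import mean


def paired_permutation_test(rlhf_ratings, rlvr_ratings, n_resamples=10_000, seed=0):
    """One-sided paired sign-flip permutation test (illustrative sketch).

    Returns (mean_difference, p_value) for the alternative hypothesis that
    RLHF ratings exceed RLVR ratings on the same dialogues.
    """
    if len(rlhf_ratings) != len(rlvr_ratings):
        raise ValueError("ratings must be paired per dialogue")
    # Per-dialogue rating differences (RLHF minus RLVR).
    diffs = [a - b for a, b in zip(rlhf_ratings, rlvr_ratings)]
    observed = mean(diffs)
    rng = random.Random(seed)
    at_least_as_large = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if mean(flipped) >= observed:
            at_least_as_large += 1
    # Add-one correction keeps the estimated p-value strictly positive.
    p_value = (at_least_as_large + 1) / (n_resamples + 1)
    return observed, p_value


def improvement_claim_survives(rlhf_ratings, rlvr_ratings, alpha=0.05):
    """True only if RLHF ratings are higher and the gap is unlikely under the null.

    alpha is an assumed significance threshold, not one taken from the paper.
    """
    observed, p_value = paired_permutation_test(rlhf_ratings, rlvr_ratings)
    return observed > 0 and p_value < alpha
```

With identical rating lists for both models the mean difference is zero, so `improvement_claim_survives` returns `False`, which matches the falsification condition stated above (equivalent or lower ratings for the RLHF model).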

Figures

Figures reproduced from arXiv: 2604.25719 by Chengyuan Yao, Daijiao Liu, Daxin Jiang, Eng Siong Chng, Fei Tian, Gang Yu, Haoyang Zhang, Hexin Liu, Jinglan Gong, Jun Chen, Liang Zhao, Qingjian Lin, Xiangyu Tony Zhang, Xiangyu Zhang, Xuerui Yang, Yayue Deng, Yechang Huang, Yuxin Li, Yuxin Zhang.

**Figure 1.** Aggregate Performance across Speech-to-Text Benchmarks. The average score represents the holistic capabilities of each model computed over 8 distinct reasoning and perception benchmarks, including Audio MultiChallenge, Big Bench Audio, MMSU, MMAU, Spoken MQA, Step-Caption, Step-DU, and Step-SPQA. Step-Audio-R1.5 substantially outperforms its predecessor and remains highly competitive with state-of-the-art…
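The aggregate in the caption can be read as a simple average over the eight named benchmarks. The sketch below assumes an unweighted mean with all scores on a common scale. The caption does not state the weighting or normalization, so both are assumptions, and `aggregate_score` is a hypothetical helper name.

```python
# Benchmark names as listed in the Figure 1 caption.
BENCHMARKS = (
    "Audio MultiChallenge",
    "Big Bench Audio",
    "MMSU",
    "MMAU",
    "Spoken MQA",
    "Step-Caption",
    "Step-DU",
    "Step-SPQA",
)


def aggregate_score(scores):
    """Unweighted mean over the eight benchmarks (assumed aggregation rule).

    `scores` maps benchmark name to a score on a shared scale. A missing
    benchmark raises an error instead of silently shrinking the average.
    """
    missing = [name for name in BENCHMARKS if name not in scores]
    if missing:
        raise ValueError(f"missing benchmark scores: {missing}")
    return sum(scores[name] for name in BENCHMARKS) / len(BENCHMARKS)
```

Requiring all eight entries keeps averages comparable across models. An average over whichever benchmarks happen to be reported would not be.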

Original abstract

Recent advancements in large audio language models have extended Chain-of-Thought (CoT) reasoning into the auditory domain, enabling models to tackle increasingly complex acoustic and spoken tasks. To elicit and sustain these extended reasoning chains, the prevailing paradigm -- driven by the success of text-based reasoning models -- overwhelmingly relies on Reinforcement Learning with Verified Rewards (RLVR). However, as models are strictly optimized to distill rich, continuous auditory contexts into isolated, verifiable text labels, a fundamental question arises: are we fostering true audio intelligence, or merely reducing a continuous sensory medium into a discrete puzzle? We identify this as the "verifiable reward trap." While RLVR yields remarkable scores on standardized objective benchmarks, it systematically degrades the real-world conversational feel of audio models. By prioritizing isolated correctness over acoustic nuance, RLVR reduces dynamic interactions to mechanical "answering machines," severely compromising prosodic naturalness, emotional continuity, and user immersion, particularly in long-turn dialogues. To bridge the gap between mechanical objective verification and genuine sensory empathy, we introduce Step-Audio-R1.5, marking a paradigm shift toward Reinforcement Learning from Human Feedback (RLHF) in audio reasoning. Comprehensive evaluations demonstrate that Step-Audio-R1.5 not only maintains robust analytical reasoning but profoundly transforms the interactive experience, redefining the boundaries of deeply immersive long-turn spoken dialogue.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that Reinforcement Learning with Verified Rewards (RLVR) creates a 'verifiable reward trap' in large audio language models by reducing continuous auditory contexts to discrete text labels, which yields strong objective benchmark scores but systematically degrades prosodic naturalness, emotional continuity, and immersion in long-turn spoken dialogues. It introduces Step-Audio-R1.5, which applies Reinforcement Learning from Human Feedback (RLHF) instead, asserting that this approach maintains robust analytical reasoning while profoundly improving interactive experience and redefining immersive audio dialogue.

Significance. If the central claims hold with proper controls, the work would be significant for audio-language model training by challenging the dominance of RLVR paradigms and demonstrating a viable RLHF alternative that preserves reasoning while enhancing naturalness and user immersion. This could influence future development of spoken dialogue systems, particularly for long-turn interactions.

major comments (3)

Abstract: The assertions that RLVR 'systematically degrades' prosodic naturalness, emotional continuity, and immersion, while RLHF 'profoundly transforms' the interactive experience without trade-offs, are presented without any quantitative results, baselines, evaluation protocols, metrics, or data excerpts. This leaves the central causal claim unsupported.
Abstract and overall manuscript: No direct RLVR baseline comparison is reported using an identically initialized model, same training data, and identical evaluation protocol. Without this control, the attribution of observed differences to the 'verifiable reward trap' versus other factors (e.g., reward model design, training dynamics, or data curation) cannot be isolated.
Abstract: The claim of 'comprehensive evaluations' demonstrating maintained analytical reasoning alongside improved immersion is stated but not accompanied by specific objective benchmarks, subjective scores, or comparative tables that would allow verification of the no-trade-off assertion.

minor comments (1)

Abstract: The term 'Step-Audio-R1.5' is introduced without a clear description of its architecture, base model, or training details in the provided text.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment point by point below, providing clarifications based on the full manuscript content and indicating revisions where the presentation can be strengthened without altering the core claims.

Referee: Abstract: The assertions that RLVR 'systematically degrades' prosodic naturalness, emotional continuity, and immersion, while RLHF 'profoundly transforms' the interactive experience without trade-offs, are presented without any quantitative results, baselines, evaluation protocols, metrics, or data excerpts. This leaves the central causal claim unsupported.

Authors: We agree that the abstract's concise format omits explicit quantitative details. The full manuscript (Sections 4 and 5) describes the evaluation protocols, including human-rated metrics for prosodic naturalness, emotional continuity, and immersion in long-turn dialogues, alongside objective reasoning benchmarks. We will revise the abstract to reference these protocols and include summary metrics to better ground the claims.
revision: yes

Referee: Abstract and overall manuscript: No direct RLVR baseline comparison is reported using an identically initialized model, same training data, and identical evaluation protocol. Without this control, the attribution of observed differences to the 'verifiable reward trap' versus other factors (e.g., reward model design, training dynamics, or data curation) cannot be isolated.

Authors: This is a fair critique of experimental isolation. Our reported comparisons use publicly documented RLVR models rather than a matched initialization and data regime, as retraining an identical RLVR control at this scale was not feasible within the technical report's scope. We will add an expanded limitations discussion detailing setup differences and potential confounds while preserving the observed patterns as evidence for the paradigm distinction.
revision: partial

Referee: Abstract: The claim of 'comprehensive evaluations' demonstrating maintained analytical reasoning alongside improved immersion is stated but not accompanied by specific objective benchmarks, subjective scores, or comparative tables that would allow verification of the no-trade-off assertion.

Authors: The full manuscript presents these evaluations in dedicated sections with objective benchmark tables for analytical reasoning tasks and subjective human evaluation scores for immersion and naturalness. We will revise the abstract to briefly cite key comparative results and direct readers to the relevant tables and figures for verification.
revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain.

The paper's argument identifies the verifiable reward trap as an empirical observation from RLVR training effects on audio models and proposes RLHF-based Step-Audio-R1.5 as an alternative that preserves reasoning while improving immersion. No equations, fitted parameters renamed as predictions, self-citations, or ansatzes are present in the provided abstract or described manuscript that reduce any claim to its inputs by construction. The chain consists of observational claims supported by comprehensive evaluations rather than self-definitional loops or load-bearing internal references. This is a standard empirical technical report with independent content from training and testing protocols.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract contains no mathematical derivations, fitted parameters, or postulated entities; it is a high-level engineering description of a model-training change.

pith-pipeline@v0.9.0 · 5603 in / 1102 out tokens · 67424 ms · 2026-05-07T14:06:38.216830+00:00 · methodology

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 7.0
Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation
cs.MM 2026-05 unverdicted novelty 6.0
Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

StepAudio 2.5 Technical Report
eess.AS 2026-05 unverdicted novelty 5.0
StepAudio 2.5 is a unified audio-language foundation model that reaches state-of-the-art results on ASR, TTS, and realtime interaction by using task-tailored RLHF on a shared backbone.

DuplexSLA: A Full-Duplex Spoken Language Model with Synchronized Speech, Language, and Action
eess.AS 2026-05 unverdicted novelty 5.0
DuplexSLA is a dual-stream three-channel full-duplex model that synchronizes continuous user audio, discrete assistant audio, and rate-limited action text for native turn-taking and in-conversation tool calling.