Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
Pith reviewed 2026-05-10 03:37 UTC · model grok-4.3
The pith
A progressive two-stage reinforcement learning approach enables high-quality chain-of-thought reasoning to emerge in audio language models without supervised fine-tuning.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central discovery is that instruction-tuned audio language models with no initial chain-of-thought capability can develop high-quality reasoning chains through a two-stage RL process. The first stage uses a hybrid reward, combining LLM-based logical evaluation with embedding similarity, on foundational audio QA tasks to build basic reasoning patterns; the second stage uses only LLM rewards on difficult boundary cases to promote diversity. The resulting models set new state-of-the-art scores on audio reasoning benchmarks.
What carries the argument
The hybrid reasoning similarity reward, which combines an LLM evaluator that checks logical path alignment, key-step coverage, and analytical depth with an embedding similarity measure that enforces alignment with reference reasoning chains.
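The review does not reproduce the reward formula, so the following is a minimal sketch of its shape only. The mixing weight `alpha` and the helper names `embed` and `llm_judge_score` are hypothetical; the paper's appendix names Qwen3-235B-A22B-Instruct as the LLM evaluator, but the exact weighting is not given here.

```python
# Minimal sketch of the hybrid reasoning-similarity reward as described above,
# not the authors' implementation. `alpha`, `embed`, and `llm_judge_score`
# are illustrative placeholders.
import numpy as np

def embed(text: str) -> np.ndarray:
    """Placeholder for a sentence-embedding encoder call."""
    raise NotImplementedError

def llm_judge_score(generated: str, reference: str) -> float:
    """Placeholder for an LLM evaluator returning 0.0-1.0 over logical path
    alignment, key-step coverage, and analytical depth."""
    raise NotImplementedError

def hybrid_reward(generated_cot: str, reference_cot: str, alpha: float = 0.5) -> float:
    # LLM-judged quality of the generated reasoning chain.
    judge = llm_judge_score(generated_cot, reference_cot)
    # Cosine similarity of the chain to the reference chain's embedding.
    g, r = embed(generated_cot), embed(reference_cot)
    cosine = float(g @ r / (np.linalg.norm(g) * np.linalg.norm(r) + 1e-8))
    # Stage 1 combines both terms; Stage 2 keeps only the judge (alpha = 1.0).
    return alpha * judge + (1.0 - alpha) * cosine
```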
If this is right
- Audio-DeepThinker reaches 74.0% on MMAR, 78.5% on MMAU-test-mini, and 77.26% on MMSU.
- It secures first place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track).
- RL training reshapes the gating mechanisms in upper layers of the mixture-of-experts architecture.
- Reasoning tokens form progressively in the upper transformer layers as training advances.
- Explicit chain-of-thought emerges from pure RL without any prior supervised reasoning training.
Where Pith is reading between the lines
- Similar progressive curricula might enable reasoning emergence in other sensory modalities such as vision or robotics where grounding is key.
- The observed changes in model internals suggest that future work could monitor layer activations to detect when reasoning capabilities stabilize.
- If the hybrid reward generalizes, it could be adapted to reduce dependence on large annotated datasets for training reasoning in language models overall.
- Testing the method on non-audio tasks would help determine whether the gains stem from the reward design or the specific challenges of audio data.
Load-bearing premise
The hybrid reward must supply an unbiased measure of reasoning quality that encourages genuine acoustic grounding instead of rewarding imitation of reference chains or superficial logical structure.
What would settle it
Running the same training but replacing the hybrid reward with a simple accuracy-based reward, then checking whether the resulting models still produce reasoning chains with comparable logical depth, semantic alignment, and benchmark performance.
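A hedged sketch of that control reward, assuming exact-match grading of the final answer; the function name and normalization are illustrative, not the paper's protocol.

```python
# Accuracy-only control reward: 1.0 for a correct final answer, 0.0 otherwise,
# with no supervision of the reasoning chain itself. Training with this in
# place of the hybrid reward isolates what the reasoning-quality terms add.
def accuracy_reward(predicted_answer: str, gold_answer: str) -> float:
    def norm(s: str) -> str:
        return " ".join(s.strip().lower().split())
    return 1.0 if norm(predicted_answer) == norm(gold_answer) else 0.0
```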
Original abstract
Large Audio-Language Models (LALMs) have made significant progress in audio understanding, yet they primarily operate as perception-and-answer systems without explicit reasoning processes. Existing methods for enhancing audio reasoning rely either on supervised chain-of-thought (CoT) fine-tuning, which is limited by training data quality, or on reinforcement learning (RL) with coarse rewards that do not directly evaluate reasoning quality. As a result, the generated reasoning chains often appear well-structured yet lack specific acoustic grounding. We propose Audio-DeepThinker, a framework built on two core ideas. First, we introduce a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator assessing logical path alignment, key step coverage, and analytical depth with an embedding similarity component enforcing semantic alignment with reference reasoning chains. Second, we propose a progressive two-stage curriculum that enables high-quality CoT reasoning to emerge through pure RL exploration, without any supervised reasoning fine-tuning, from an instruction-tuned model that possesses no prior chain-of-thought capability. Stage 1 trains on foundational audio QA with the hybrid reward to foster basic reasoning patterns, while Stage 2 shifts to acoustically challenging boundary cases with an LLM-only reward for greater reasoning diversity. Audio-DeepThinker achieves state-of-the-art results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), winning 1st Place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track). Interpretability analyses further reveal that RL training primarily reshapes upper-layer MoE gating mechanisms and that reasoning tokens crystallize progressively in the upper transformer layers, offering mechanistic insights into how audio reasoning emerges through exploration.
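The abstract's curriculum can be summarized as a schematic training loop. Everything below is an illustrative reading of the abstract, not the released code: `sample_batch`, `rollout`, `policy_gradient_step`, and the dataset handles are hypothetical, and the reward helpers are injected as callables.

```python
# Schematic of the progressive two-stage RL curriculum described in the
# abstract. All callables are hypothetical stand-ins for the authors' RL
# machinery, which the text shown here does not specify.
def train_two_stage(policy, foundational_qa, boundary_cases,
                    sample_batch, rollout, policy_gradient_step,
                    hybrid_reward, llm_judge_score, steps_per_stage=1000):
    # Stage 1: foundational audio QA with the hybrid reward (LLM judge +
    # embedding similarity) to foster basic reasoning patterns.
    for _ in range(steps_per_stage):
        batch = sample_batch(foundational_qa)
        chains = rollout(policy, batch)  # sampled CoT + answer per example
        rewards = [hybrid_reward(c.cot, c.reference_cot) for c in chains]
        policy_gradient_step(policy, chains, rewards)
    # Stage 2: acoustically challenging boundary cases with an LLM-only
    # reward for greater reasoning diversity (embedding term dropped).
    for _ in range(steps_per_stage):
        batch = sample_batch(boundary_cases)
        chains = rollout(policy, batch)
        rewards = [llm_judge_score(c.cot, c.reference_cot) for c in chains]
        policy_gradient_step(policy, chains, rewards)
    return policy
```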
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Audio-DeepThinker, a two-stage progressive RL framework for inducing high-quality chain-of-thought reasoning in audio language models starting from an instruction-tuned base without prior CoT capability. Stage 1 employs a hybrid reward (LLM evaluator for logical alignment, key-step coverage, and depth plus embedding similarity to reference reasoning chains) on foundational audio QA tasks; Stage 2 switches to an LLM-only reward on acoustically challenging cases. The paper reports SOTA results on MMAR (74.0%), MMAU-test-mini (78.5%), and MMSU (77.26%), first place in the Interspeech 2026 Audio Reasoning Challenge (Single Model Track), and mechanistic findings that RL reshapes upper-layer MoE gating while reasoning tokens crystallize progressively in upper transformer layers.
Significance. If substantiated, the work would advance audio-language modeling by showing that grounded CoT reasoning can emerge via RL exploration rather than supervised fine-tuning, reducing dependence on curated reasoning datasets. The two-stage curriculum design offers a practical way to bootstrap basic then diverse reasoning patterns, and the mechanistic analyses on gating and layer-wise token crystallization provide useful interpretability insights for future LALM architectures. The reported benchmark wins, if verified with proper controls, would demonstrate strong empirical utility.
major comments (3)
- [Abstract] The hybrid reasoning similarity reward combines an LLM evaluator with an embedding similarity term 'enforcing semantic alignment with reference reasoning chains.' This term supplies a direct supervised signal from reference material, which appears to contradict the central claim of emergence 'through pure RL exploration, without any supervised reasoning fine-tuning.' The manuscript must demonstrate (via divergence metrics, ablations removing the similarity component, or qualitative examples) that post-training CoT chains are not simply mimicking references while still achieving acoustic grounding.
- [Abstract] The SOTA performance claims (MMAR 74.0%, MMAU-test-mini 78.5%, MMSU 77.26%) and Interspeech 2026 first-place result are presented without baseline comparisons, ablation results, number of evaluation runs, or statistical tests. Because these numbers are load-bearing for the contribution, the paper requires tables or sections showing prior methods, the isolated effect of the hybrid reward versus LLM-only, and the curriculum stages.
- [Abstract, interpretability analyses] The statements that 'RL training primarily reshapes upper-layer MoE gating mechanisms' and that 'reasoning tokens crystallize progressively in the upper transformer layers' are offered without reference to specific figures, quantitative metrics (e.g., gating entropy changes, token activation thresholds), or experimental protocols. These mechanistic claims are central to explaining emergence and need detailed, reproducible exposition.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback, which highlights important areas for clarification and strengthening of our claims. We address each major comment point by point below, providing honest responses and committing to revisions where appropriate.
Point-by-point responses
-
Referee: [Abstract] The hybrid reasoning similarity reward combines an LLM evaluator with an embedding similarity term 'enforcing semantic alignment with reference reasoning chains.' This term supplies a direct supervised signal from reference material, which appears to contradict the central claim of emergence 'through pure RL exploration, without any supervised reasoning fine-tuning.' The manuscript must demonstrate (via divergence metrics, ablations removing the similarity component, or qualitative examples) that post-training CoT chains are not simply mimicking references while still achieving acoustic grounding.
Authors: We appreciate the referee highlighting this potential ambiguity. The phrase 'without any supervised reasoning fine-tuning' specifically denotes the lack of supervised fine-tuning (SFT) on chain-of-thought data; our method starts from an instruction-tuned model with no prior CoT capability and optimizes exclusively via RL policy gradients. The embedding similarity term forms part of the hybrid reward to encourage coherent reasoning patterns during exploration, but does not involve direct imitation or SFT. To substantiate that generated CoT chains achieve genuine acoustic grounding rather than reference mimicry, we will add ablations removing the similarity component, quantitative divergence metrics (e.g., embedding distances and semantic overlap scores) between generated and reference chains, and qualitative examples in the revised manuscript. revision: yes
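A minimal sketch of divergence metrics of the kind this response promises, assuming precomputed embeddings; the n-gram overlap heuristic is an illustrative choice, not the paper's metric.

```python
# Two cheap mimicry probes between generated and reference chains: embedding
# distance (semantic closeness) and word n-gram Jaccard overlap (verbatim
# copying). High overlap at low distance would suggest reference imitation.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-8))

def ngram_overlap(generated: str, reference: str, n: int = 3) -> float:
    def grams(text: str) -> set:
        toks = text.lower().split()
        return {tuple(toks[i:i + n]) for i in range(len(toks) - n + 1)}
    g, r = grams(generated), grams(reference)
    return len(g & r) / max(len(g | r), 1)

def divergence_report(gen_emb, ref_emb, generated: str, reference: str) -> dict:
    return {
        "embedding_distance": 1.0 - cosine(gen_emb, ref_emb),
        "ngram_overlap": ngram_overlap(generated, reference),
    }
```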
-
Referee: [Abstract] The SOTA performance claims (MMAR 74.0%, MMAU-test-mini 78.5%, MMSU 77.26%) and Interspeech 2026 first-place result are presented without baseline comparisons, ablation results, number of evaluation runs, or statistical tests. Because these numbers are load-bearing for the contribution, the paper requires tables or sections showing prior methods, the isolated effect of the hybrid reward versus LLM-only, and the curriculum stages.
Authors: We agree that the reported performance figures require supporting evidence and controls to substantiate the SOTA claims. In the revised version, we will add a comprehensive results table with comparisons to prior methods, ablation studies isolating the hybrid reward versus LLM-only reward and the effects of each curriculum stage, and report metrics averaged across multiple evaluation runs with appropriate statistical tests (e.g., standard deviations and significance levels). revision: yes
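A small sketch of the promised statistics, assuming per-example 0/1 correctness arrays from paired evaluations of two systems; the paired bootstrap is one reasonable test, not necessarily the authors' choice.

```python
# Mean/std over repeated evaluation runs plus a paired bootstrap test on
# per-example correctness; illustrative only.
import numpy as np

def summarize_runs(run_accuracies) -> tuple:
    a = np.asarray(run_accuracies, dtype=float)
    return a.mean(), a.std(ddof=1)

def paired_bootstrap_pvalue(ours, baseline, n_boot=10_000, seed=0) -> float:
    """Fraction of bootstrap resamples in which the mean paired difference
    (ours - baseline) is not positive; small values favor a real gain."""
    rng = np.random.default_rng(seed)
    diffs = np.asarray(ours, float) - np.asarray(baseline, float)
    idx = rng.integers(0, len(diffs), size=(n_boot, len(diffs)))
    return float((diffs[idx].mean(axis=1) <= 0).mean())
```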
-
Referee: [Abstract, interpretability analyses] The statements that 'RL training primarily reshapes upper-layer MoE gating mechanisms' and that 'reasoning tokens crystallize progressively in the upper transformer layers' are offered without reference to specific figures, quantitative metrics (e.g., gating entropy changes, token activation thresholds), or experimental protocols. These mechanistic claims are central to explaining emergence and need detailed, reproducible exposition.
Authors: We acknowledge that the interpretability section requires greater detail and reproducibility. In the revision, we will expand this analysis with explicit references to the corresponding figures, quantitative metrics including pre- and post-RL changes in MoE gating entropy and reasoning token activation thresholds, and a full description of the experimental protocols, layer-wise analysis methods, and visualization procedures. revision: yes
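One concrete form such a metric could take, assuming access to routing probabilities; the 48-layer/128-expert shapes follow the base-model description quoted later in this page, and all names are illustrative.

```python
# Per-layer MoE gating entropy before vs. after RL. A larger change in the
# upper half of the stack would support the claim that RL primarily reshapes
# upper-layer gating. Illustrative sketch, not the paper's protocol.
import numpy as np

def gating_entropy(gate_probs: np.ndarray) -> np.ndarray:
    """gate_probs: [layers, tokens, experts] routing distributions.
    Returns mean token-level gating entropy per layer (nats)."""
    p = np.clip(gate_probs, 1e-12, 1.0)
    return (-(p * np.log(p)).sum(axis=-1)).mean(axis=-1)

def upper_vs_lower_shift(before: np.ndarray, after: np.ndarray, split: int = 24):
    """Mean entropy change in the lower vs. upper halves of a 48-layer stack."""
    delta = gating_entropy(after) - gating_entropy(before)
    return delta[:split].mean(), delta[split:].mean()
```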
Circularity Check
Hybrid reward incorporates reference embedding similarity, reducing 'pure RL emergence' claim to supervised alignment with references
specific steps
-
fitted input called prediction
[Abstract]
"we introduce a hybrid reasoning similarity reward that directly supervises the quality of generated reasoning chains by combining an LLM evaluator assessing logical path alignment, key step coverage, and analytical depth with an embedding similarity component enforcing semantic alignment with reference reasoning chains. ... enables high-quality CoT reasoning to emerge through pure RL exploration, without any supervised reasoning fine-tuning, from an instruction-tuned model that possesses no prior chain-of-thought capability. Stage 1 trains on foundational audio QA with the hybrid reward to fos"
The embedding similarity term directly uses reference reasoning chains as the target for semantic alignment in the reward function. Although the paper asserts 'pure RL exploration' and 'without any supervised reasoning fine-tuning,' the generated chains are trained to minimize deviation from those references. This reduces the claimed emergence to optimization against the reference inputs by construction, rather than independent discovery of acoustic-grounded reasoning.
full rationale
The paper's core claim is that high-quality CoT emerges via pure RL exploration from an instruction-tuned base model with no supervised reasoning fine-tuning. However, the Stage 1 hybrid reward explicitly includes an embedding similarity term that enforces semantic alignment to reference reasoning chains. This makes the training signal dependent on the same reference material used to define 'high-quality' reasoning, so the output chains are optimized toward reference patterns by construction. Stage 2's LLM-only reward inherits any patterns shaped in Stage 1. This matches the 'fitted input called prediction' pattern: the reward is constructed using references, then the emergent reasoning is presented as independent discovery. No self-citations, uniqueness theorems, or ansatz smuggling appear in the provided text; the circularity is isolated to the reward design vs. emergence claim.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
Introduction: Recent progress in large language models has demonstrated that reinforcement learning (RL) can unlock sophisticated reasoning capabilities through pure exploration, as exemplified by DeepSeek-R1 [1] and OpenAI o1 [2]. These successes have inspired growing interest in extending RL-based reasoning to multimodal domains [3, 4, 5]. In the aud...
-
[2]
Significant progress has been made in audio understanding through large-scale pre-training
Related Work: Large Audio-Language Models (LALMs). Significant progress has been made in audio understanding through large-scale pre-training. SALMONN [7] proposed a generic hearing framework for LLMs. Audio Flamingo [17] introduced few-shot capabilities and dialogue abilities, while Qwen2-Audio-Instruct [6] pushed the performance of open-source model...
-
[3]
think out loud
Method: We present Audio-DeepThinker, a framework that enables chain-of-thought reasoning capabilities to emerge in Large Audio-Language Models through reinforcement learning, without any supervised reasoning data. Our approach builds on a key insight: just as DeepSeek-R1 [1] demonstrated that reasoning can emerge naturally through RL exploration in t...
-
[4]
Experimental Setup Model
Experiments, 4.1 Experimental Setup. Model: We adopt Qwen3-Omni-30B-A3B-Instruct [5] as our base model, which employs a mixture-of-experts (MoE) architecture with 30B total parameters and 3B active parameters per token. The model contains 48 transformer layers with 128 experts per layer. Notably, the model lacks inherent CoT reasoning capability pri...
-
[5]
Analysis: To gain deeper insights into how CoT reasoning capabilities are acquired through RL training, we conduct comprehensive interpretability analyses. We systematically compare the internal representations across the progressive training pipeline (Base → Stage 1 → Stage 2) at multiple levels of granularity: representation drift across layers, MoE ...
-
[6]
Conclusion: We present Audio-DeepThinker, a progressive two-stage RL framework that enables high-quality CoT reasoning to emerge in LALMs through pure RL exploration, without supervised reasoning fine-tuning. Our hybrid reasoning similarity reward provides fine-grained supervision over reasoning quality, while the progressive curriculum first establishes foun...
-
[7]
DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning
D. Guo, D. Yang, H. Zhang, J. Song, R. Zhang, R. Xu, Q. Zhu, S. Ma, S. Wang, S. Wu et al., "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," arXiv preprint arXiv:2501.12948, 2025.
-
[8]
A. Jaech, L. Adams, M. Brundage et al., "OpenAI o1 system card," arXiv preprint arXiv:2412.16720, 2024.
-
[9]
Vision-R1: Incentivizing Reasoning Capability in Multimodal Large Language Models
W. Huang, B. Jia, Z. Zhai, S. Cao, Z. Ye, F. Zhao, Z. Xu, Y. Hu, and S. Lin, "Vision-R1: Incentivizing reasoning capability in multimodal large language models," arXiv preprint arXiv:2503.06749, 2025.
-
[10]
Visual-RFT: Visual reinforcement fine-tuning
Z. Liu, Z. Sun, Y. Zang, X. Dong, Y. Cao, H. Duan, D. Lin, and J. Wang, "Visual-RFT: Visual reinforcement fine-tuning," in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 2034–2044.
-
[11]
J. Xu, Z. Guo, H. Hu, Y. Chu, X. Wang, J. He, Y. Wang, X. Shi, T. He, X. Zhu et al., "Qwen3-Omni technical report," arXiv preprint arXiv:2509.17765, 2025.
-
[12]
Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., "Qwen2-Audio technical report," arXiv preprint arXiv:2407.10759, 2024.
-
[13]
SALMONN: Towards generic hearing abilities for large language models
C. Tang, W. Yu, G. Sun, X. Chen, T. Tan, W. Li, L. Lu, Z. Ma, and C. Zhang, "SALMONN: Towards generic hearing abilities for large language models," in The Twelfth International Conference on Learning Representations, ICLR 2024, 2024.
-
[14]
Step-Audio 2 technical report
B. Wu, C. Yan, C. Hu, C. Yi, C. Feng, F. Tian, F. Shen, G. Yu, H. Zhang, J. Li et al., "Step-Audio 2 technical report," arXiv preprint arXiv:2507.16632, 2025.
-
[15]
Audio Flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities
S. Ghosh, Z. Kong, S. Kumar, S. Sakshi, J. Kim, W. Ping, R. Valle, D. Manocha, and B. Catanzaro, "Audio Flamingo 2: An audio-language model with long-audio understanding and expert reasoning abilities," in International Conference on Machine Learning. PMLR, 2025, pp. 19358–19405.
-
[16]
Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models
A. Goel, S. Ghosh, J. Kim, S. Kumar, Z. Kong, S.-g. Lee, C.-H. H. Yang, R. Duraiswami, D. Manocha, R. Valle et al., "Audio Flamingo 3: Advancing audio intelligence with fully open large audio language models," arXiv preprint arXiv:2507.08128, 2025.
-
[17]
Reinforcement Learning Outperforms Supervised Fine-Tuning: A Case Study on Audio Question Answering
G. Li, J. Liu, H. Dinkel, Y. Niu, J. Zhang, and J. Luan, "Reinforcement learning outperforms supervised fine-tuning: A case study on audio question answering," arXiv preprint arXiv:2503.11197, 2025.
-
[18]
Omni-R1: Do you really need audio to fine-tune your audio LLM?
A. Rouditchenko, S. Bhati, E. Araujo, S. Thomas, H. Kuehne, R. Feris, and J. Glass, "Omni-R1: Do you really need audio to fine-tune your audio LLM?" arXiv preprint arXiv:2505.09439, 2025.
-
[19]
H. He, X. Du, R. Sun, Z. Dai, Y. Xiao, M. Yang, J. Zhou, X. Li, Z. Liu, Z. Liang et al., "Measuring audio's impact on correctness: Audio-contribution-aware post-training of large audio language models," arXiv preprint arXiv:2509.21060, 2025.
-
[20]
DeepSeekMath: Pushing the Limits of Mathematical Reasoning in Open Language Models
Z. Shao, P. Wang, Q. Zhu, R. Xu, J. Song, M. Zhang, Y. Li, Y. Wu, and D. Guo, "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," arXiv preprint arXiv:2402.03300, 2024.
-
[21]
Audio-Thinker: Guiding audio language model when and how to think via reinforcement learning
S. Wu, C. Li, W. Wang, H. Zhang, H. Wang, M. Yu, and D. Yu, "Audio-Thinker: Guiding audio language model when and how to think via reinforcement learning," arXiv preprint arXiv:2508.08039, 2025.
-
[22]
J. Fan, R. Ren, J. Li, R. Pandey, P. G. Shivakumar, I. Bulyko, A. Gandhe, G. Liu, and Y. Gu, "Incentivizing consistent, effective and scalable reasoning capability in audio LLMs via reasoning process rewards," arXiv preprint arXiv:2510.20867, 2025.
-
[23]
Audio Flamingo: A novel audio language model with few-shot learning and dialogue abilities
Z. Kong, A. Goel, R. Badlani, W. Ping, R. Valle, and B. Catanzaro, "Audio Flamingo: A novel audio language model with few-shot learning and dialogue abilities," in Proceedings of the 41st International Conference on Machine Learning, 2024, pp. 25125–25148.
-
[24]
A. Hurst, A. Lerer, A. P. Goucher, A. Perelman, A. Ramesh, A. Clark, A. Ostrow, A. Welihinda, A. Hayes, A. Radford et al., "GPT-4o system card," arXiv preprint arXiv:2410.21276, 2024.
-
[25]
Our vision for building a universal AI assistant
D. Hassabis, "Our vision for building a universal AI assistant," 2025. [Online]. Available: https://blog.google/technology/google-deepmind/gemini-universal-ai-assistant/
-
[26]
Baichuan-Omni-1.5 technical report
Y. Li, J. Liu, T. Zhang, S. Chen, T. Li, Z. Li, L. Liu, L. Ming, G. Dong, D. Pan et al., "Baichuan-Omni-1.5 technical report," arXiv preprint arXiv:2501.15368, 2025.
-
[27]
J. Xu, Z. Guo, J. He, H. Hu, T. He, S. Bai, K. Chen, J. Wang, Y. Fan, K. Dang, B. Zhang, X. Wang, Y. Chu, and J. Lin, "Qwen2.5-Omni technical report," arXiv preprint arXiv:2503.20215, 2025.
-
[28]
Audio-Reasoner: Improving reasoning capability in large audio language models
X. Zhifei, M. Lin, Z. Liu, P. Wu, S. Yan, and C. Miao, "Audio-Reasoner: Improving reasoning capability in large audio language models," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Nov. 2025, pp. 23829–23851.
-
[29]
Ke-Omni-R: Achieving advanced audio reasoning with a concise 50-words think process
S. Zhao, T. Guo, C. Wen, B. Xiang, W. Zou, and X. Li, "Ke-Omni-R: Achieving advanced audio reasoning with a concise 50-words think process," https://github.com/shuaijiang/Ke-Omni-R, 2025.
-
[30]
GDPO: Group reward-Decoupled Normalization Policy Optimization for Multi-reward RL Optimization
S.-Y. Liu, X. Dong, X. Lu, S. Diao, P. Belcak, M. Liu, M.-H. Chen, H. Yin, Y.-C. F. Wang, K.-T. Cheng et al., "GDPO: Group reward-decoupled normalization policy optimization for multi-reward RL optimization," arXiv preprint arXiv:2601.05242, 2026.
-
[31]
Audio Set: An ontology and human-labeled dataset for audio events
J. F. Gemmeke, D. P. Ellis, D. Freedman, A. Jansen, W. Lawrence, R. C. Moore, M. Plakal, and M. Ritter, "Audio Set: An ontology and human-labeled dataset for audio events," in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 776–780.
-
[32]
Evaluation of algorithms using games: The case of music tagging
E. Law, K. West, M. Mandel, M. Bay, and J. S. Downie, "Evaluation of algorithms using games: The case of music tagging," in 10th International Society for Music Information Retrieval Conference, ISMIR 2009, 2009, pp. 387–392.
-
[33]
A. Yang, A. Li, B. Yang, B. Zhang, B. Hui, B. Zheng, B. Yu, C. Gao, C. Huang, C. Lv et al., "Qwen3 technical report," arXiv preprint arXiv:2505.09388, 2025.
-
[34]
AVQA: A dataset for audio-visual question answering
C. Li, M. Zhao, Q. She et al., "AVQA: A dataset for audio-visual question answering," in Proceedings of the 30th ACM International Conference on Multimedia, 2022, pp. 2785–2795.
-
[35]
A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., "DeepSeek-V3 technical report," arXiv preprint arXiv:2412.19437, 2024.
-
[36]
Mustango: Toward controllable text-to-music generation
J. Melechovsky, Z. Guo, D. Ghosal, N. Majumder, D. Herremans, and S. Poria, "Mustango: Toward controllable text-to-music generation," in Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), 2024, pp. 8286–8309.
-
[37]
IEMOCAP: Interactive emotional dyadic motion capture database
C. Busso, M. Bulut, C.-C. Lee, A. Kazemzadeh, E. Mower, S. Kim, J. N. Chang, S. Lee, and S. S. Narayanan, "IEMOCAP: Interactive emotional dyadic motion capture database," Language Resources and Evaluation, vol. 42, no. 4, pp. 335–359, 2008.
-
[38]
BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation
J. Chen, S. Xiao, P. Zhang, K. Luo, D. Lian, and Z. Liu, "BGE M3-Embedding: Multi-lingual, multi-functionality, multi-granularity text embeddings through self-knowledge distillation," arXiv e-prints, 2024.
-
[39]
T. M. Cover, Elements of Information Theory. John Wiley & Sons, 1999.
-
[40]
SWIFT: A scalable lightweight infrastructure for fine-tuning
Y. Zhao, J. Huang, J. Hu, X. Wang, Y. Mao, D. Zhang, Z. Jiang, Z. Wu, B. Ai, A. Wang, W. Zhou, and Y. Chen, "SWIFT: A scalable lightweight infrastructure for fine-tuning," 2024.
-
[41]
Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism
M. Shoeybi, M. Patwary, R. Puri, P. LeGresley, J. Casper, and B. Catanzaro, "Megatron-LM: Training multi-billion parameter language models using model parallelism," arXiv preprint arXiv:1909.08053, 2019.
-
[42]
MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix
Z. Ma, Y. Ma, Y. Zhu, C. Yang, Y.-W. Chao, R. Xu, W. Chen, Y. Chen, Z. Chen, J. Cong et al., "MMAR: A challenging benchmark for deep reasoning in speech, audio, music, and their mix," in The Thirty-ninth Annual Conference on Neural Information Processing Systems, Datasets and Benchmarks Track, 2025.
-
[43]
Z. Ma, R. Xu, Y. Ma, C.-H. H. Yang, B. Li, J. Kim, J. Xu, J. Li, C. Busso, K. Yu et al., "The Interspeech 2026 Audio Reasoning Challenge: Evaluating reasoning process quality for audio reasoning models and agents," arXiv preprint arXiv:2602.14224, 2026.
-
[44]
MMAU: A massive multi-task audio understanding benchmark
S. Sakshi et al., "MMAU: A massive multi-task audio understanding benchmark," in Conference on Neural Information Processing Systems, 2024.
-
[45]
G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., "Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities," arXiv preprint arXiv:2507.06261, 2025.
-
[46]
D. Ding, Z. Ju, Y. Leng, S. Liu, T. Liu, Z. Shang, K. Shen, W. Song, X. Tan, H. Tang et al., "Kimi-Audio technical report," arXiv preprint arXiv:2504.18425, 2025.
-
[47]
W. Wang, C. Li, L. Zhang, Y. Zhao, Y. Zou, H. Li, M. Cui, H. Zhang, K. Wei, L. Xu et al., "Covo-audio technical report," arXiv preprint arXiv:2602.09823, 2026.
-
[48]
Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space
M. Geva, A. Caciularu, K. Wang, and Y. Goldberg, "Transformer feed-forward layers build predictions by promoting concepts in the vocabulary space," in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, 2022, pp. 30–45.
Appendix excerpts (prompt templates)
A.1 Reference reasoning generation prompt (fragments): If the audio description is completely unrelated to the question and answer, output only "None". Otherwise, write 200-600 word reasoning that identifies key audio evidence, explains how it points to "{answer}", rules out other options, and sounds like analysis, not pre-knowledge. NO meta-commentary, NO repetition. Start immediately with the analysis ("Analysis:").
A.2 Reasoning similarity evaluation prompt, used by the LLM evaluator (Qwen3-235B-A22B-Instruct) to assess the similarity between generated reasoning and the reference CoT: "You are an expert in evaluating reasoning similarity. Compare the following... Do they follow similar logical paths? Do they cover the same key steps? Are the reasoning strategies aligned? Is the depth of analysis comparable? Reference Reasoning (Ground Truth): {gt_think} Generated Reasoning: {pred_think} Output ONLY a number between 0.0 and 1.0. No explanation."
A.3 Model system prompts, used for the Qwen3-Omni-30B-A3B-Instruct and Qwen3-Omni-30B-A3B-Thinking models during evaluation (truncated in the source extraction).
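A hedged sketch of how the A.2 prompt could be wired into a reward pipeline: fill the template, query whichever client serves the evaluator (the `query_llm` callable is a placeholder), and parse the bare 0.0-1.0 score the prompt requests, clamping malformed output.

```python
# Fill the A.2 similarity prompt and parse the single numeric score it asks
# for. The prompt text is abbreviated from the appendix excerpt above; the
# LLM client is injected as a callable.
import re

SIM_PROMPT = (
    "You are an expert in evaluating reasoning similarity. "
    "Compare the following...\n"
    "Reference Reasoning (Ground Truth): {gt_think}\n"
    "Generated Reasoning: {pred_think}\n"
    "Output ONLY a number between 0.0 and 1.0. No explanation."
)

def parse_score(raw: str) -> float:
    """Take the first decimal (or bare 0/1) in the output; 0.0 on junk."""
    m = re.search(r"\d?\.\d+|[01]", raw)
    return min(max(float(m.group()), 0.0), 1.0) if m else 0.0

def similarity_score(gt_think: str, pred_think: str, query_llm) -> float:
    prompt = SIM_PROMPT.format(gt_think=gt_think, pred_think=pred_think)
    return parse_score(query_llm(prompt))
```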