Thinking-while-speaking: A Controlled, Interleaved Reasoning Method for Real-Time Speech Generation

Borui Jiang; Changming Xiao; Han Shu; Qiangyu Yan; Wenshuo Li; Xinghao Chen; Xuan Du

arxiv: 2605.20946 · v1 · pith:MHQO3MFYnew · submitted 2026-05-20 · 💻 cs.CL

Xuan Du, Qiangyu Yan, Wenshuo Li, Borui Jiang, Changming Xiao, Han Shu, Xinghao Chen

Pith reviewed 2026-05-21 05:50 UTC · model grok-4.3

classification 💻 cs.CL

keywords thinking-while-speaking · interleaved reasoning · real-time speech generation · chain of thought · reinforcement learning · speech synthesis · AI dialogue


The pith

InterRS enables AI to reason while speaking by interleaving controlled reasoning steps into fluent speech generation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to make AI communication more human-like by allowing models to perform deep reasoning during real-time speech output rather than pausing to think first. It introduces the InterRS approach, which inserts reasoning steps only at natural gaps in speech generation so that fluency is not disrupted. Achieving this requires high-quality training data in which reasoning and speech segments are precisely aligned and the thinking-to-answer length ratio is kept under control, which the authors accomplish through a new data generation pipeline. Training then uses interleaved supervised fine-tuning on this data combined with reinforcement learning that incorporates a TA-Balance Reward to manage timing and ratios plus a Linguistic Quality Reward to refine expression. If the method works, AI could deliver instant spoken responses on reasoning-heavy tasks like math and logic while matching the speed and naturalness of dedicated spoken-language models.

Core claim

By generating seamlessly interleaved audio data through a dedicated pipeline and training via interleaved SFT plus RL with the TA-Balance Reward for timing control and the Linguistic Quality Reward for expression, the resulting model produces instant responses comparable to spoken-language instruct models outputting fast CoT while achieving 13 percent better performance on mathematical and logic benchmarks and more natural, fluent answers than prior methods.

What carries the argument

The InterRS pipeline that produces precisely aligned interleaved audio data with controlled thinking-to-answer length ratios, combined with RL using TA-Balance and Linguistic Quality rewards to enforce timing and quality during training.
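To make the thinking-to-answer length ratio concrete, the sketch below models an interleaved response as a list of (thinking, answer) segment pairs and scores how far its ratio falls from a target. The `Segment` type, token counts as the length unit, the target value, and the linear penalty are all illustrative assumptions; the paper's actual TA-Balance Reward formula is not given in the text reviewed here.

```python
from dataclasses import dataclass


@dataclass
class Segment:
    """One (thinking, answer) pair; lengths are token counts (an assumed unit)."""
    thinking_len: int
    answer_len: int


def ta_ratio(segments: list[Segment]) -> float:
    """Total thinking length divided by total answer length."""
    thinking = sum(s.thinking_len for s in segments)
    answer = sum(s.answer_len for s in segments)
    if answer == 0:
        raise ValueError("response has no spoken answer tokens")
    return thinking / answer


def ratio_penalty_reward(segments: list[Segment], target: float, tolerance: float) -> float:
    """Hypothetical reward: 1.0 inside the tolerance band, decaying linearly to 0.0 outside it.

    Assumes target > 0. This is NOT the paper's TA-Balance Reward, only a stand-in.
    """
    deviation = abs(ta_ratio(segments) - target)
    if deviation <= tolerance:
        return 1.0
    return max(0.0, 1.0 - (deviation - tolerance) / target)
```

For instance, two pairs with lengths (8, 2) and (12, 3) give a ratio of 20 / 5 = 4.0, which sits inside any tolerance band centred on a target of 4.0.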

If this is right

The model generates instant responses similar to spoken-language instruct models that output fast CoT.
Performance improves by 13 percent on mathematical and logic benchmarks.
Generated answers are more natural and fluent than those from prior methods.
Fluent speech is maintained even while deep reasoning occurs.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This interleaving pattern could be tested in other real-time generative settings such as live translation to see if simultaneous reasoning improves accuracy without added delay.
Controlling the thinking-to-answer ratio might prove useful for balancing depth versus speed in non-speech tasks like streaming text generation.
The approach suggests voice assistants could handle complex queries more effectively by overlapping internal reasoning with ongoing speech output.

Load-bearing premise

A novel pipeline can reliably produce high-quality audio data in which reasoning steps and speech segments stay precisely aligned and the thinking-to-answer length ratio stays under controlled limits.

What would settle it

A test dataset or deployment where the generated interleaved data shows misalignment between reasoning steps and speech segments or uncontrolled length ratios, resulting in loss of the reported 13 percent benchmark gains or reduced speech fluency.
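One way to run such a check is to audit each generated sample for structural problems and out-of-range ratios, then report the failure rate. The sketch below is hypothetical: the sample layout (a list of `(thinking_text, answer_text)` pairs), whitespace word counts as the length measure, and the `max_ratio` bound are assumptions rather than details from the paper, and it inspects only text-level structure, not audio alignment.

```python
def audit_sample(pairs: list[tuple[str, str]], max_ratio: float) -> list[str]:
    """Return the problems found in one interleaved sample (empty list if none)."""
    if not pairs:
        return ["sample has no segments"]
    problems = []
    for i, (_, answer) in enumerate(pairs):
        if not answer.strip():
            # An empty answer segment would leave the speaker silent.
            problems.append(f"segment {i}: empty answer")
    thinking_words = sum(len(t.split()) for t, _ in pairs)
    answer_words = sum(len(a.split()) for _, a in pairs)
    if answer_words == 0 or thinking_words / answer_words > max_ratio:
        problems.append("thinking-to-answer ratio exceeds limit")
    return problems


def failure_rate(samples: list[list[tuple[str, str]]], max_ratio: float) -> float:
    """Fraction of samples with at least one problem."""
    if not samples:
        return 0.0
    failed = sum(1 for s in samples if audit_sample(s, max_ratio))
    return failed / len(samples)
```

A rising failure rate on held-out generated data would be one signal that the load-bearing premise does not hold.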

Figures

Figures reproduced from arXiv: 2605.20946 by Borui Jiang, Changming Xiao, Han Shu, Qiangyu Yan, Wenshuo Li, Xinghao Chen, Xuan Du.

**Figure 1.** Our framework, InterRS, integrates reasoning steps within temporal gaps in speech generation, enabling …

**Figure 2.** Data Pipeline: (The top line) For questions, the original question first undergoes desymbolization …

**Figure 3.** Comparison of responses from Configuration (1) and Configuration (3) for a multi-step reasoning task.

**Figure 4.** Distribution of individual thinking segment lengths across different model configurations.

[Plot text: Total Reward with Moving Average; x-axis Training Step, y-axis Reward; series Total Reward and Moving Average (window=4); Overall Average: 1.671]

**Figure 5.** Rewards of our InterRS in training process.

Original abstract

The thinking-while-speaking paradigm aims to make AI communication more human. A key challenge is maintaining fluent speech while performing deep reasoning. Our method, InterRS, tackles this by inserting reasoning steps only during natural speech generation. This requires high-quality data where reasoning and speech are precisely aligned, and the length ratio are under controlled. We introduce a novel pipeline to generate such seamlessly interleaved audio data. To train our model, we combine interleaved SFT with refined data and reinforcement learning with two new rewards: a TA-Balance Reward to manage timing and thinking-answer ratio, and a Linguistic Quality Reward to refine expression. Experiments show our approach achieves 13% better performance on mathmatical and logic benchmarks while generating instant response like a spoken-language instruct model which outputs fast CoT response. Furthermore, our method generates more natural and fluent answers than prior methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

InterRS introduces a data pipeline and two rewards for interleaving reasoning into fluent speech, but the abstract gives almost no evidence that the pipeline actually works or that the 13% gain is real.


The paper's main move is to generate training data where reasoning steps are slotted into natural pauses in spoken output, then train with interleaved SFT plus RL using a TA-Balance reward to control thinking-to-answer length and a Linguistic Quality reward to keep the speech natural. That specific pipeline plus the two named rewards is the clearest new piece; earlier CoT and TTS work does not combine them this way for real-time speech. The approach is sensible for the stated goal of keeping spoken responses fast while adding math or logic steps without obvious breaks.

The abstract's claim of 13% better benchmark performance alongside instant spoken-style output is the part that needs checking. No baselines, no dataset sizes, no alignment metrics, and no ablation results appear in the provided text, so it is impossible to tell whether the gains come from cleaner data, the new rewards, or just reward tuning. The central assumption that the pipeline produces precisely aligned, high-quality interleaved audio with controlled ratios is stated but not demonstrated with any numbers or failure cases.

This work is for groups already building real-time spoken agents who want to add limited reasoning without wrecking fluency. A reader could borrow the high-level idea of controlled interleaving even if the current numbers stay unverified. I would send it to peer review so the authors can show the pipeline details, alignment statistics, and proper controls; without those the claims stay preliminary.

Referee Report

2 major / 2 minor

Summary. The paper proposes InterRS, a controlled interleaved reasoning method for real-time speech generation that inserts reasoning steps only during natural speech pauses. It introduces a novel pipeline for producing precisely aligned interleaved audio data with controlled thinking-to-answer length ratios, then trains via interleaved SFT combined with RL using two new rewards (TA-Balance for timing/ratio control and Linguistic Quality for expression). The central empirical claim is a 13% performance improvement on mathematical and logic benchmarks while preserving instant, fluent spoken-style output comparable to fast CoT instruct models.

Significance. If the data-generation pipeline and performance gains can be substantiated, the work would address a practically important gap between deep reasoning and fluent real-time speech, with potential impact on conversational AI. The custom rewards and emphasis on timing control represent a targeted technical contribution, but the current absence of verification metrics for the pipeline and experimental details substantially reduces the assessed significance of the reported results.

major comments (2)

[Abstract] The central claim that the novel pipeline reliably produces 'seamlessly interleaved audio data' with precise alignment of reasoning steps and controlled thinking-to-answer length ratios is load-bearing for all downstream results, yet the manuscript supplies no alignment metrics, human evaluation scores, failure-rate statistics, or implementation details for this pipeline.
[Abstract] The reported 13% performance lift on mathematical and logic benchmarks is stated without any baseline details, dataset descriptions, statistical tests, or ablation results, rendering it impossible to evaluate whether the gains support the InterRS method or arise from reward tuning choices.

minor comments (2)

[Abstract] The abstract contains a spelling error: 'mathmatical' should be 'mathematical'.
[Abstract] The phrasing 'the length ratio are under controlled' is grammatically unclear and should be revised for precision (e.g., 'the thinking-to-answer length ratio is kept under explicit control').

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, indicating where revisions will be made to strengthen the manuscript.


Referee: [Abstract] The central claim that the novel pipeline reliably produces 'seamlessly interleaved audio data' with precise alignment of reasoning steps and controlled thinking-to-answer length ratios is load-bearing for all downstream results, yet the manuscript supplies no alignment metrics, human evaluation scores, failure-rate statistics, or implementation details for this pipeline.

Authors: We acknowledge that the manuscript currently lacks explicit quantitative verification of the data pipeline. In the revised version we will add a new subsection describing the pipeline implementation in detail, including the controlled synthesis process for aligning reasoning steps with natural speech pauses and enforcing thinking-to-answer length ratios. We will also report alignment accuracy metrics, human evaluation scores on seamlessness and naturalness, and failure-rate statistics. These additions will allow readers to assess the reliability of the interleaved data.

Revision: yes

Referee: [Abstract] The reported 13% performance lift on mathematical and logic benchmarks is stated without any baseline details, dataset descriptions, statistical tests, or ablation results, rendering it impossible to evaluate whether the gains support the InterRS method or arise from reward tuning choices.

Authors: We agree that the performance claim requires additional context and controls. We will expand the experiments section to include explicit baseline comparisons against fast CoT instruct models and other relevant speech-reasoning systems, full descriptions of the math and logic evaluation datasets, statistical significance tests (e.g., paired t-tests with p-values), and ablation studies isolating the contributions of interleaved SFT, the TA-Balance reward, and the Linguistic Quality reward. This will clarify that the reported gains are attributable to the InterRS method.

Revision: yes
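As a rough illustration of the paired test mentioned in the response, the sketch below computes a paired t statistic from per-item scores of two systems using only the Python standard library. The function name is hypothetical and no scores from the paper are used; converting the statistic to a p-value needs a t-distribution CDF from a statistics package, which is deliberately left out here.

```python
import math
import statistics


def paired_t_statistic(baseline: list[float], treatment: list[float]) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples."""
    if len(baseline) != len(treatment):
        raise ValueError("paired samples must have the same length")
    n = len(baseline)
    if n < 2:
        raise ValueError("need at least two pairs")
    # Work on per-pair differences: treatment minus baseline.
    diffs = [t - b for b, t in zip(baseline, treatment)]
    mean_diff = statistics.mean(diffs)
    sd_diff = statistics.stdev(diffs)  # sample standard deviation (n - 1 denominator)
    if sd_diff == 0:
        raise ValueError("differences have zero variance; t is undefined")
    return mean_diff / (sd_diff / math.sqrt(n)), n - 1
```

Pairing matters because both systems are scored on the same items, so the test looks at the spread of the differences rather than the spread of each system's scores.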

Circularity Check

0 steps flagged

No significant circularity detected in derivation chain


The paper introduces a novel pipeline for generating aligned interleaved audio data, defines two new RL rewards (TA-Balance for timing/ratio control and Linguistic Quality for expression), combines them with interleaved SFT, and reports experimental gains of 13% on math/logic benchmarks while preserving fast spoken-style output. No load-bearing step reduces by construction to its own inputs: the rewards are explicitly engineered for controllable aspects rather than being fitted to the target benchmark scores, the pipeline is presented as a new construction whose outputs are then used for training, and the performance claims rest on external benchmark evaluation rather than tautological renaming or self-referential prediction. The derivation remains self-contained against the stated experimental results.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the unverified ability of the new data pipeline to produce precisely aligned interleaved audio and on the effectiveness of the two newly introduced reward functions; these are treated as domain assumptions without independent external validation in the abstract.

free parameters (1)

thinking-answer length ratio
Explicitly controlled during data generation and managed by the TA-Balance Reward; value not numerically specified in abstract.

axioms (1)

[domain assumption] A novel pipeline can generate high-quality audio data with precise alignment between reasoning steps and speech segments.
Invoked as the foundation for training data quality in the abstract.

invented entities (3)

InterRS method (no independent evidence)
purpose: Insert reasoning steps only during natural speech generation
Newly proposed technique for interleaving.

TA-Balance Reward (no independent evidence)
purpose: Manage timing and thinking-answer ratio
New reward signal introduced for reinforcement learning.

Linguistic Quality Reward (no independent evidence)
purpose: Refine expression and fluency
New reward signal introduced for reinforcement learning.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Paper passage: We introduce a novel pipeline to generate such seamlessly interleaved audio data... TA-Balance Reward... Linguistic Quality Reward... ratio of 4:1 for thinking:answer.

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat_induction · unclear

Paper passage: S = (T1, A1), (T2, A2), …, (Tn, An)

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

