LaSR: Context-Aware Speech Recognition via Latent Reasoning

Heyang Liu; Jiayi Huang; Qunshan Gu; Ronghua Wu; Wenyang Xiao; Yanfeng Wang; Yu Wang; Ziyang Cheng

arxiv: 2606.00507 · v1 · pith:YNOWIVCLnew · submitted 2026-05-30 · 💻 cs.CL


Pith reviewed 2026-06-28 19:10 UTC · model grok-4.3


keywords: speech large language models, latent reasoning, context-aware speech recognition, chain-of-thought supervision, terminology recognition, Spoken Darwin-Science corpus


The pith

LaSR aligns chain-of-thought supervision to acoustic features and inserts latent reasoning periods to improve contextual speech recognition.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LaSR, a training paradigm for speech large language models that builds a context-aware reasoning trajectory through latent processing rather than visible intermediate steps. It places chain-of-thought supervision at the acoustic regions corresponding to target words and adds latent reasoning periods to ground context information and handle the transcriptional transition. The paper also introduces Spoken Darwin-Science, a corpus of academic terminology for benchmarking contextual recognition of specialized vocabulary. In preliminary experiments on Fun-Audio-Chat, the authors report better terminology recognition, with no added latency and with gains over standard supervised fine-tuning. The method targets better capture of speaker intent and topical context in spoken assistants.

Core claim

LaSR is a novel training paradigm featuring a context-aware reasoning trajectory that leverages the latent reasoning process. Instead of generating explicit intermediate tokens, LaSR aligns chain-of-thought (CoT) supervision around the acoustic feature region of the targeted word, and introduces latent reasoning periods for context information grounding and transcriptional transition.

What carries the argument

Context-aware reasoning trajectory that aligns CoT supervision to acoustic feature regions and uses latent reasoning periods for grounding and transition.
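To make the idea of placing latent periods around a target word's acoustic region more concrete, here is a minimal sketch of one possible sequence layout. It is an illustration only, not the authors' procedure: the abstract gives no alignment algorithm, so the `Slot` type, the `build_layout` function, the slot counts `n_ground` and `n_transition`, and the choice of which positions carry supervision are all hypothetical.

```python
from dataclasses import dataclass
from typing import List

# Hypothetical slot kinds; the paper's abstract does not define a layout.
AUDIO, LATENT = "audio", "latent"


@dataclass
class Slot:
    kind: str         # "audio" or "latent"
    frame: int        # audio frame index, or -1 for a latent slot
    supervised: bool  # assumption: CoT supervision attaches to these slots


def build_layout(num_frames: int, start: int, end: int,
                 n_ground: int = 2, n_transition: int = 1) -> List[Slot]:
    """Place latent slots before and after the target-word frames [start, end)."""
    if not 0 <= start < end <= num_frames:
        raise ValueError("target span must lie inside the utterance")
    # Frames before the target word: plain audio, no extra supervision.
    layout = [Slot(AUDIO, t, False) for t in range(start)]
    # Assumed "grounding" period inserted just before the target word.
    layout += [Slot(LATENT, -1, True) for _ in range(n_ground)]
    # Acoustic region of the target word itself.
    layout += [Slot(AUDIO, t, True) for t in range(start, end)]
    # Assumed "transition" period before ordinary transcription resumes.
    layout += [Slot(LATENT, -1, True) for _ in range(n_transition)]
    # Remaining frames after the target word.
    layout += [Slot(AUDIO, t, False) for t in range(end, num_frames)]
    return layout


layout = build_layout(num_frames=10, start=4, end=6)
print([s.kind for s in layout])
```

In this sketch the latent slots add sequence positions but no output tokens, which is one way to read the claim that reasoning happens without explicit intermediate tokens. Whether the paper's latent periods work this way cannot be confirmed from the abstract.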

If this is right

Terminology recognition accuracy rises in speech LLMs.
Response latency stays unchanged.
Performance exceeds that of standard supervised fine-tuning baselines.
Models better reflect speaker intent and topical context during recognition.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach may transfer to other context-sensitive speech tasks such as intent detection or summarization.
The Spoken Darwin-Science corpus could become a reusable testbed for comparing context-aware speech methods.
Avoiding explicit tokens might lower overall training cost in larger speech systems.

Load-bearing premise

That aligning chain-of-thought supervision around the acoustic feature region of the targeted word, combined with latent reasoning periods, successfully grounds context information and enables transcriptional transition without explicit intermediate tokens.

What would settle it

A head-to-head test on the Spoken Darwin-Science corpus in which LaSR produces no measurable gain in terminology recognition accuracy over standard supervised fine-tuning.
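A head-to-head comparison of this kind needs a concrete terminology metric. The sketch below shows one simple candidate, term recall by case-insensitive substring match. The metric definition, the function name `term_recall`, and the toy transcripts are assumptions made for illustration; the paper's actual metric is not given in the abstract, and the strings below are not results from any system.

```python
from typing import Iterable, List, Tuple


def term_recall(pairs: Iterable[Tuple[List[str], str]]) -> float:
    """Fraction of reference terms found (case-insensitively) in the hypotheses.

    Each pair is (reference terms for one utterance, system transcript).
    """
    hits = 0
    total = 0
    for terms, hypothesis in pairs:
        hyp = hypothesis.lower()
        for term in terms:
            total += 1
            hits += term.lower() in hyp
    return hits / total if total else 0.0


# Toy, made-up transcripts for two hypothetical systems on the same utterances.
system_a = [
    (["mitochondria"], "the mighty conner is the powerhouse of the cell"),
    (["allele"], "each allele is inherited from one parent"),
]
system_b = [
    (["mitochondria"], "the mitochondria is the powerhouse of the cell"),
    (["allele"], "each allele is inherited from one parent"),
]

print(term_recall(system_a), term_recall(system_b))
```

If the two systems score the same on such a metric over the full corpus, within the noise of a significance test, that would count against the paper's central claim.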

Figures

Figures reproduced from arXiv: 2606.00507 by Heyang Liu, Jiayi Huang, Qunshan Gu, Ronghua Wu, Wenyang Xiao, Yanfeng Wang, Yu Wang, Ziyang Cheng.

**Figure 2.** Dataset statistics of Spoken Darwin-Science 20% subset. (a) The instance distribution of various subjects;

**Figure 3.** LaSR training method. (a) Structured causal reasoning trajectory of LaSR; (b) Comparison with textual

**Figure 4.** Supervised finetuning results on Qwen3-ASR.

**Figure 5.** Average period ratio in our anchor strategies.

**Figure 7.** Appendix B, Spoken Darwin-Science Statistics: the number of instances and total duration of Spoken Darwin-Science.

**Figure 8.** The screenshot of human verification.

Original abstract

Recent advances in Speech Large Language Models (Speech LLMs) have significantly enhanced spoken language understanding and reasoning. However, their contextual awareness is limited, struggling to perform speech recognition that effectively reflects the speaker's intent and topical context. In this paper, we propose LaSR (Latent Speech Reasoning), a novel training paradigm featuring a context-aware reasoning trajectory that leverages the latent reasoning process. Instead of generating explicit intermediate tokens, LaSR aligns chain-of-thought (CoT) supervision around the acoustic feature region of the targeted word, and introduces latent reasoning periods for context information grounding and transcriptional transition. Furthermore, to effectively benchmark contextual recognition on specialized vocabulary, we propose Spoken Darwin-Science, a large-scale corpus focusing on academic terminologies. Preliminary experiments on Fun-Audio-Chat demonstrate that LaSR significantly improves terminology recognition without introducing additional latency and consistently outperforms standard supervised fine-tuning baselines. Our findings highlight the potential of latent reasoning in building efficient, context-aware speech assistants.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

LaSR offers a latent reasoning training setup for context in speech LLMs plus a new academic-term benchmark, but rests on preliminary results with no metrics or details shown.


The main point is a training method that aligns chain-of-thought supervision to acoustic regions of target words and inserts latent reasoning periods instead of explicit tokens, aiming to ground context and improve specialized vocabulary recognition in speech LLMs without added latency.

What stands out as new is the specific combination of acoustic-region alignment and those latent periods, presented as a distinct paradigm from standard supervised fine-tuning. The Spoken Darwin-Science corpus for academic terminologies is a concrete, usable addition that could help test similar ideas in technical dictation settings.

The experiments claim consistent gains over baselines on Fun-Audio-Chat. That framing is straightforward and the latency claim is a practical plus. The paper also avoids obvious circularity by treating the improvement as an empirical outcome rather than a fitted quantity.

The soft spot is the evidence itself. The abstract gives no numbers, no baseline details, no statistical tests, and no error analysis, and the full methods are not verifiable here. This leaves the central assumption—that the latent periods successfully ground context—plausible but untested in any detail that can be checked.

This is for researchers working on speech LLMs for domain-specific or technical use cases. A reader focused on benchmarks or alternative reasoning mechanisms in audio models could extract value from the corpus and the training sketch even if the results section needs expansion.

I would send it for peer review. The training idea and the new data set give it enough substance to justify referee time, though the experimental reporting will need to be filled in.

Referee Report

2 major / 1 minor

Summary. The manuscript proposes LaSR, a training paradigm for Speech LLMs that uses a context-aware reasoning trajectory with latent reasoning periods to align chain-of-thought supervision around acoustic feature regions of targeted words, enabling better contextual grounding without explicit intermediate tokens. It introduces the Spoken Darwin-Science corpus for benchmarking specialized academic vocabulary and reports preliminary experiments on Fun-Audio-Chat claiming improved terminology recognition with no added latency over standard supervised fine-tuning.

Significance. If the empirical gains can be rigorously quantified and reproduced, the approach would demonstrate a latency-preserving mechanism for injecting context into speech recognition, with potential value for domain-specific applications such as academic or technical transcription.

major comments (2)

[Abstract] The claim that LaSR 'significantly improves terminology recognition' and 'consistently outperforms standard supervised fine-tuning baselines' is presented without any numerical metrics (e.g., WER, accuracy deltas), baseline specifications, statistical tests, or error analysis, which is load-bearing for the central empirical claim.
[Abstract / LaSR description] The mechanism of 'latent reasoning periods' for 'context information grounding and transcriptional transition' is introduced at a conceptual level with no equations, pseudocode, or precise alignment procedure, preventing assessment of whether the method is reproducible or reduces to standard fine-tuning.
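For reference, the word error rate (WER) mentioned in the first major comment is conventionally the word-level edit distance between reference and hypothesis, divided by the number of reference words. The sketch below is a plain dynamic-programming version with whitespace tokenization and no text normalization. Those simplifications are assumptions, and the paper's own scoring setup is not described in the abstract.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] = edit distance between the reference prefix seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting all i reference words matches an empty hypothesis
        for j, hyp_word in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,                          # deletion
                curr[j - 1] + 1,                      # insertion
                prev[j - 1] + (ref_word != hyp_word)  # substitution or match
            ))
        prev = curr
    return prev[-1] / len(ref)


print(word_error_rate("the cat sat", "the bat sat"))
```

Reporting this figure alongside a terminology-specific score would let a reader see whether gains on specialized vocabulary come at the cost of overall transcription quality.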

minor comments (1)

[Abstract] The abstract mentions a 'large-scale corpus' but supplies no statistics on size, vocabulary coverage, or construction details; adding a brief quantitative summary would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments point-by-point below and will incorporate revisions to strengthen the abstract and methodological clarity.


Referee: [Abstract] The claim that LaSR 'significantly improves terminology recognition' and 'consistently outperforms standard supervised fine-tuning baselines' is presented without any numerical metrics (e.g., WER, accuracy deltas), baseline specifications, statistical tests, or error analysis, which is load-bearing for the central empirical claim.

Authors: We agree that the abstract's empirical claims would benefit from explicit quantitative support. The full manuscript reports concrete results from preliminary experiments on Fun-Audio-Chat (including WER and terminology accuracy deltas versus standard SFT baselines), but these were not summarized numerically in the abstract. In revision we will add specific metrics, baseline details, and any available statistical context directly into the abstract while preserving its length constraints.

revision: yes

Referee: [Abstract / LaSR description] The mechanism of 'latent reasoning periods' for 'context information grounding and transcriptional transition' is introduced at a conceptual level with no equations, pseudocode, or precise alignment procedure, preventing assessment of whether the method is reproducible or reduces to standard fine-tuning.

Authors: We acknowledge the abstract presents the latent reasoning periods at a high level. The manuscript body contains the alignment procedure for CoT supervision on acoustic regions, but it lacks explicit equations or pseudocode. In the revised version we will add a concise algorithmic description or pseudocode box in the methods section and ensure the abstract references this distinction from standard fine-tuning to improve reproducibility assessment.

revision: yes

Circularity Check

0 steps flagged

No significant circularity identified


The paper introduces LaSR as a training paradigm that aligns chain-of-thought supervision to acoustic regions and adds latent reasoning periods, then reports empirical gains on terminology recognition versus supervised fine-tuning baselines on Fun-Audio-Chat. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The claimed improvement is framed as an external experimental comparison rather than a quantity forced by the method's own definitions or prior self-referential results. The derivation chain is therefore self-contained as an empirical proposal.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on the untested premise that latent internal states can substitute for explicit reasoning tokens while preserving context; no free parameters, standard axioms, or new entities with independent evidence are described in the abstract.

axioms (1)

domain assumption: Aligning CoT supervision to acoustic feature regions of target words enables effective context grounding and transcriptional transition without explicit tokens.
This premise is required for the latent reasoning periods to function as claimed.

invented entities (1)

latent reasoning periods (no independent evidence)
purpose: To perform context information grounding and transcriptional transition inside the model without generating explicit tokens.
New construct introduced in the abstract as a core component of LaSR.



