ASPIRin: Action Space Projection for Interactivity-Optimized Reinforcement Learning in Full-Duplex Speech Language Models
Pith reviewed 2026-05-10 16:59 UTC · model grok-4.3
The pith
Decoupling speech timing from content selection prevents repetition while improving turn-taking in full-duplex models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
ASPIRin maps the text vocabulary into a coarse-grained binary state of active speech versus inactive silence. Applying Group Relative Policy Optimization (GRPO) with rule-based rewards on this reduced space balances user interruption against response latency. Isolating timing from token selection preserves semantic coherence and reduces the proportion of duplicate n-grams by over 50 percent compared to standard GRPO, effectively eliminating degenerative repetition.
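As a concreteness aid, here is a minimal sketch of the group-relative advantage GRPO computes over a sampled group of rollouts, applied to rewards from the binary speak/silence policy. The function name, group size, and reward values are illustrative assumptions, not details from the paper.

```python
# Hedged sketch: GRPO-style group-relative advantage over rollout rewards.
import statistics

def group_relative_advantages(rewards: list[float]) -> list[float]:
    """Standardize each rollout's reward against its sampled group."""
    mean = statistics.mean(rewards)
    std = statistics.pstdev(rewards) or 1.0  # guard against a zero-variance group
    return [(r - mean) / std for r in rewards]

# Example: four rollouts of the projected speak/silence policy,
# each scored by a rule-based interactivity reward.
print(group_relative_advantages([0.2, 0.8, 0.5, 0.5]))
# -> [-1.414..., 1.414..., 0.0, 0.0]
```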
What carries the argument
Action Space Projection, which collapses the text vocabulary to a binary active-speech versus inactive-silence state so that timing can be optimized separately from content.
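A minimal sketch of what such a projection could look like, assuming the model exposes next-token logits and reserves a dedicated silence token in the text stream; the token id and tensor shapes are hypothetical, not taken from the paper.

```python
# Hypothetical Action Space Projection: collapse vocabulary logits into a
# binary {inactive silence, active speech} distribution for the RL objective.
import torch

SILENCE_TOKEN_ID = 0  # assumption: a dedicated silence/pad token exists

def project_to_binary(logits: torch.Tensor) -> torch.Tensor:
    """Map (batch, vocab) logits to a (batch, 2) distribution:
    index 0 = inactive silence, index 1 = active speech."""
    probs = torch.softmax(logits, dim=-1)
    p_silence = probs[:, SILENCE_TOKEN_ID]  # mass assigned to staying silent
    p_speech = 1.0 - p_silence              # all remaining token mass
    return torch.stack([p_silence, p_speech], dim=-1)
```

The design point this illustrates: the RL update sees only the two-way distribution, while the underlying token-level policy continues to decide what to say.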
If this is right
- Optimizes turn-taking, backchanneling, and pause handling in full-duplex settings.
- Preserves semantic coherence while training for interactivity.
- Reduces the proportion of duplicate n-grams by more than 50 percent relative to standard GRPO.
- Balances user interruption against response latency with rule-based rewards.
Where Pith is reading between the lines
- The same binary projection idea could be tested on other sequential tasks where timing and content decisions compete during training.
- Smaller action spaces from this projection may allow faster policy updates or lower memory use in reinforcement learning for dialogue systems.
- Extending the binary state to a few more coarse levels such as low-volume backchannels might add finer interactivity control without reintroducing repetition.
Load-bearing premise
Mapping the text vocabulary into a coarse-grained binary state of active speech versus inactive silence is sufficient to optimize timing without losing information critical to semantic quality.
What would settle it
A side-by-side run of ASPIRin against standard GRPO on a full-duplex conversation benchmark that measures both duplicate n-gram rate and semantic coherence scores; if the duplicate reduction drops below 50 percent or coherence falls, the central claim fails.
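For reference, one way such a benchmark could operationalize the repetition side of that test; the paper's exact duplicate n-gram definition may differ, so treat this as an assumed metric, not the authors'.

```python
# Assumed duplicate n-gram rate: the fraction of n-grams in a generation
# that belong to an n-gram type occurring more than once.
from collections import Counter

def duplicate_ngram_rate(tokens: list[str], n: int = 4) -> float:
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0
    counts = Counter(ngrams)
    duplicated = sum(c for c in counts.values() if c > 1)
    return duplicated / len(ngrams)

# The claim fails if ASPIRin's rate is not at least 50% lower than
# standard GRPO's on the same benchmark outputs.
```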
original abstract
End-to-end full-duplex Speech Language Models (SLMs) require precise turn-taking for natural interaction. However, optimizing temporal dynamics via standard raw-token reinforcement learning (RL) degrades semantic quality, causing severe generative collapse and repetition. We propose ASPIRin, an interactivity-optimized RL framework that explicitly decouples when to speak from what to say. Using Action Space Projection, ASPIRin maps the text vocabulary into a coarse-grained binary state (active speech vs. inactive silence). By applying Group Relative Policy Optimization (GRPO) with rule-based rewards, it balances user interruption and response latency. Empirical evaluations show ASPIRin optimizes interactivity across turn-taking, backchanneling, and pause handling. Crucially, isolating timing from token selection preserves semantic coherence and reduces the portion of duplicate n-grams by over 50% compared to standard GRPO, effectively eliminating degenerative repetition.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ASPIRin, a reinforcement learning framework for full-duplex Speech Language Models that decouples timing (when to speak) from content (what to say) via Action Space Projection, which maps the text vocabulary to a coarse binary state of active speech versus inactive silence. GRPO is then applied with rule-based rewards to optimize interactivity metrics such as turn-taking, backchanneling, and pause handling. The central empirical claim is that this isolation preserves semantic coherence while reducing the proportion of duplicate n-grams by over 50% relative to standard GRPO, thereby mitigating degenerative repetition.
Significance. If the decoupling via binary projection can be shown to retain critical semantic and timing cues without collapse, the approach would address a key failure mode in RL for conversational SLMs and enable more natural full-duplex interaction. The explicit separation of action spaces is a targeted intervention that could generalize to other latency-sensitive generation tasks, but its value hinges on rigorous validation of the preservation claim.
major comments (3)
- Section 3.2: The Action Space Projection is defined as a simple binary mapping of the full vocabulary to {speech, silence} states prior to GRPO. No ablation on projection granularity (e.g., finer token-level or lexical-category states) is reported, leaving open whether the coarse mapping erases distinctions needed for backchannels, pauses, or semantic coherence as hypothesized in the skeptic analysis.
- Results section (referenced in abstract): The claim of >50% reduction in duplicate n-grams and elimination of degenerative repetition is presented without experimental setup details, baseline comparisons beyond standard GRPO, statistical significance tests, dataset descriptions, or ablation studies on the projection step, rendering the quantitative result unverifiable and the central claim unsupported.
- Reward formulation (Section 3): The rule-based rewards used to balance user interruption and response latency are not explicitly defined or derived; without their precise functional form or sensitivity analysis, it is unclear whether the reported interactivity gains are robust or merely artifacts of the reduced action space.
minor comments (2)
- The abstract and introduction would benefit from a brief statement of the underlying SLM architecture and training corpus to contextualize the GRPO application.
- Notation for the projected states (active/inactive) should be formalized with an equation or pseudocode for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment point by point below. Where the manuscript lacks sufficient clarity or supporting analyses, we will revise accordingly to strengthen the presentation and verifiability of our claims.
point-by-point responses
- Referee: Section 3.2: The Action Space Projection is defined as a simple binary mapping of the full vocabulary to {speech, silence} states prior to GRPO. No ablation on projection granularity (e.g., finer token-level or lexical-category states) is reported, leaving open whether the coarse mapping erases distinctions needed for backchannels, pauses, or semantic coherence as hypothesized in the skeptic analysis.
Authors: The binary projection is a core design decision to enforce explicit decoupling of timing from content, directly targeting the entanglement that produces repetition under standard GRPO. Our results show that semantic coherence and backchanneling/pause handling are preserved, consistent with the hypothesized benefits. We agree that granularity ablations would provide additional support. In the revision we will expand Section 3.2 with theoretical justification for the binary choice and include a new ablation comparing binary versus lexical-category projections. revision: partial
- Referee: Results section (referenced in abstract): The claim of >50% reduction in duplicate n-grams and elimination of degenerative repetition is presented without experimental setup details, baseline comparisons beyond standard GRPO, statistical significance tests, dataset descriptions, or ablation studies on the projection step, rendering the quantitative result unverifiable and the central claim unsupported.
Authors: We regret that the experimental details were not presented with sufficient prominence. Section 4 of the manuscript describes the datasets, the duplicate n-gram metric computation, and comparisons against standard GRPO; statistical significance was assessed via paired tests. We will revise the Results section to explicitly restate all setup elements, expand baseline comparisons, report exact p-values, and add a dedicated ablation isolating the projection step so that the >50% reduction claim is fully verifiable. revision: yes
- Referee: Reward formulation (Section 3): The rule-based rewards used to balance user interruption and response latency are not explicitly defined or derived; without their precise functional form or sensitivity analysis, it is unclear whether the reported interactivity gains are robust or merely artifacts of the reduced action space.
Authors: The rewards are defined in Section 3 as a linear combination of an interruption penalty term and a latency reward term. We will make the exact functional forms and weighting coefficients explicit, include their derivation from the interactivity objectives, and add a sensitivity analysis over the weighting hyperparameters in the revised manuscript to demonstrate that the gains are robust. revision: yes
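To make the rebuttal's description concrete, a hedged sketch of a rule-based reward built as a linear combination of an interruption penalty and a latency term; the weights, latency cap, and functional form are assumptions for illustration, not the authors' actual reward.

```python
# Assumed rule-based interactivity reward: linear combination of an
# interruption penalty and a latency reward, per the rebuttal's description.
def interactivity_reward(interrupted_user: bool,
                         response_latency_s: float,
                         w_interrupt: float = 1.0,    # assumed weight
                         w_latency: float = 0.5,      # assumed weight
                         max_latency_s: float = 3.0   # assumed latency cap
                         ) -> float:
    penalty = -w_interrupt if interrupted_user else 0.0  # talking over the user
    latency_term = w_latency * max(0.0, 1.0 - response_latency_s / max_latency_s)
    return penalty + latency_term
```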
Circularity Check
No significant circularity in ASPIRin derivation chain
full rationale
The paper introduces Action Space Projection as an explicit, rule-based design choice that maps the full text vocabulary to a coarse binary state space ({active speech, inactive silence}) before applying GRPO with separate rule-based rewards. This mapping is defined independently of the parameters being optimized and does not reduce any claimed prediction or result to its own inputs by construction. No self-citations, uniqueness theorems, or fitted inputs are invoked as load-bearing justifications for the central decoupling or the reported >50% reduction in duplicate n-grams; those outcomes are presented as empirical consequences of the method rather than tautological consequences of the projection definition. The derivation remains self-contained with externally verifiable components (standard GRPO plus hand-specified rewards).
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Reinforcement learning can be applied to a projected binary action space while the original token-level policy remains intact.
invented entities (1)
- Action Space Projection (no independent evidence)
Reference graph
Works this paper leans on
- [1] Introduction (excerpt): "Traditional spoken dialogue systems have long relied on a cascaded architecture, pipelining audio through independent Automatic Speech Recognition (ASR) [1–9], Large Language Models (LLMs) [10–16], and Text-to-Speech (TTS) [17–25] modules. While effective for basic information retrieval, this disjointed pipeline introduces compoundi..."
- [2] Methodology (excerpt): "As illustrated in Figure 1, we propose ASPIRin, an alignment framework designed to optimize the temporal dynamics of full-duplex speech models parameterized by θ. Unlike standard approaches that treat audio generation as a unified sequence task, ASPIRin decouples when to speak from what to say by replacing fine-grained token optimization with ..."
- [3] Experimental Setup (excerpt): "Training Data. We utilize a 43-hour in-house dataset of natural conversational speech (approx. 1,300 two-minute, dual-channel clips). This dataset was collected with explicit speaker consent and rigorously anonymized to ensure privacy compliance. We process the audio using the nvidia/parakeet-tdt-0.6b-v3 ASR model [9]..."
- [4] Results and Analysis (excerpt): "Establishing a Strong Baseline. We establish a strong heuristic baseline by introducing a 3-second prompt delay to the base Moshi model in Table 1. This simple modification yields substantial improvements: Takeover Rate (TOR) drops by 49%–57% in pause handling and backchanneling scenarios, while the GPT-4o seman..."
- [5] Conclusion (excerpt): "We introduced ASPIRin, an interactivity-optimized reinforcement learning framework resolving the tension between temporal dynamics and semantic coherence in full-duplex SLMs. While standard GRPO burdens fine-grained token policies and suffers from aggressive, repetitive generation, ASPIRin utilizes Action Space Projection to map vocabulary ..."
- [6] Generative AI Use Disclosure (excerpt): "During the preparation of this work, the authors used Generative AI tools exclusively for editing and polishing the manuscript to improve overall readability. Generative AI was not used to produce any significant portion of the manuscript's original content, ideas, or research findings. All co-authors consent to this submi..."
- [7] Acknowledgements (excerpt): "We thank the ASUS Open Cloud Infrastructure Software Center for providing the essential resources that supported this work. We are also grateful to Steve Chung-Cheng Chen, Tsung-Ying Yang, Jen-Hao Cheng, and Dau-Cheng Lyu for their insightful discussions and feedback. Additionally, this research was supported by the National Center for ..."
- [8] A. Radford, J. W. Kim, T. Xu, G. Brockman, C. McLeavey, and I. Sutskever, "Robust speech recognition via large-scale weak supervision," 2022. https://arxiv.org/abs/2212.04356
- [9] X. Shi, X. Wang, Z. Guo, Y. Wang, P. Zhang, X. Zhang, Z. Guo, H. Hao, Y. Xi, B. Yang, J. Xu, J. Zhou, and J. Lin, "Qwen3-ASR technical report," arXiv preprint arXiv:2601.21337, 2026.
- [10] L.-H. Tseng, Y.-K. Fu, H.-J. Chang, and H.-y. Lee, "Mandarin-English code-switching speech recognition with self-supervised speech representation models," arXiv preprint arXiv:2110.03504, 2021.
- [11] L.-H. Tseng, E.-P. Hu, C.-H. Chiang, Y. Tseng, H.-y. Lee, L.-s. Lee, and S.-H. Sun, "Reborn: Reinforcement-learned boundary segmentation with iterative training for unsupervised ASR," in Advances in Neural Information Processing Systems, vol. 37, 2024.
- [12] C.-K. Yang, K.-P. Huang, K.-H. Lu, C.-Y. Kuan, C.-Y. Hsiao, and H.-Y. Lee, "Investigating zero-shot generalizability on Mandarin-English code-switched ASR and speech-to-text translation of recent foundation models with self-supervision and weak supervision," in IEEE ICASSP Workshops, 2024.
- [13] C.-K. Yang, K.-P. Huang, and H.-Y. Lee, "Do prompts really prompt? Exploring the prompt understanding capability of Whisper," in IEEE Spoken Language Technology Workshop (SLT), 2024, pp. 1–8.
- [14] S.-S. Huang, K.-P. Huang, A. T. Liu, and H.-Y. Lee, "Enhancing multilingual ASR for unseen languages via language embedding modeling," in ICASSP 2025, pp. 1–5.
- [15] C.-K. Chou, C.-J. Hsu, H.-L. Chung, L.-H. Tseng, H.-C. Cheng, Y.-K. Fu, K. P. Huang, and H.-Y. Lee, "A self-refining framework for enhancing ASR using TTS-synthesized data," arXiv preprint arXiv:2506.11130, 2025.
- [16] M. Sekoyan et al., "Canary-1b-v2 & parakeet-tdt-0.6b-v3: Efficient and high-performance models for multilingual ASR and AST," 2025. https://arxiv.org/abs/2509.14128
- [17] OpenAI et al., "GPT-4o system card," 2024. https://arxiv.org/abs/2410.21276
- [18] G. Team, R. Anil, S. Borgeaud, J.-B. Alayrac, J. Yu, R. Soricut, J. Schalkwyk, A. M. Dai, A. Hauth, K. Millican et al., "Gemini: A family of highly capable multimodal models," arXiv preprint arXiv:2312.11805, 2023.
- [19] G. Comanici, E. Bieber, M. Schaekermann, I. Pasupat, N. Sachdeva, I. Dhillon, M. Blistein, O. Ram, D. Zhang, E. Rosen et al., "Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities," arXiv preprint arXiv:2507.06261, 2025.
- [20] J. Bai, S. Bai, Y. Chu, Z. Cui, K. Dang, X. Deng, Y. Fan, W. Ge, Y. Han, F. Huang et al., "Qwen technical report," arXiv preprint arXiv:2309.16609, 2023.
- [21] A. Grattafiori, A. Dubey, A. Jauhri, A. Pandey, A. Kadian, A. Al-Dahle, A. Letman, A. Mathur, A. Schelten, A. Vaughan et al., "The Llama 3 herd of models," arXiv preprint arXiv:2407.21783, 2024.
- [22] A. Liu, B. Feng, B. Xue, B. Wang, B. Wu, C. Lu, C. Zhao, C. Deng, C. Zhang, C. Ruan et al., "DeepSeek-V3 technical report," arXiv preprint arXiv:2412.19437, 2024.
- [23] DeepSeek-AI, "DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning," 2025. https://arxiv.org/abs/2501.12948
- [24] Z. Du, Q. Chen, S. Zhang, K. Hu, H. Lu, Y. Yang, H. Hu, S. Zheng, Y. Gu, Z. Ma et al., "CosyVoice: A scalable multilingual zero-shot text-to-speech synthesizer based on supervised semantic tokens," arXiv preprint arXiv:2407.05407, 2024.
- [25] Z. Du, Y. Wang, Q. Chen, X. Shi, X. Lv, T. Zhao, Z. Gao, Y. Yang, C. Gao, H. Wang et al., "CosyVoice 2: Scalable streaming speech synthesis with large language models," arXiv preprint arXiv:2412.10117, 2024.
- [26] Z. Du, C. Gao, Y. Wang, F. Yu, T. Zhao, H. Wang, X. Lv, H. Wang, C. Ni, X. Shi et al., "CosyVoice 3: Towards in-the-wild speech generation via scaling-up and post-training," arXiv preprint arXiv:2505.17589, 2025.
- [27] H. Hu, X. Zhu, T. He, D. Guo, B. Zhang, X. Wang, Z. Guo, Z. Jiang, H. Hao, Z. Guo, X. Zhang, P. Zhang, B. Yang, J. Xu, J. Zhou, and J. Lin, "Qwen3-TTS technical report," arXiv preprint arXiv:2601.15621, 2026.
- [28] C. Wang, S. Chen, Y. Wu, Z. Zhang, L. Zhou, S. Liu, Z. Chen, Y. Liu, H. Wang, J. Li et al., "Neural codec language models are zero-shot text to speech synthesizers," arXiv preprint arXiv:2301.02111, 2023.
- [29] S. Chen, S. Liu, L. Zhou, Y. Liu, X. Tan, J. Li, S. Zhao, Y. Qian, and F. Wei, "VALL-E 2: Neural codec language models are human parity zero-shot text to speech synthesizers," arXiv preprint arXiv:2406.05370, 2024.
- [30] E. Casanova, K. Davis, E. Gölge, G. Göknar, I. Gulea, L. Hart, A. Aljafari, J. Meyer, R. Morais, S. Olayemi, and J. Weber, "XTTS: A massively multilingual zero-shot text-to-speech model," in Interspeech 2024, pp. 4978–4982.
- [31] C.-J. Hsu et al., "BreezyVoice: Adapting TTS for Taiwanese Mandarin with enhanced polyphone disambiguation – challenges and insights," arXiv preprint arXiv:2501.17790, 2025.
- [32] C.-J. Hsu, C.-S. Liu, M.-H. Chen, M. Chen, P.-C. Hsu, Y.-C. Chen, and D.-S. Shiu, "The Breeze 2 herd of models: Traditional Chinese LLMs based on Llama with vision-aware and function-calling capabilities," arXiv preprint arXiv:2501.13921, 2025.
- [33] C.-K. Yang et al., "Building a Taiwanese Mandarin spoken language model: A first attempt," arXiv preprint arXiv:2411.07111, 2024.
- [34] C.-Y. Hsiao et al., "Analyzing mitigation strategies for catastrophic forgetting in end-to-end training of spoken language models," in Interspeech 2025, pp. 3234–3238.
- [35] K.-H. Lu, Z. Chen, S.-W. Fu, H. Huang, B. Ginsburg, Y.-C. F. Wang, and H.-y. Lee, "DeSTA: Enhancing speech language models through descriptive speech-text alignment," in Interspeech 2024, pp. 4159–4163.
- [36] K.-H. Lu et al., "Developing instruction-following speech language model without speech instruction-tuning data," in ICASSP 2025, pp. 1–5.
- [37] K.-H. Lu, Z. Chen, S.-W. Fu, C.-H. H. Yang, J. Balam, B. Ginsburg, Y.-C. F. Wang, and H.-Y. Lee, "DeSTA2.5-Audio: Toward general-purpose large audio language model with self-generated cross-modal alignment," arXiv preprint arXiv:2507.02768, 2025.
- [38] Y.-X. Lin et al., "A preliminary exploration with GPT-4o voice mode," arXiv preprint arXiv:2502.09940, 2025.
- [39] T.-w. Hsu et al., "Reducing object hallucination in large audio-language models via audio-aware decoding," arXiv preprint arXiv:2506.07233, 2025.
- [40] C.-Y. Kuan, C.-K. Yang, W.-P. Huang, K.-H. Lu, and H.-y. Lee, "Speech-Copilot: Leveraging large language models for speech processing via task decomposition, modularization, and program generation," in IEEE SLT 2024, pp. 1060–1067.
- [41] C.-H. Chiang, X. Wang, L. Li, C.-C. Lin, K. Lin, S. Liu, Z. Wang, Z. Yang, H.-y. Lee, and L. Wang, "STITCH: Simultaneous thinking and talking with chunked reasoning for spoken language models," arXiv preprint arXiv:2507.15375, 2025.
- [42] S. Arora et al., "On the landscape of spoken language models: A comprehensive survey," arXiv preprint arXiv:2504.08528, 2025.
- [43] K.-W. Chang, H. Wu, Y.-K. Wang, Y.-K. Wu, H. Shen, W.-C. Tseng, I.-t. Kang, S.-W. Li, and H.-y. Lee, "SpeechPrompt: Prompting speech language models for speech processing tasks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024.
- [44] K.-W. Chang, W.-C. Tseng, S.-W. Li, and H.-y. Lee, "SpeechPrompt: An exploration of prompt tuning on generative spoken language model for speech processing tasks," arXiv preprint arXiv:2203.16773, 2022.
- [45] K.-W. Chang, Y.-K. Wang, H. Shen, I.-t. Kang, W.-C. Tseng, S.-W. Li, and H.-y. Lee, "SpeechPrompt v2: Prompt tuning for speech classification tasks," arXiv preprint arXiv:2303.00733, 2023.
- [46] Y. Chu, J. Xu, Q. Yang, H. Wei, X. Wei, Z. Guo, Y. Leng, Y. Lv, J. He, J. Lin et al., "Qwen2-Audio technical report," arXiv preprint arXiv:2407.10759, 2024.
- [47] L.-H. Tseng, Y.-C. Chen, K.-Y. Lee, D.-S. Shiu, and H.-y. Lee, "TASTE: Text-aligned speech tokenization and embedding for spoken language modeling," 2026. https://arxiv.org/abs/2504.07053
- [48] C.-y. Huang et al., "Dynamic-SUPERB Phase-2: A collaboratively expanding benchmark for measuring the capabilities of spoken language models with 180 tasks," in The Thirteenth International Conference on Learning Representations, 2025. https://openreview.net/forum?id=s7lzZpAW7T
- [49] C.-y. Huang et al., "Dynamic-SUPERB: Towards a dynamic, collaborative, and comprehensive instruction-tuning benchmark for speech," in ICASSP 2024, pp. 12136–12140.
- [50] Z. Ma, Y. Song, C. Du, J. Cong, Z. Chen, Y. Wang, Y. Wang, and X. Chen, "Language model can listen while speaking," in Proceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 23, 2025, pp. 24831–24839.
- [51] K. Hu, E. Hosseini-Asl, C. Chen, E. Casanova, S. Ghosh, P. Żelasko, Z. Chen, J. Li, J. Balam, and B. Ginsburg, "Efficient and direct duplex modeling for speech-to-speech language model," in Interspeech 2025, pp. 2715–2719.
- [52] R. Roy, J. Raiman, S.-g. Lee, T.-D. Ene, R. Kirby, S. Kim, J. Kim, and B. Catanzaro, "PersonaPlex: Voice and role control for full duplex conversational speech models," 2026. https://arxiv.org/abs/2602.06053
- [53] A. Défossez, L. Mazaré, M. Orsini, A. Royer, P. Pérez, H. Jégou, E. Grave, and N. Zeghidour, "Moshi: A speech-text foundation model for real-time dialogue," arXiv preprint arXiv:2410.00037, 2024.
- [54] G.-T. Lin, S.-Y. S. Kuan, J. Shi, K.-W. Chang, S. Arora, S. Watanabe, and H.-y. Lee, "Full-Duplex-Bench v2: A multi-turn evaluation framework for duplex dialogue systems with an automated examiner," 2025. https://arxiv.org/abs/2510.07838
- [55] C.-K. Yang et al., "Towards holistic evaluation of large audio-language models: A comprehensive survey," in Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, 2025, pp. 10144–10170.
- [56] K.-W. Chang et al., "Game-Time: Evaluating temporal dynamics in spoken language models," 2025. https://arxiv.org/abs/2509.26388
- [57] G.-T. Lin, S.-Y. S. Kuan, Q. Wang, J. Lian, T. Li, S. Watanabe, and H.-y. Lee, "Full-Duplex-Bench v1.5: Evaluating overlap handling for full-duplex speech models," 2026. https://arxiv.org/abs/2507.23159
- [58] A. Wu et al., "Aligning spoken dialogue models from user interactions," in International Conference on Machine Learning, PMLR, 2025, pp. 67476–67498.
- [59] G.-T. Lin, P. G. Shivakumar, A. Gourav, Y. Gu, A. Gandhe, H.-y. Lee, and I. Bulyko, "Align-SLM: Textless spoken language models with reinforcement learning from AI feedback," in Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), 2025.
- [60] C. Chen, K. Hu, C.-H. H. Yang, A. Pasad, E. Casanova, W. Wang, S.-W. Fu, J. Li, Z. Chen, J. Balam et al., "Reinforcement learning enhanced full-duplex spoken dialogue language models for conversational interactions," in Second Conference on Language Modeling, 2025.
- [61] S. Arora, J. Tian, J. Shi, H. Futami, Y. Kashiwagi, E. Tsunoo, and S. Watanabe, "Optimizing conversational quality in spoken dialogue systems with reinforcement learning from AI feedback," arXiv preprint arXiv:2601.19063, 2026.
- [62] Z. Shao et al., "DeepSeekMath: Pushing the limits of mathematical reasoning in open language models," arXiv preprint arXiv:2402.03300, 2024.
- [63] G.-T. Lin et al., "Full-Duplex-Bench: A benchmark to evaluate full-duplex spoken dialogue models on turn-taking capabilities." https://arxiv.org/abs/2503.04721
- [65] E. J. Hu, Y. Shen, P. Wallis, Z. Allen-Zhu, Y. Li, S. Wang, L. Wang, and W. Chen, "LoRA: Low-rank adaptation of large language models," in International Conference on Learning Representations, 2022. https://openreview.net/forum?id=nZeVKeeFYf9
- [66] S. Welleck, I. Kulikov, S. Roller, E. Dinan, K. Cho, and J. Weston, "Neural text generation with unlikelihood training," in International Conference on Learning Representations, 2020. https://openreview.net/forum?id=SJeYe0NtvH
- [68] Y. Zhu, S. Lu, L. Zheng, J. Guo, W. Zhang, J. Wang, and Y. Yu, "Texygen: A benchmarking platform for text generation models," SIGIR, 2018.