pith. machine review for the scientific record.

arxiv: 2507.16632 · v3 · submitted 2025-07-22 · 💻 cs.CL · cs.SD · eess.AS

Recognition: no theorem link

Step-Audio 2 Technical Report

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 05:55 UTC · model grok-4.3

classification 💻 cs.CL · cs.SD · eess.AS
keywords multi-modal LLM · audio understanding · speech conversation · discrete audio tokens · reinforcement learning · retrieval-augmented generation · end-to-end model

The pith

Step-Audio 2 integrates latent audio encoding and discrete token generation to deliver state-of-the-art audio understanding and expressive end-to-end speech conversation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Step-Audio 2 as an end-to-end multi-modal large language model built for industry-strength audio understanding and natural speech interaction. It combines a latent audio encoder with reasoning-centric reinforcement learning to strengthen automatic speech recognition and general audio comprehension. Discrete audio tokens are generated directly inside the language model, so responses can carry speaking styles and emotions. Retrieval-augmented generation and external tool calls let the model use web search to curb hallucinations and audio search to switch timbres. Trained on millions of hours of speech and audio data, the model reports state-of-the-art results on audio understanding and conversational benchmarks against both open-source and commercial alternatives.
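
A minimal sketch, under assumptions, of the turn-level loop this description implies: audio enters through a latent encoder, the language model emits a mixed stream of text and discrete audio tokens, and a sentinel token triggers a retrieval or audio-search tool call. None of the names below (encode_audio, lm_step, retrieve, the id constants) come from the released Step-Audio 2 code; they are illustrative stand-ins, not the model's API.

    from dataclasses import dataclass, field

    AUDIO_TOKEN_OFFSET = 100_000   # ids at or above this are treated as discrete audio tokens (assumed layout)
    TOOL_CALL_ID = 99_999          # sentinel id that triggers a tool call (assumed)
    EOS_ID = 0                     # end-of-turn marker (assumed)

    @dataclass
    class Turn:
        text_tokens: list = field(default_factory=list)
        audio_tokens: list = field(default_factory=list)

    def encode_audio(waveform):
        """Stand-in for the latent audio encoder: waveform -> continuous features."""
        return [sum(waveform) / max(len(waveform), 1)]  # placeholder single feature

    def lm_step(context):
        """Stand-in for one autoregressive step of the multi-modal LM."""
        return EOS_ID  # a real model would return the next token id

    def retrieve(query_tokens):
        """Stand-in for the RAG / web-search or audio-search tool."""
        return [42]  # placeholder retrieved context tokens

    def run_turn(user_waveform, max_steps=64):
        context = encode_audio(user_waveform)  # audio enters as latent features
        turn = Turn()
        for _ in range(max_steps):
            tok = lm_step(context)
            if tok == EOS_ID:
                break
            if tok == TOOL_CALL_ID:            # model asked for external knowledge or a timbre
                context += retrieve(turn.text_tokens)
                continue
            if tok >= AUDIO_TOKEN_OFFSET:      # discrete audio token, later rendered by a vocoder
                turn.audio_tokens.append(tok - AUDIO_TOKEN_OFFSET)
            else:                              # ordinary text token
                turn.text_tokens.append(tok)
            context.append(tok)
        return turn

    if __name__ == "__main__":
        print(run_turn([0.1, -0.2, 0.05]))

The shape matters more than the stubs: retrieved context and generated audio tokens flow through the same sequence the language model decodes from, which is what makes the pipeline end-to-end rather than a chain of separate modules.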

Core claim

Step-Audio 2 demonstrates that an integrated architecture combining latent audio encoding, reasoning-centric reinforcement learning, discrete audio token generation within language modeling, and retrieval-augmented generation yields stronger automatic speech recognition, richer audio understanding, and more expressive conversational output than prior systems assembled from separate components.

What carries the argument

Discrete audio token generation embedded in the language modeling process, which enables direct responsiveness to paralinguistic cues such as emotion and style.
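
One way to make this mechanism concrete: if the discrete audio tokens share an output vocabulary with text tokens, a single next-token objective covers both, so the style and emotion carried by the audio tokens are optimized by the same loss rather than by a separate prediction head. Below is a toy sketch of that shared-vocabulary loss; the vocabulary sizes, token ids, and uniform placeholder logits are assumptions for illustration only, not the paper's configuration.

    import math

    # Toy sizes, assumed for illustration; not the paper's configuration.
    TEXT_VOCAB = 5_000
    AUDIO_CODEBOOK = 1_000
    VOCAB = TEXT_VOCAB + AUDIO_CODEBOOK  # one shared output space for text and audio tokens

    def next_token_nll(logits, target_id):
        """Standard next-token negative log-likelihood over the shared vocabulary."""
        m = max(logits)
        log_z = m + math.log(sum(math.exp(x - m) for x in logits))
        return log_z - logits[target_id]

    # A mixed target sequence: text ids (< TEXT_VOCAB) interleaved with audio-codec ids
    # (>= TEXT_VOCAB). Both kinds fall under the same loss, so the LM is rewarded
    # directly for audio tokens that match the reference style or emotion.
    targets = [17, 523, TEXT_VOCAB + 12, TEXT_VOCAB + 907, 9]
    logits_per_step = [[0.0] * VOCAB for _ in targets]  # placeholder model outputs

    loss = sum(next_token_nll(l, t) for l, t in zip(logits_per_step, targets)) / len(targets)
    print(f"mean NLL over mixed text+audio targets: {loss:.3f}")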

If this is right

  • Direct modeling of paralinguistic information reduces the need for separate emotion or style modules in conversational agents
  • Tool calling for web search and audio retrieval measurably lowers hallucination rates in spoken responses
  • End-to-end discrete token output supports lower-latency turn-taking in multi-turn dialogue
  • Scaling to millions of hours of training data yields consistent gains across diverse conversational domains

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same token-generation approach could be applied to video or sensor streams to create unified multi-modal conversational systems
  • External tool integration may allow future models to maintain up-to-date knowledge without full retraining
  • If the RL component generalizes well, similar reasoning-centric training could improve robustness in low-resource languages or noisy environments

Load-bearing premise

The combination of latent encoding, RL reasoning, discrete tokens, and RAG produces robust performance on real-world conversational audio beyond the evaluated benchmarks.

What would settle it

A new test set of long-form conversational audio with varied emotions and accents where Step-Audio 2 shows no accuracy or naturalness advantage over strong baseline models.

read the original abstract

This paper presents Step-Audio 2, an end-to-end multi-modal large language model designed for industry-strength audio understanding and speech conversation. By integrating a latent audio encoder and reasoning-centric reinforcement learning (RL), Step-Audio 2 achieves promising performance in automatic speech recognition (ASR) and audio understanding. To facilitate genuine end-to-end speech conversation, Step-Audio 2 incorporates the generation of discrete audio tokens into language modeling, significantly enhancing its responsiveness to paralinguistic information such as speaking styles and emotions. To effectively leverage the rich textual and acoustic knowledge in real-world data, Step-Audio 2 integrates retrieval-augmented generation (RAG) and is able to call external tools such as web search to mitigate hallucination and audio search to switch timbres. Trained on millions of hours of speech and audio data, Step-Audio 2 delivers intelligence and expressiveness across diverse conversational scenarios. Evaluation results demonstrate that Step-Audio 2 achieves state-of-the-art performance on various audio understanding and conversational benchmarks compared to other open-source and commercial solutions. Please visit https://github.com/stepfun-ai/Step-Audio2 for more information.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it: the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Step-Audio 2, an end-to-end multi-modal LLM for audio understanding and conversational speech. It combines a latent audio encoder, reasoning-centric reinforcement learning, discrete audio token generation to capture paralinguistic cues, and RAG with external tool calling (web search, audio search) to reduce hallucinations. Trained on millions of hours of speech and audio data, the model claims state-of-the-art results on various audio understanding and conversational benchmarks relative to open-source and commercial baselines.

Significance. If the performance claims are substantiated with complete benchmark tables, baselines, and ablations, the work would advance practical end-to-end audio LLMs by showing how RL-driven reasoning and RAG can be integrated with discrete token modeling for expressive, low-hallucination conversation. The industry-oriented framing and emphasis on real-world tool use are strengths.

major comments (2)
  1. [Evaluation] Evaluation section: the SOTA claim is presented without named benchmarks (e.g., no LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. This absence makes the central empirical result unverifiable and load-bearing for the paper's contribution.
  2. [§3 (Architecture) and §4 (Training)] §3 (Architecture) and §4 (Training): the claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.
minor comments (2)
  1. [Abstract] Abstract: phrases such as 'promising performance' and 'state-of-the-art' are used without accompanying metrics or qualifiers.
  2. [Conclusion] The GitHub link is provided but no details on released code, checkpoints, or evaluation scripts are given in the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and will revise the manuscript accordingly to improve clarity and verifiability of the results.

read point-by-point responses
  1. Referee: [Evaluation] Evaluation section: the SOTA claim is presented without named benchmarks (e.g., no LibriSpeech, CommonVoice, or conversational test sets), exact baseline versions, metric definitions, data splits, error bars, or ablation results on the RL or RAG components. This absence makes the central empirical result unverifiable and load-bearing for the paper's contribution.

    Authors: We acknowledge that the current evaluation section provides only high-level SOTA claims without the requested specifics. In the revised manuscript we will expand this section to explicitly name the benchmarks (including LibriSpeech, CommonVoice, and conversational test sets), list exact baseline versions, define all metrics, specify data splits, report error bars where available, and include ablation results isolating the RL and RAG components. revision: yes

  2. Referee: [§3 (Architecture) and §4 (Training)] §3 (Architecture) and §4 (Training): the claim that discrete token generation 'significantly enhances responsiveness to paralinguistic information' is stated without quantitative comparison to a continuous-token or non-RL baseline, leaving the contribution of this design choice unsupported by evidence.

    Authors: We agree that the claim would be stronger with direct quantitative support. The revised version will add comparisons to continuous-token and non-RL baselines, reporting relevant metrics that demonstrate the contribution of discrete token generation to paralinguistic responsiveness. revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper is an empirical technical report on model architecture, training data, and benchmark results with no mathematical derivations, equations, or self-referential definitions present. Performance claims reference external benchmarks and datasets rather than quantities defined or fitted inside the paper. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work appear in the provided text. The central claims reduce to standard empirical evaluation and are therefore self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract describes an empirically trained neural network without listing explicit axioms, free parameters, or newly invented theoretical entities; performance claims rest on large-scale data training and benchmark results.

pith-pipeline@v0.9.0 · 5915 in / 1048 out tokens · 29325 ms · 2026-05-16T05:55:34.925173+00:00 · methodology

discussion (0)


Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models

    cs.SD 2026-04 unverdicted novelty 8.0

    HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...

  2. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 7.0

    Visual debiasing of omni-modal benchmarks combined with staged post-training lets a 3B model match or exceed a 30B model without a stronger teacher.

  3. How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue

    cs.CL 2026-05 unverdicted novelty 7.0

    Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...

  4. SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.

  5. HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models

    eess.AS 2026-04 unverdicted novelty 7.0

    HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...

  6. CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation

    cs.SD 2026-04 unverdicted novelty 7.0

    CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.

  7. Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR

    eess.AS 2026-04 unverdicted novelty 7.0

    Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...

  8. TiCo: Time-Controllable Spoken Dialogue Model

    cs.CL 2026-03 unverdicted novelty 7.0

    TiCo enables spoken dialogue models to follow explicit time constraints in generated responses using Spoken Time Markers and reinforcement learning with verifiable rewards, cutting duration error by 2.7x over its backbone.

  9. Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model

    eess.AS 2026-05 unverdicted novelty 6.0

    A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.

  10. Boosting Omni-Modal Language Models: Staged Post-Training with Visually Debiased Evaluation

    cs.MM 2026-05 unverdicted novelty 6.0

    Staged post-training with self-distillation lets a 3B omni-modal model match or slightly exceed a 30B model on a visually debiased benchmark.

  11. VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models

    cs.SD 2026-05 unverdicted novelty 6.0

    VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.

  12. VIBE: Voice-Induced open-ended Bias Evaluation for Large Audio-Language Models via Real-World Speech

    eess.AS 2026-04 unverdicted novelty 6.0

    VIBE evaluates generative biases in large audio-language models with real-world speech and open-ended tasks, showing that gender cues produce larger distributional shifts than accent cues across 11 tested models.

  13. Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use

    cs.SD 2026-04 unverdicted novelty 6.0

    Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.

  14. Why Your Tokenizer Fails in Information Fusion: A Timing-Aware Pre-Quantization Fusion for Video-Enhanced Audio Tokenization

    eess.AS 2026-04 unverdicted novelty 6.0

    A timing-aware pre-quantization fusion approach integrates visual cues into audio tokenizers along the temporal axis, maintaining reconstruction quality while outperforming audio-only and prior multimodal baselines on...

  15. Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs

    eess.AS 2026-04 unverdicted novelty 6.0

    A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.

  16. Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models

    cs.SD 2026-04 unverdicted novelty 5.0

    A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.

  17. Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models

    eess.AS 2026-04 unverdicted novelty 5.0

    Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

  18. Step-Audio-R1.5 Technical Report

    eess.AS 2026-04 unverdicted novelty 4.0

    Step-Audio-R1.5 applies RLHF to audio reasoning models to maintain analytical performance while improving prosodic naturalness and immersion in extended spoken interactions.

  19. NIM4-ASR: Towards Efficient, Robust, and Customizable Real-Time LLM-Based ASR

    eess.AS 2026-04 unverdicted novelty 4.0

    NIM4-ASR delivers SOTA ASR performance on public benchmarks using a 2.3B-parameter LLM with multi-stage training, real-time streaming, and million-scale hotword customization via RAG.

  20. OmniFysics: Towards Physical Intelligence Evolution via Omni-Modal Signal Processing and Network Optimization

    cs.CV 2026-02 unverdicted novelty 4.0

    OmniFysics is an omni-modal network using a dynamic physical data engine and evolutive tuning to improve performance on multimodal benchmarks and physics-oriented tasks.

Reference graph

Works this paper leans on

84 extracted references · 84 canonical work pages · cited by 19 Pith papers · 22 internal anchors

  1. [1]

    Seed-TTS: A Family of High-Quality Versatile Speech Generation Models

    Philip Anastassiou et al. “Seed-tts: A family of high-quality versatile speech generation models”. In: arXiv preprint arXiv:2406.02430 (2024)

  2. [2]

    PaLM 2 Technical Report

    Rohan Anil et al. PaLM 2 Technical Report. 2023. arXiv: 2305.10403 [cs.CL]. URL: https://arxiv.org/abs/2305.10403

  3. [3]

    wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations

    Alexei Baevski et al. wav2vec 2.0: A Framework for Self-Supervised Learning of Speech Representations. 2020. arXiv: 2006.11477 [cs.CL]. URL: https://arxiv.org/abs/2006.11477

  4. [4]

    Qwen Technical Report

    Jinze Bai et al. Qwen Technical Report. 2023. arXiv: 2309.16609 [cs.CL]. URL: https://arxiv.org/abs/2309.16609

  5. [5]

    Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition

    Ye Bai et al. “Seed-asr: Understanding diverse speech and contexts with llm-based speech recognition”. In: arXiv preprint arXiv:2407.04675 (2024)

  6. [6]

    Better speech synthesis through scaling

    James Betker. Better speech synthesis through scaling. 2023. arXiv: 2305.07243 [cs.SD]. URL: https://arxiv.org/abs/2305.07243

  7. [7]

    Audiolm: a language modeling approach to audio generation

    Zalán Borsos et al. “Audiolm: a language modeling approach to audio generation”. In: IEEE/ACM transactions on audio, speech, and language processing 31 (2023), pp. 2523–2533

  8. [8]

    GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio

    Guoguo Chen et al. “GigaSpeech: An Evolving, Multi-Domain ASR Corpus with 10,000 Hours of Transcribed Audio”. In: Interspeech 2021. ISCA, Aug. 2021. DOI: 10.21437/interspeech.2021-1965. URL: http://dx.doi.org/10.21437/Interspeech.2021-1965

  9. [9]

    Minmo: A multimodal large language model for seamless voice interaction

    Qian Chen et al. “Minmo: A multimodal large language model for seamless voice interaction”. In: arXiv preprint arXiv:2501.06282 (2025)

  10. [10]

    BEATs: Audio Pre-Training with Acoustic Tokenizers

    Sanyuan Chen et al. BEATs: Audio Pre-Training with Acoustic Tokenizers. 2022. arXiv: 2212.09058 [eess.AS]. URL: https://arxiv.org/abs/2212.09058

  11. [11]

    WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing

    Sanyuan Chen et al. “WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing”. In: IEEE Journal of Selected Topics in Signal Processing 16.6 (Oct. 2022), pp. 1505–1518. ISSN: 1941-0484. DOI: 10.1109/jstsp.2022.3188113. URL: http://dx.doi.org/10.1109/JSTSP.2022.3188113

  12. [12]

    Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

    Yunfei Chu et al. “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models”. In: arXiv preprint arXiv:2311.07919 (2023)

  13. [13]

    Qwen2-Audio Technical Report

    Yunfei Chu et al. “Qwen2-audio technical report”. In: arXiv preprint arXiv:2407.10759 (2024)

  14. [14]

    Simple and Controllable Music Generation

    Jade Copet et al. Simple and Controllable Music Generation. 2024. arXiv: 2306.05284 [cs.SD]. URL: https://arxiv.org/abs/2306.05284

  15. [15]

    High Fidelity Neural Audio Compression

    Alexandre Défossez et al. “High fidelity neural audio compression”. In: arXiv preprint arXiv:2210.13438 (2022)

  16. [16]

    Moshi: a speech-text foundation model for real-time dialogue

    Alexandre Défossez et al. “Moshi: a speech-text foundation model for real-time dialogue”. In: arXiv preprint arXiv:2410.00037 (2024)

  17. [17]

    Pengi: An audio language model for audio tasks

    Soham Deshmukh et al. “Pengi: An audio language model for audio tasks”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 18090–18108

  18. [18]

    Kimi-Audio Technical Report

    Ding Ding et al. “Kimi-audio technical report”. In: arXiv preprint arXiv:2504.18425 (2025)

  19. [19]

    CosyVoice 2: Scalable Streaming Speech Synthesis with Large Language Models

    Zhihao Du et al. “Cosyvoice 2: Scalable streaming speech synthesis with large language models”. In: arXiv preprint arXiv:2412.10117 (2024)

  20. [20]

    CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens

    Zhihao Du et al. CosyVoice: A Scalable Multilingual Zero-shot Text-to-speech Synthesizer based on Supervised Semantic Tokens. 2024. arXiv: 2407.05407 [cs.SD]. URL: https://arxiv.org/abs/2407.05407

  21. [21]

    Llama-omni: Seamless speech interaction with large language models

    Qingkai Fang et al. “Llama-omni: Seamless speech interaction with large language models”. In: arXiv preprint arXiv:2409.06666 (2024)

  22. [22]

    LUCY: Linguistic Understanding and Control Yielding Early Stage of Her

    Heting Gao et al. LUCY: Linguistic Understanding and Control Yielding Early Stage of Her. 2025. arXiv: 2501.16327 [cs.CL]. URL: https://arxiv.org/abs/2501.16327

  23. [23]

    Audio Set: An ontology and human-labeled dataset for audio events

    Jort F. Gemmeke et al. “Audio Set: An ontology and human-labeled dataset for audio events”. In: 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2017, pp. 776–780. DOI: 10.1109/ICASSP.2017.7952261

  24. [24]

    Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities

    Sreyan Ghosh et al. Audio Flamingo 2: An Audio-Language Model with Long-Audio Understanding and Expert Reasoning Abilities. 2025. arXiv: 2503.03983 [cs.SD]. URL: https://arxiv.org/abs/2503.03983

  25. [25]

    Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models

    Arushi Goel et al. Audio Flamingo 3: Advancing Audio Intelligence with Fully Open Large Audio Language Models. 2025. arXiv: 2507.08128 [cs.SD]. URL: https://arxiv.org/abs/2507.08128

  26. [26]

    Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition

    Yuan Gong, Jin Yu, and James Glass. “Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022, pp. 151–155. DOI: 10.1109/ICASSP43922.2022.9746828

  27. [27]

    Joint audio and speech understanding

    Yuan Gong et al. “Joint audio and speech understanding”. In: 2023 IEEE Automatic Speech Recognition and Understanding Workshop (ASRU). IEEE. 2023, pp. 1–8

  28. [28]

    Listen, think, and understand

    Yuan Gong et al. “Listen, think, and understand”. In: arXiv preprint arXiv:2305.10790 (2023)

  29. [29]

    The Llama 3 Herd of Models

    Aaron Grattafiori et al. The Llama 3 Herd of Models. 2024. arXiv: 2407.21783 [cs.AI]. URL: https://arxiv.org/abs/2407.21783

  30. [30]

    HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units

    Wei-Ning Hsu et al. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units. 2021. arXiv: 2106.07447 [cs.CL]. URL: https://arxiv.org/abs/2106.07447

  31. [31]

    Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model

    Ailin Huang et al. “Step-Audio-AQAA: a Fully End-to-End Expressive Large Audio Language Model”. In: arXiv preprint arXiv:2506.08967 (2025)

  32. [32]

    Step-audio: Unified understanding and generation in intelligent speech interaction

    Ailin Huang et al. “Step-audio: Unified understanding and generation in intelligent speech interaction”. In: arXiv preprint arXiv:2502.11946 (2025)

  33. [33]

    Audiogpt: Understanding and generating speech, music, sound, and talking head

    Rongjie Huang et al. “Audiogpt: Understanding and generating speech, music, sound, and talking head”. In: Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. 21. 2024, pp. 23802–23804

  34. [34]

    GPT-4o System Card

    Aaron Hurst et al. “Gpt-4o system card”. In: arXiv preprint arXiv:2410.21276 (2024)

  35. [35]

    CochlScene: Acquisition of acoustic scene data using crowdsourcing

    Il-Young Jeong and Jeongsoo Park. “CochlScene: Acquisition of acoustic scene data using crowdsourcing”. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). 2022, pp. 17–21. DOI: 10.23919/APSIPAASC55919.2022.9979822

  36. [36]

    Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling

    Shengpeng Ji et al. “Wavtokenizer: an efficient acoustic discrete codec tokenizer for audio language modeling”. In: arXiv preprint arXiv:2408.16532 (2024)

  37. [37]

    CVSS Corpus and Massively Multilingual Speech-to-Speech Translation

    Ye Jia et al. CVSS Corpus and Massively Multilingual Speech-to-Speech Translation. 2022. arXiv: 2201.03713 [cs.CL]. URL: https://arxiv.org/abs/2201.03713

  38. [38]

    Direct speech-to-speech translation with a sequence-to-sequence model

    Ye Jia et al. “Direct speech-to-speech translation with a sequence-to-sequence model”. In: arXiv preprint arXiv:1904.06037 (2019)

  39. [39]

    Translatotron 2: High-quality direct speech-to-speech translation with voice preservation

    Ye Jia et al. “Translatotron 2: High-quality direct speech-to-speech translation with voice preservation”. In: International conference on machine learning. PMLR. 2022, pp. 10120–10134

  40. [40]

    Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision

    Eugene Kharitonov et al. Speak, Read and Prompt: High-Fidelity Text-to-Speech with Minimal Supervision. 2023. arXiv: 2302.03540 [cs.SD]. URL: https://arxiv.org/abs/2302.03540

  41. [41]

    Audiocaps: Generating captions for audios in the wild

    Chris Dongjoo Kim et al. “Audiocaps: Generating captions for audios in the wild”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 119–132

  42. [42]

    HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis

    Jungil Kong, Jaehyeon Kim, and Jaekyoung Bae. HiFi-GAN: Generative Adversarial Networks for Efficient and High Fidelity Speech Synthesis. 2020. arXiv: 2010.05646 [cs.SD]. URL: https://arxiv.org/abs/2010.05646

  43. [43]

    Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities

    Zhifeng Kong et al. Audio Flamingo: A Novel Audio Language Model with Few-Shot Learning and Dialogue Abilities. 2024. arXiv: 2402.01831 [cs.SD]. URL: https://arxiv.org/abs/2402.01831

  44. [44]

    Transvip: Speech to speech translation system with voice and isochrony preservation

    Chenyang Le et al. “Transvip: Speech to speech translation system with voice and isochrony preservation”. In: Advances in Neural Information Processing Systems 37 (2024), pp. 89682–89705

  45. [45]

    Textless speech-to-speech translation on real data

    Ann Lee et al. “Textless speech-to-speech translation on real data”. In: arXiv preprint arXiv:2112.08352 (2021)

  46. [46]

    BigVGAN: A Universal Neural Vocoder with Large-Scale Training

    Sang-gil Lee et al. BigVGAN: A Universal Neural Vocoder with Large-Scale Training. 2023. arXiv: 2206.04658 [cs.SD]. URL: https://arxiv.org/abs/2206.04658

  47. [47]

    Advancing large language models to capture varied speaking styles and respond properly in spoken conversations

    Guan-Ting Lin, Cheng-Han Chiang, and Hung-yi Lee. “Advancing large language models to capture varied speaking styles and respond properly in spoken conversations”. In: arXiv preprint arXiv:2402.12786 (2024)

  48. [48]

    Paralinguistics-enhanced large language modeling of spoken dialogue

    Guan-Ting Lin et al. “Paralinguistics-enhanced large language modeling of spoken dialogue”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 10316–10320

  49. [49]

    Spirit LM: Interleaved Spoken and Written Language Model

    Tu Anh Nguyen et al. Spirit LM: Interleaved Spoken and Written Language Model. 2024. arXiv: 2402.05755 [cs.CL]. URL: https://arxiv.org/abs/2402.05755

  50. [50]

    GPT-4 Technical Report

    OpenAI. GPT-4 Technical Report. https://openai.com/research/gpt-4. Accessed: 2025-07-11. 2023

  51. [51]

    Introducing ChatGPT

    OpenAI. Introducing ChatGPT. Accessed: 2025-07-11. 2022. URL: https://openai.com/blog/chatgpt

  52. [52]

    Deep voice 3: 2000-speaker neural text-to-speech

    Wei Ping et al. “Deep voice 3: 2000-speaker neural text-to-speech”. In: proc. ICLR. Vol. 79. 2018, pp. 1094–1099

  53. [53]

    Robust speech recognition via large-scale weak supervision

    Alec Radford et al. “Robust speech recognition via large-scale weak supervision”. In: International conference on machine learning. PMLR. 2023, pp. 28492–28518

  54. [54]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov et al. “Direct preference optimization: Your language model is secretly a reward model”. In: Advances in Neural Information Processing Systems 36 (2023), pp. 53728–53741

  55. [55]

    Fastspeech 2: Fast and high-quality end-to-end text to speech

    Yi Ren et al. “Fastspeech 2: Fast and high-quality end-to-end text to speech”. In: arXiv preprint arXiv:2006.04558 (2020)

  56. [56]

    Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?

    Andrew Rouditchenko et al. “Omni-R1: Do You Really Need Audio to Fine-Tune Your Audio LLM?” In: arXiv preprint arXiv:2505.09439 (2025)

  57. [57]

    AudioPaLM: A Large Language Model That Can Speak and Listen

    Paul K Rubenstein et al. “Audiopalm: A large language model that can speak and listen”. In: arXiv preprint arXiv:2306.12925 (2023)

  58. [58]

    MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

    S Sakshi et al. “Mmau: A massive multi-task audio understanding and reasoning benchmark”. In: arXiv preprint arXiv:2410.19168 (2024)

  59. [59]

    Natural tts synthesis by conditioning wavenet on mel spectrogram predictions

    Jonathan Shen et al. “Natural tts synthesis by conditioning wavenet on mel spectrogram predictions”. In: 2018 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2018, pp. 4779–4783

  60. [60]

    Snac: Multi-scale neural audio codec

    Hubert Siuzdak, Florian Grötschla, and Luca A Lanzendörfer. “Snac: Multi-scale neural audio codec”. In: arXiv preprint arXiv:2410.14411 (2024)

  61. [61]

    Salmonn: Towards generic hearing abilities for large language models

    Changli Tang et al. “Salmonn: Towards generic hearing abilities for large language models”. In: arXiv preprint arXiv:2310.13289 (2023)

  62. [62]

    Verbmobil: Foundations of Speech-to-Speech Translation

    Wolfgang Wahlster. Verbmobil: foundations of speech-to-speech translation. Springer Science & Business Media, 2013

  63. [63]

    CoVoST 2 and Massively Multilingual Speech-to-Text Translation

    Changhan Wang, Anne Wu, and Juan Pino. CoVoST 2 and Massively Multilingual Speech-to-Text Translation

  64. [64]

    URL: https://arxiv.org/abs/2007.10310

    arXiv: 2007.10310 [cs.CL]. URL: https://arxiv.org/abs/2007.10310

  65. [65]

    Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

    Chengyi Wang et al. “Neural codec language models are zero-shot text to speech synthesizers”. In: arXiv preprint arXiv:2301.02111 (2023)

  66. [66]

    Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens

    Xinsheng Wang et al. “Spark-tts: An efficient llm-based text-to-speech model with single-stream decoupled speech tokens”. In: arXiv preprint arXiv:2503.01710 (2025)

  67. [67]

    Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

    Xiong Wang et al. “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm”. In: arXiv preprint arXiv:2411.00774 (2024)

  68. [68]

    Maskgct: Zero-shot text-to-speech with masked generative codec transformer

    Yuancheng Wang et al. “Maskgct: Zero-shot text-to-speech with masked generative codec transformer”. In: arXiv preprint arXiv:2409.00750 (2024)

  69. [69]

    Tacotron: Towards End-to-End Speech Synthesis

    Yuxuan Wang et al. “Tacotron: Towards end-to-end speech synthesis”. In: arXiv preprint arXiv:1703.10135 (2017)

  70. [70]

    Finetuned Language Models Are Zero-Shot Learners

    Jason Wei et al. “Finetuned language models are zero-shot learners”. In: arXiv preprint arXiv:2109.01652 (2021)

  71. [71]

    Google's Neural Machine Translation System: Bridging the Gap between Human and Machine Translation

    Yonghui Wu et al. “Google’s neural machine translation system: Bridging the gap between human and machine translation”. In: arXiv preprint arXiv:1609.08144 (2016)

  72. [72]

    Mini-omni: Language models can hear, talk while thinking in streaming

    Zhifei Xie and Changqiao Wu. “Mini-omni: Language models can hear, talk while thinking in streaming”. In: arXiv preprint arXiv:2408.16725 (2024)

  73. [73]

    Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities

    Zhifei Xie and Changqiao Wu. “Mini-omni2: Towards open-source gpt-4o with vision, speech and duplex capabilities”. In: arXiv preprint arXiv:2410.11190 (2024)

  74. [74]

    Bigcodec: Pushing the limits of low-bitrate neural speech codec

    Detai Xin et al. “Bigcodec: Pushing the limits of low-bitrate neural speech codec”. In: arXiv preprint arXiv:2409.05377 (2024)

  75. [75]

    Qwen2.5-Omni Technical Report

    Jin Xu et al. Qwen2.5-Omni Technical Report. 2025. arXiv: 2503.20215 [cs.CL]. URL: https://arxiv.org/abs/2503.20215

  76. [76]

    URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models

    Ruiqi Yan et al. URO-Bench: A Comprehensive Benchmark for End-to-End Spoken Dialogue Models. 2025. arXiv: 2502.17810 [cs.CL]. URL: https://arxiv.org/abs/2502.17810

  77. [77]

    Soundstream: An end-to-end neural audio codec

    Neil Zeghidour et al. “Soundstream: An end-to-end neural audio codec”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), pp. 495–507

  78. [78]

    GLM-4-Voice: Towards Intelligent and Human-Like End-to-End Spoken Chatbot

    Aohan Zeng et al. “Glm-4-voice: Towards intelligent and human-like end-to-end spoken chatbot”. In: arXiv preprint arXiv:2412.02612 (2024)

  79. [79]

    WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

    Binbin Zhang et al. WenetSpeech: A 10000+ Hours Multi-domain Mandarin Corpus for Speech Recognition

  80. [80]

    URL: https://arxiv.org/abs/2110.03370

    arXiv: 2110.03370 [cs.SD]. URL: https://arxiv.org/abs/2110.03370

Showing first 80 references.