arXiv: 2504.18425 · v1 · submitted 2025-04-25 · 📡 eess.AS · cs.AI · cs.CL · cs.LG · cs.MM · cs.SD

Kimi-Audio Technical Report

Aoxiong Yin, Chu Wei, Ding Ding, Dongchao Yang, Guokun Lai, Hao Yang, Heyi Tang, Jianwei Yu, Jianzhou Wang, Jun Chen, Kai Shen, KimiTeam, Qingcheng Li, Ruibin Yuan, Songxiang Liu, Tong Liu, Weidong Sun, Weiran He, Wei Song, Xinran Xu, Xinyu Zhou, Xu Tan, Yangyang Liu, Yanru Chen, Y. Charles, Yichong Leng, Yifei Xin, Ying Yang, Yuefeng Wu, Yulun Du, Yutao Zhang, Yutong Zhang, Yuxin Wu, Yuzhi Wang, Zaida Zhou, Zeqian Ju, Zeyu Shang, Zhengtao Wang, Zhenxing Hu, Zhilin Yang

Pith reviewed 2026-05-11 19:15 UTC · model grok-4.3

keywords: audio foundation model · speech recognition · audio understanding · speech conversation · audio tokenizer · flow matching · large language model · audio generation

The pith

Kimi-Audio reaches state-of-the-art results on speech recognition, audio understanding, question answering, and conversation tasks through a unified architecture and massive pre-training data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Kimi-Audio as an open-source audio foundation model capable of handling understanding, generation, and conversation in one system. It describes building the model by starting from a pre-trained language model, then continually pre-training it on over 13 million hours of audio and text data using targeted tasks, followed by fine-tuning on curated examples. The authors report that this produces top benchmark scores across several audio tasks, which would matter to a sympathetic reader because it offers a single accessible system that could replace multiple separate tools for processing speech, sounds, and music. The work focuses on practical details like data handling and deployment to make the model usable in real applications.

Core claim

Kimi-Audio is initialized from a pre-trained LLM and continually pre-trained on both audio and text data with several carefully designed tasks before fine-tuning for diverse audio-related tasks. It employs a 12.5 Hz audio tokenizer, an LLM-based architecture that accepts continuous audio features as input and produces discrete tokens as output, and a chunk-wise streaming detokenizer based on flow matching. Supported by a pre-training dataset exceeding 13 million hours covering speech, sound, and music plus a pipeline for high-quality post-training data, the model achieves state-of-the-art performance on benchmarks for speech recognition, audio understanding, audio question answering, and speech conversation.

What carries the argument

The LLM-based architecture that takes continuous audio features as input and outputs discrete tokens, paired with a 12.5 Hz audio tokenizer and a flow-matching chunk-wise streaming detokenizer. This setup allows the language model backbone to directly process and generate audio content in a streaming manner after initialization and continued pre-training on mixed audio-text data.
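To make the 12.5 Hz figure concrete, the short sketch below works out how much audio one token covers and how many tokens a clip of a given length needs. Only the 12.5 Hz rate comes from the report; the function names and the choice to round a partial final frame up are assumptions made for illustration.

```python
import math
from fractions import Fraction

# 12.5 Hz is the tokenizer frame rate stated in the report.
TOKEN_RATE_HZ = Fraction(25, 2)


def token_duration_ms(rate_hz: Fraction = TOKEN_RATE_HZ) -> Fraction:
    """Milliseconds of audio covered by a single token."""
    return Fraction(1000) / rate_hz


def tokens_for_duration(seconds: float, rate_hz: Fraction = TOKEN_RATE_HZ) -> int:
    """Tokens needed to cover `seconds` of audio.

    Rounding a partial final frame up is an assumption of this sketch,
    not something the report specifies.
    """
    exact_seconds = Fraction(seconds).limit_denominator(1000)
    return math.ceil(exact_seconds * rate_hz)


if __name__ == "__main__":
    print(token_duration_ms())        # 80 (ms per token)
    print(tokens_for_duration(30))    # 375 tokens for a 30 s clip
```

At this rate a 30-second utterance becomes a sequence of a few hundred tokens, which is what keeps audio sequences short enough for a language model backbone to handle alongside text.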

If this is right

A single model can handle speech-to-text conversion, answering questions about audio content, and maintaining natural spoken conversations without switching between separate specialized systems.
The streaming detokenizer design supports real-time audio output suitable for interactive voice applications.
Open release of the model weights, training code, and evaluation tools enables direct reproduction and further development by anyone with sufficient compute resources.
Training on more than 13 million hours spanning speech, environmental sounds, and music allows the model to generalize across varied audio inputs rather than requiring domain-specific versions.
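The streaming point can be illustrated with a minimal sketch of chunk-wise decoding: tokens are grouped into fixed-size chunks, and each chunk is converted to audio as soon as it is complete. This is not the report's detokenizer. The chunk size, the 16 kHz sample rate, and `placeholder_detokenize` (which emits silence where a flow-matching model would run) are all hypothetical, and any look-ahead or overlap between chunks is omitted.

```python
from typing import Callable, Iterable, Iterator, List

TOKEN_RATE_HZ = 12.5      # tokenizer rate stated in the report
SAMPLE_RATE_HZ = 16_000   # hypothetical output sample rate
SAMPLES_PER_TOKEN = int(SAMPLE_RATE_HZ / TOKEN_RATE_HZ)  # 1280 under these assumptions


def stream_chunks(tokens: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Group a token stream into fixed-size chunks, flushing a final partial chunk."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk: List[int] = []
    for tok in tokens:
        chunk.append(tok)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:
        yield chunk


def placeholder_detokenize(chunk: List[int]) -> List[float]:
    """Stand-in for the real detokenizer: returns silence of the matching length."""
    return [0.0] * (len(chunk) * SAMPLES_PER_TOKEN)


def stream_detokenize(
    tokens: Iterable[int],
    chunk_size: int,
    detokenize_chunk: Callable[[List[int]], List[float]],
) -> Iterator[List[float]]:
    """Yield one audio segment per chunk, so playback can start before generation ends."""
    for chunk in stream_chunks(tokens, chunk_size):
        yield detokenize_chunk(chunk)
```

Because `stream_detokenize` is a generator, a caller can begin playing the first segment while later tokens are still being produced, which is the property that matters for interactive voice applications.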

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same pattern of starting from a text LLM and adding audio tokenization could be tested on other non-text modalities to create unified foundation models.
The emphasis on careful post-training data construction suggests that future audio models may benefit more from curation quality than from simply scaling data volume further.
Because the model supports both understanding and generation in one framework, it could simplify building end-to-end voice assistants that process incoming audio and respond directly in audio without intermediate text steps.

Load-bearing premise

The post-training data curation pipeline produces high-quality and diverse examples that support broad generalization, and the benchmark results reflect fair comparisons without data leakage or hidden tuning.

What would settle it

Independent evaluation on a new, publicly available audio benchmark that was not used in pre-training or post-training would settle the question: scores that match the claimed state-of-the-art results would confirm them, and scores that fall short would refute them.

Original abstract

We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continual pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the codes, model checkpoints, as well as the evaluation toolkits in https://github.com/MoonshotAI/Kimi-Audio.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Kimi-Audio is a usable open release of an audio LLM with a 12.5 Hz tokenizer and flow-matching detokenizer, but the SOTA claims need the full benchmark tables to hold up.

Kimi-Audio ships an open audio foundation model trained on more than 13 million hours of speech, sound, and music data. The concrete pieces are a 12.5 Hz tokenizer, an LLM backbone that takes continuous audio features and produces discrete tokens, and a chunk-wise flow-matching detokenizer for streaming output. They start from a pretrained LLM, do continual pretraining on mixed audio-text tasks, then fine-tune for understanding, generation, and conversation. The code, checkpoints, and evaluation scripts are all released on GitHub, which is the part that actually matters for downstream work.

Referee Report

1 major / 3 minor

Summary. The paper presents Kimi-Audio, an open-source audio foundation model for understanding, generation, and conversation. It describes a 12.5 Hz audio tokenizer, an LLM-based architecture taking continuous features as input and producing discrete tokens as output, and a chunk-wise streaming detokenizer using flow matching. The model is initialized from a pre-trained LLM, continually pre-trained on >13 million hours of audio (speech, sound, music) plus text data with designed tasks, then fine-tuned; post-training uses a high-quality diverse data pipeline. Extensive evaluation reports SOTA results on speech recognition, audio understanding, audio QA, and speech conversation benchmarks. Codes, checkpoints, and evaluation toolkits are released.

Significance. If the benchmark results hold under the reported protocols, the work supplies a strong, reproducible open-source audio foundation model trained at large scale, along with detailed training practices and artifacts. This can serve as a practical baseline for the community, lowering barriers for research on audio understanding and conversational systems while enabling independent verification.

major comments (1)

[Evaluation] Evaluation section: the SOTA claims rest on benchmark numbers, but the manuscript would be strengthened by explicit tables or appendices listing all baselines with their reported scores, the exact evaluation protocols (including any preprocessing or prompting details), and error bars or multiple-run statistics to allow direct assessment of the performance margins.
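One inexpensive way to attach error bars to a single evaluation run is a percentile bootstrap over test items, which needs no retraining. The sketch below is an illustration of that general technique, not a procedure the paper or the referee specifies; `bootstrap_ci` and its defaults are hypothetical, and it assumes a per-item score (for example 0/1 correctness) is available for each benchmark example.

```python
import random
from typing import Sequence, Tuple


def bootstrap_ci(
    scores: Sequence[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (mean, lower, upper) for a percentile-bootstrap interval on the mean."""
    if not scores:
        raise ValueError("scores must be non-empty")
    rng = random.Random(seed)
    n = len(scores)
    resampled_means = []
    for _ in range(n_resamples):
        # Resample test items with replacement and record the mean score.
        sample = [scores[rng.randrange(n)] for _ in range(n)]
        resampled_means.append(sum(sample) / n)
    resampled_means.sort()
    lower_idx = int((alpha / 2) * n_resamples)
    upper_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    mean = sum(scores) / n
    return mean, resampled_means[lower_idx], resampled_means[upper_idx]


if __name__ == "__main__":
    per_item_correct = [0, 1, 1, 0, 1, 1, 1, 0, 1, 1]  # toy data, not from the paper
    print(bootstrap_ci(per_item_correct))
```

An interval of this kind captures sampling noise in the test set only. It does not capture variance across training runs or seeds, which remains the limitation discussed in the rebuttal.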

minor comments (3)

[Abstract and Introduction] Abstract and §1: the phrase 'a diverse of audio-related tasks' contains a grammatical error and should be rephrased for clarity.
[Model Architecture] Architecture description: the distinction between the continuous-to-discrete LLM design and prior discrete-token audio models could be highlighted with a short comparison paragraph to make the novelty more immediately apparent.
[Data Curation] Data section: while the >13 M hour corpus size is stated, a breakdown by modality (speech/sound/music) and language distribution would help readers assess coverage and potential biases.

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the positive assessment and the recommendation for minor revision. The suggestion to strengthen the evaluation section is constructive, and we will incorporate additional details to improve reproducibility and transparency.

Referee: [Evaluation] Evaluation section: the SOTA claims rest on benchmark numbers, but the manuscript would be strengthened by explicit tables or appendices listing all baselines with their reported scores, the exact evaluation protocols (including any preprocessing or prompting details), and error bars or multiple-run statistics to allow direct assessment of the performance margins.

Authors: We agree that explicit compilation of baselines and protocols would enhance the manuscript. In the revised version, we will add a dedicated appendix with a table listing all compared baselines and their originally reported scores. We will also expand the evaluation section with precise descriptions of protocols, including preprocessing, prompting templates, and any other implementation specifics for each benchmark. For error bars and multiple-run statistics, our results follow standard single-run evaluation protocols common in the field; performing multiple independent runs at this scale was not feasible due to computational cost. We will explicitly state this limitation and its implications for margin interpretation in the revision.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

The paper is an empirical technical report describing an audio foundation model architecture (12.5 Hz tokenizer, LLM-based continuous-to-discrete design, flow-matching detokenizer), a >13M-hour pre-training corpus, post-training data pipeline, and benchmark results. No mathematical derivation chain, first-principles predictions, or equations exist that could reduce to inputs by construction. Central SOTA claims rest on reported experimental evaluations rather than self-referential logic, fitted parameters renamed as predictions, or load-bearing self-citations. The approach is self-contained with released code and checkpoints enabling external verification.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on standard deep-learning assumptions about scaling and tokenization rather than new physical laws; the tokenizer frequency and flow-matching hyperparameters are chosen design decisions rather than derived quantities.

free parameters (2)

audio tokenizer frame rate
12.5 Hz rate selected for efficiency-quality trade-off in the architecture description
flow matching detokenizer chunk size
Chunk-wise streaming parameter chosen to enable real-time generation
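For orientation on what the flow-matching hyperparameters control, a commonly used conditional flow-matching objective with a linear interpolation path is shown below. This is a generic textbook form offered as a sketch; this page does not reproduce the report's exact objective, and the symbols are assumptions: $x_1$ is a target acoustic feature, $x_0$ is Gaussian noise, $c$ is the conditioning derived from the discrete audio tokens, and $v_\theta$ is the learned velocity field.

```latex
\begin{align}
  x_t &= (1 - t)\,x_0 + t\,x_1,
      \qquad x_0 \sim \mathcal{N}(0, I),\quad t \sim \mathcal{U}[0, 1], \\
  \mathcal{L}_{\mathrm{FM}}(\theta)
      &= \mathbb{E}_{t,\,x_0,\,x_1}
         \left\| v_\theta(x_t, t, c) - (x_1 - x_0) \right\|^2 .
\end{align}
```

Under this formulation, generation integrates $\mathrm{d}x_t/\mathrm{d}t = v_\theta(x_t, t, c)$ from $t = 0$ to $t = 1$; in a chunk-wise setting that integration would be run once per chunk of tokens.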

axioms (2)

domain assumption: LLM backbones initialized from text pre-training can be effectively adapted to audio via continuous feature input and discrete token output
Invoked to justify the continual pre-training recipe on audio and text data
domain assumption: Large-scale curation of speech, sound, and music data yields generalizable representations for downstream audio tasks
Underlies the claim that 13 million hours of pre-training supports SOTA results

Forward citations

Cited by 27 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
cs.CV 2026-05 unverdicted novelty 8.0

TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
cs.SD 2026-04 unverdicted novelty 8.0

HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
cs.CR 2026-04 conditional novelty 8.0

Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
cs.SD 2026-04 unverdicted novelty 8.0

VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
cs.SD 2026-05 unverdicted novelty 7.0

AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
cs.CL 2026-05 unverdicted novelty 7.0

Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
cs.CL 2026-05 unverdicted novelty 7.0

VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL 2026-05 unverdicted novelty 7.0

TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
cs.CL 2026-04 unverdicted novelty 7.0

SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection
cs.SD 2026-04 unverdicted novelty 7.0

ICLAD combines in-context learning and comparison guidance in audio language models with a routing detector to boost generalization and explanations for audio deepfake detection, achieving up to 2x F1 gains on wild data.
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
cs.AI 2026-04 unverdicted novelty 7.0

ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
cs.CR 2026-04 unverdicted novelty 7.0

AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
eess.AS 2026-04 unverdicted novelty 7.0

HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
cs.SD 2026-04 unverdicted novelty 7.0

CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
eess.AS 2026-04 unverdicted novelty 7.0

Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
eess.AS 2026-05 unverdicted novelty 6.0

A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
cs.SD 2026-05 unverdicted novelty 6.0

VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
cs.CL 2026-04 unverdicted novelty 6.0

MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and em...
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
cs.SD 2026-04 unverdicted novelty 6.0

Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
cs.SD 2026-04 unverdicted novelty 6.0

SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
cs.CL 2026-04 unverdicted novelty 6.0

SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
cs.SD 2026-04 unverdicted novelty 6.0

NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
eess.AS 2026-04 unverdicted novelty 6.0

A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
cs.CL 2026-05 unverdicted novelty 5.0

TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
cs.SD 2026-04 unverdicted novelty 5.0

A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
cs.SD 2026-04 unverdicted novelty 5.0

TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
eess.AS 2026-04 unverdicted novelty 5.0

Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.

Reference graph

Works this paper leans on

88 extracted references · 88 canonical work pages · cited by 26 Pith papers · 15 internal anchors

[1]

Common voice: A massively-multilingual speech corpus,

Rosana Ardila et al. “Common voice: A massively-multilingual speech corpus”. In: arXiv preprint arXiv:1912.06670 (2019)

work page arXiv 1912
[2]

Infinity Instruct

Beijing Academy of Artificial Intelligence (BAAI). “Infinity Instruct”. In: arXiv preprint arXiv:2406.XXXX (2024). 21 Kimi-Audio Technical Report

work page 2024
[3]

Audiolm: a language modeling approach to audio generation

Zalán Borsos et al. “Audiolm: a language modeling approach to audio generation”. In: IEEE/ACM transactions on audio, speech, and language processing 31 (2023), pp. 2523–2533

work page 2023
[4]

Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline

Hui Bu et al. “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline”. In: 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA). IEEE. 2017, pp. 1–5

work page 2017
[5]

Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio,

Guoguo Chen et al. “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio”. In: arXiv preprint arXiv:2106.06909 (2021)

work page arXiv 2021
[6]

Vggsound: A large-scale audio-visual dataset

Honglie Chen et al. “Vggsound: A large-scale audio-visual dataset”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 721–725

work page 2020
[7]

Minmo: A multimodal large language model for seamless voice interaction.CoRR, abs/2501.06282, 2025

Qian Chen et al. “Minmo: A multimodal large language model for seamless voice interaction”. In:arXiv preprint arXiv:2501.06282 (2025)

work page arXiv 2025
[8]

Tan, and Haizhou Li

Yiming Chen et al. “V oiceBench: Benchmarking LLM-Based V oice Assistants”. In: arXiv preprint arXiv:2410.17196 (2024)

work page arXiv 2024
[9]

VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

Zesen Cheng et al. “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms”. In: arXiv preprint arXiv:2406.07476 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[10]

Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models

Yunfei Chu et al. “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models”. In: arXiv preprint arXiv:2311.07919 (2023)

work page internal anchor Pith review arXiv 2023
[11]

Qwen2-Audio Technical Report

Yunfei Chu et al. “Qwen2-audio technical report”. In: arXiv preprint arXiv:2407.10759 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

Fleurs: Few-shot learning evaluation of universal representations of speech

Alexis Conneau et al. “Fleurs: Few-shot learning evaluation of universal representations of speech”. In:2022 IEEE Spoken Language Technology Workshop (SLT). IEEE. 2023, pp. 798–805

work page 2022
[13]

DeepSeek-V3 Technical Report

DeepSeek-AI. DeepSeek-V3 Technical Report. 2024. arXiv: 2412.19437 [cs.CL] . URL: https://arxiv. org/abs/2412.19437

work page internal anchor Pith review Pith/arXiv arXiv 2024
[14]

Moshi: a speech-text foundation model for real-time dialogue

Alexandre Défossez et al. “Moshi: a speech-text foundation model for real-time dialogue”. In:arXiv preprint arXiv:2410.00037 (2024)

work page internal anchor Pith review arXiv 2024
[15]

OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data

Chandeepa Dissanayake et al. OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data. 2024. arXiv: 2404.12195 [cs.CL]

work page arXiv 2024
[16]

Clotho: An audio captioning dataset

Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. “Clotho: An audio captioning dataset”. In:ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 736–740

work page 2020
[17]

Aishell-2: Transform- ing mandarin asr research into industrial scale,

Jiayu Du et al. “Aishell-2: Transforming mandarin asr research into industrial scale”. In: arXiv preprint arXiv:1808.10583 (2018)

work page arXiv 2018
[18]

Llama-omni: Seamless speech interaction with large language models.arXiv preprint arXiv:2409.06666, 2024

Qingkai Fang et al. “Llama-omni: Seamless speech interaction with large language models”. In: arXiv preprint arXiv:2409.06666 (2024)

work page arXiv 2024
[19]

Fsd50k: an open dataset of human-labeled sound events

Eduardo Fonseca et al. “Fsd50k: an open dataset of human-labeled sound events”. In:IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), pp. 829–852

work page 2021
[20]

Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to- end speech recognition,

Zhifu Gao et al. “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition”. In: arXiv preprint arXiv:2206.08317 (2022)

work page arXiv 2022
[21]

OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia

Xuelong Geng et al. “OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia”. In: arXiv preprint arXiv:2501.13306 (2025)

work page arXiv 2025
[22]

Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities

Sreyan Ghosh et al. “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities”. In: arXiv preprint arXiv:2406.11768 (2024)

work page arXiv 2024
[23]

Audioclip: Extending clip to image, text and audio

Yuan Gong, Jin Yu, and James Glass. “V ocalsound: A Dataset for Improving Human V ocal Sounds Recognition”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022, pp. 151–155. DOI: 10.1109/ICASSP43922.2022.9746828

work page doi:10.1109/icassp43922.2022.9746828 2022
[24]

The Llama 3 Herd of Models

Aaron Grattafiori et al. “The llama 3 herd of models”. In: arXiv preprint arXiv:2407.21783 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[25]

Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation

Haorui He et al. “Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation”. In: arXiv preprint arXiv:2501.15907 (2025)

work page arXiv 2025
[26]

Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation

Haorui He et al. “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation”. In: 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE. 2024, pp. 885–890

work page 2024
[27]

Heittola et al

T. Heittola et al. TAU Urban Acoustic Scenes 2022 Mobile, Development dataset . Zenodo. Mar. 2022. DOI: 10.5281/zenodo.6337421

work page doi:10.5281/zenodo.6337421 2022
[28]

Step-audio: Unified understanding and generation in intelligent speech interaction, 2025

Ailin Huang et al. “Step-audio: Unified understanding and generation in intelligent speech interaction”. In: arXiv preprint arXiv:2502.11946 (2025)

work page arXiv 2025
[29]

GPT-4o System Card

Aaron Hurst et al. “Gpt-4o system card”. In: arXiv preprint arXiv:2410.21276 (2024). 22 Kimi-Audio Technical Report

work page internal anchor Pith review Pith/arXiv arXiv 2024
[30]

Surrey Audio-Visual Expressed Emotion (SAVEE) database

Philip Jackson and Sana ul haq. Surrey Audio-Visual Expressed Emotion (SAVEE) database. Apr. 2011

work page 2011
[31]

Cochlscene: Acquisition of acoustic scene data using crowdsourcing

Il-Young Jeong and Jeongsoo Park. “Cochlscene: Acquisition of acoustic scene data using crowdsourcing”. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE. 2022, pp. 17–21

work page 2022
[32]

MoonCast: High-Quality Zero-Shot Podcast Generation

Zeqian Ju et al. “MoonCast: High-Quality Zero-Shot Podcast Generation”. In: arXiv preprint arXiv:2503.14345 (2025)

work page arXiv 2025
[33]

Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context

Wei Kang et al. “Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 10991–10995

work page 2024
[34]

Audiocaps: Generating captions for audios in the wild

Chris Dongjoo Kim et al. “Audiocaps: Generating captions for audios in the wild”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 119–132

work page 2019
[35]

Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities,

Zhifeng Kong et al. “Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities”. In: arXiv preprint arXiv:2402.01831 (2024)

work page arXiv 2024
[36]

Tulu 3: Pushing Frontiers in Open Language Model Post-Training

Nathan Lambert et al. “T\" ulu 3: Pushing frontiers in open language model post-training”. In: arXiv preprint arXiv:2411.15124 (2024)

work page internal anchor Pith review Pith/arXiv arXiv 2024
[37]

V oicebox: Text-guided multilingual universal speech generation at scale

Matthew Le et al. “V oicebox: Text-guided multilingual universal speech generation at scale”. In:Advances in neural information processing systems 36 (2023), pp. 14005–14034

work page 2023
[38]

Bigvgan: A universal neural vocoder with large-scale training

Sang-gil Lee et al. “Bigvgan: A universal neural vocoder with large-scale training”. In: arXiv preprint arXiv:2206.04658 (2022)

[39]

Learning to answer questions in dynamic audio-visual scenarios

Guangyao Li et al. “Learning to answer questions in dynamic audio-visual scenarios”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 19108–19118

[40]

Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions

Jia Li et al. “Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions”. In: Hugging Face repository 13 (2024), p. 9

[41]

Baichuan-audio: A unified framework for end-to-end speech interaction

Tianpeng Li et al. “Baichuan-audio: A unified framework for end-to-end speech interaction”. In: arXiv preprint arXiv:2502.17239 (2025)

[42]

OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces

Wing Lian et al. OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces. https://huggingface.co/datasets/Open-Orca/OpenOrca. 2023

[43]

Clotho-aqa: A crowdsourced dataset for audio question answering

Samuel Lipping et al. “Clotho-aqa: A crowdsourced dataset for audio question answering”. In: 2022 30th European Signal Processing Conference (EUSIPCO). IEEE. 2022, pp. 1140–1144

[44]

Muon is Scalable for LLM Training

Jingyuan Liu et al. Muon is Scalable for LLM Training. 2025. arXiv: 2502.16982 [cs.LG]. URL: https://arxiv.org/abs/2502.16982

[45]

Convincing Audio Generation Based on LLM and Speech Tokenization

Rui-Bo Liu et al. “Convincing Audio Generation Based on LLM and Speech Tokenization”. In: 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE. 2024, pp. 591–595

[46]

Zero-shot voice conversion with diffusion transformers

Songting Liu. “Zero-shot Voice Conversion with Diffusion Transformers”. In: arXiv preprint arXiv:2411.09943 (2024)

[47]

The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English

Steven R Livingstone and Frank A Russo. “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English”. In: PLoS ONE 13.5 (2018), e0196391

[48]

Decoupled Weight Decay Regularization

Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: arXiv preprint arXiv:1711.05101 (2017)

[49]

Music Source Separation With Band-Split RNN

Yi Luo and Jianwei Yu. “Music Source Separation With Band-Split RNN”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023), pp. 1893–1901. DOI: 10.1109/TASLP.2023.3271145

[50]

Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model benchmark

Linhan Ma et al. “Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model benchmark”. In: arXiv preprint arXiv:2406.05763 (2024)

[51]

What is the ground truth? reliability of multi-annotator data for audio tagging

Irene Martín-Morató and Annamaria Mesaros. “What is the ground truth? reliability of multi-annotator data for audio tagging”. In: 2021 29th European Signal Processing Conference (EUSIPCO). IEEE. 2021, pp. 76–80

[52]

Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research

Xinhao Mei et al. “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing (2024)

[53]

TUT Database for Acoustic Scene Classification and Sound Event Detection

Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. “TUT Database for Acoustic Scene Classification and Sound Event Detection”. In: 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016

[54]

Librispeech: an asr corpus based on public domain audio books

Vassil Panayotov et al. “Librispeech: an asr corpus based on public domain audio books”. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2015, pp. 5206–5210

[55]

ESC: Dataset for Environmental Sound Classification

Karol J. Piczak. “ESC: Dataset for Environmental Sound Classification”. In: Proceedings of the 23rd Annual ACM Conference on Multimedia. Brisbane, Australia: ACM Press, Oct. 13, 2015, pp. 1015–1018. ISBN: 978-1-4503-3459-4. DOI: 10.1145/2733373.2806390. URL: http://dl.acm.org/citation.cfm?doid=2733373.2806390

[56]

Meld: A multimodal multi-party dataset for emotion recognition in conversations

Soujanya Poria et al. “Meld: A multimodal multi-party dataset for emotion recognition in conversations”. In: arXiv preprint arXiv:1810.02508 (2018)

[57]

MLS: A large-scale multilingual dataset for speech research

Vineel Pratap et al. “MLS: A Large-Scale Multilingual Dataset for Speech Research”. In: arXiv preprint arXiv:2012.03411 (2020)

[58]

Robust speech recognition via large-scale weak supervision

Alec Radford et al. “Robust speech recognition via large-scale weak supervision”. In: International conference on machine learning. PMLR. 2023, pp. 28492–28518

[59]

Nonspeech7k dataset: Classification and analysis of human non-speech sound

Muhammad Mamunur Rashid, Guiqing Li, and Chengrui Du. “Nonspeech7k dataset: Classification and analysis of human non-speech sound”. In: IET Signal Processing 17.6 (2023), e12233

[60]

MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark

S Sakshi et al. “Mmau: A massive multi-task audio understanding and reasoning benchmark”. In: arXiv preprint arXiv:2410.19168 (2024)

[61]

A dataset and taxonomy for urban sound research

Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. “A dataset and taxonomy for urban sound research”. In: Proceedings of the 22nd ACM international conference on Multimedia. 2014, pp. 1041–1044

[62]

Aishell-3: A multi-speaker mandarin tts corpus and the baselines

Yao Shi et al. “Aishell-3: A multi-speaker mandarin tts corpus and the baselines”. In: arXiv preprint arXiv:2010.11567 (2020)

[63]

Salmonn: Towards generic hearing abilities for large language models

Changli Tang et al. “Salmonn: Towards generic hearing abilities for large language models”. In: arXiv preprint arXiv:2310.13289 (2023)

[64]

Kespeech: An open source speech dataset of mandarin and its eight subdialects

Zhiyuan Tang et al. “Kespeech: An open source speech dataset of mandarin and its eight subdialects”. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021

[65]

Kimi k1.5: Scaling Reinforcement Learning with LLMs

Kimi Team et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs. 2025. arXiv: 2501.12599 [cs.AI]. URL: https://arxiv.org/abs/2501.12599

[66]

OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants

Teknium. OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants. 2023. URL: https://huggingface.co/datasets/teknium/OpenHermes-2.5

[67]

Synthia-70b-v1.2: Synthetic intelligent agent

Migel Tissera. Synthia-70b-v1.2: Synthetic intelligent agent. Hugging Face. 2023. URL: https://huggingface.co/migtissera/Synthia-13B

[68]

Multi-modal emotion recognition on iemocap dataset using deep learning

Samarth Tripathi, Sarthak Tripathi, and Homayoon Beigi. “Multi-modal emotion recognition on iemocap dataset using deep learning”. In: arXiv preprint arXiv:1804.05788 (2018)

[69]

VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation

Changhan Wang et al. “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation”. In: arXiv preprint arXiv:2101.00390 (2021)

[70]

Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers

Chengyi Wang et al. “Neural codec language models are zero-shot text to speech synthesizers”. In: arXiv preprint arXiv:2301.02111 (2023)

[71]

Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm

Xiong Wang et al. “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm”. In: arXiv preprint arXiv:2411.00774 (2024)

[72]

Mini-omni: Language models can hear, talk while thinking in streaming

Zhifei Xie and Changqiao Wu. “Mini-omni: Language models can hear, talk while thinking in streaming”. In: arXiv preprint arXiv:2408.16725 (2024)

[74]

Qwen2.5-Omni Technical Report

Jin Xu et al. Qwen2.5-Omni Technical Report. 2025. arXiv: 2503.20215 [cs.CL]. URL: https://arxiv.org/abs/2503.20215

[75]

Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing

Zhangchen Xu et al. “Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing”. In: arXiv preprint arXiv:2406.08464 (2024)

[76]

Qwen2.5 Technical Report

An Yang et al. “Qwen2.5 Technical Report”. In: arXiv preprint arXiv:2412.15115 (2024)

[77]

Uniaudio: An audio foundation model toward universal audio generation

Dongchao Yang et al. “Uniaudio: An audio foundation model toward universal audio generation”. In: arXiv preprint arXiv:2310.00704 (2023)

[78]

Avqa: A dataset for audio-visual question answering on videos

Pinci Yang et al. “Avqa: A dataset for audio-visual question answering on videos”. In: Proceedings of the 30th ACM international conference on multimedia. 2022, pp. 3480–3491

[79]

Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset

Zehui Yang et al. “Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset”. In: arXiv preprint arXiv:2203.16844 (2022)

[80]

Llasa: Scaling train-time and inference-time compute for llama-based speech synthesis

Zhen Ye et al. “Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis”. In: arXiv preprint arXiv:2502.04128 (2025)

[81]

Autoprep: An automatic preprocessing framework for in-the-wild speech data

Jianwei Yu et al. “Autoprep: An automatic preprocessing framework for in-the-wild speech data”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 1136–1140

