Kimi-Audio Technical Report
Pith reviewed 2026-05-11 19:15 UTC · model grok-4.3
The pith
Kimi-Audio reaches state-of-the-art results on speech recognition, audio understanding, question answering, and conversation tasks through a unified architecture and massive pre-training data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Kimi-Audio is initialized from a pre-trained LLM and continually pre-trained on both audio and text data with several carefully designed tasks before fine-tuning for diverse audio-related tasks. It employs a 12.5 Hz audio tokenizer, an LLM-based architecture that accepts continuous audio features as input and produces discrete tokens as output, and a chunk-wise streaming detokenizer based on flow matching. Supported by a pre-training dataset exceeding 13 million hours covering speech, sound, and music plus a pipeline for high-quality post-training data, the model achieves state-of-the-art performance on benchmarks for speech recognition, audio understanding, audio question answering, and speech conversation.
What carries the argument
The LLM-based architecture that takes continuous audio features as input and outputs discrete tokens, paired with a 12.5 Hz audio tokenizer and a flow-matching chunk-wise streaming detokenizer. This setup allows the language model backbone to directly process and generate audio content in a streaming manner after initialization and continued pre-training on mixed audio-text data.
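To make the 12.5 Hz figure concrete, the sketch below works through the arithmetic: at 12.5 tokens per second, each token spans 80 ms of audio. The function names are illustrative and do not come from the Kimi-Audio codebase, and the sketch assumes one token per frame with the final partial frame rounded up, which the summary above does not specify.

```python
import math

TOKEN_RATE_HZ = 12.5  # tokenizer frame rate stated in the report


def ms_per_token(rate_hz: float = TOKEN_RATE_HZ) -> float:
    """Duration of audio covered by a single token, in milliseconds."""
    return 1000.0 / rate_hz


def tokens_for_duration(seconds: float, rate_hz: float = TOKEN_RATE_HZ) -> int:
    """Tokens needed to cover `seconds` of audio.

    Assumption: one token per frame, with a trailing partial frame rounded up.
    """
    return math.ceil(seconds * rate_hz)


if __name__ == "__main__":
    print(ms_per_token())              # 80.0
    print(tokens_for_duration(10))     # 125
    print(tokens_for_duration(3600))   # 45000
```

Under these assumptions, one hour of audio corresponds to 45,000 tokens, which gives a rough sense of the sequence lengths the language model backbone has to handle.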
If this is right
- A single model can handle speech-to-text conversion, answering questions about audio content, and maintaining natural spoken conversations without switching between separate specialized systems.
- The streaming detokenizer design supports real-time audio output suitable for interactive voice applications.
- Open release of the model weights, training code, and evaluation tools enables direct reproduction and further development by anyone with sufficient compute resources.
- Training on 13 million hours spanning speech, environmental sounds, and music allows the model to generalize across varied audio inputs rather than requiring domain-specific versions.
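The streaming behaviour mentioned in the list above can be illustrated with a small sketch of chunk-wise processing: tokens are grouped into fixed-size chunks, and each chunk is handed to a decoder as soon as it is complete, so output can begin before the full sequence exists. This shows only the chunking pattern. The chunk size, the function names, and the `placeholder_decode` stand-in are assumptions for illustration; the report's actual detokenizer is based on flow matching, which is not reproduced here.

```python
from typing import Callable, Iterable, Iterator, List


def stream_chunks(tokens: Iterable[int], chunk_size: int) -> Iterator[List[int]]:
    """Yield consecutive chunks of tokens as soon as each chunk is full."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    chunk: List[int] = []
    for tok in tokens:
        chunk.append(tok)
        if len(chunk) == chunk_size:
            yield chunk
            chunk = []
    if chunk:  # flush the final, possibly shorter, chunk
        yield chunk


def stream_decode(
    tokens: Iterable[int],
    chunk_size: int,
    decode_chunk: Callable[[List[int]], List[float]],
) -> Iterator[List[float]]:
    """Decode a token stream chunk by chunk, yielding output incrementally."""
    for chunk in stream_chunks(tokens, chunk_size):
        yield decode_chunk(chunk)


def placeholder_decode(chunk: List[int]) -> List[float]:
    """Hypothetical stand-in for a real detokenizer; it only casts tokens to floats."""
    return [float(t) for t in chunk]


if __name__ == "__main__":
    for piece in stream_decode(range(7), chunk_size=3, decode_chunk=placeholder_decode):
        print(piece)
```

Because both functions are generators, a caller can start playing or forwarding the first decoded chunk while later tokens are still being produced.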
Where Pith is reading between the lines
- The same pattern of starting from a text LLM and adding audio tokenization could be tested on other non-text modalities to create unified foundation models.
- The emphasis on careful post-training data construction suggests that future audio models may benefit more from curation quality than from simply scaling data volume further.
- Because the model supports both understanding and generation in one framework, it could simplify building end-to-end voice assistants that process incoming audio and respond directly in audio without intermediate text steps.
Load-bearing premise
The post-training data curation pipeline produces high-quality and diverse examples that support broad generalization, and the benchmark results reflect fair comparisons without data leakage or hidden tuning.
What would settle it
Independent evaluation on a new, publicly available audio benchmark that was not used in training or post-training: scores that match the claimed state-of-the-art results would confirm them, and scores that fall short would refute them.
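For the speech recognition part of such a check, scores are commonly reported as word error rate (WER). The page does not say which metric the paper uses, so treating WER as the metric is an assumption here. The sketch computes WER as word-level edit distance divided by reference length; the function name is illustrative, and real evaluations usually also normalize text (casing, punctuation) before scoring, which this sketch omits.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix processed
    # so far and the first j words of the hypothesis.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion
                curr[j - 1] + 1,                  # insertion
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


if __name__ == "__main__":
    print(word_error_rate("the cat sat", "the cat sat"))  # 0.0
    print(word_error_rate("a b c d", "a b"))              # 0.5
```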
Original abstract
We present Kimi-Audio, an open-source audio foundation model that excels in audio understanding, generation, and conversation. We detail the practices in building Kimi-Audio, including model architecture, data curation, training recipe, inference deployment, and evaluation. Specifically, we leverage a 12.5Hz audio tokenizer, design a novel LLM-based architecture with continuous features as input and discrete tokens as output, and develop a chunk-wise streaming detokenizer based on flow matching. We curate a pre-training dataset that consists of more than 13 million hours of audio data covering a wide range of modalities including speech, sound, and music, and build a pipeline to construct high-quality and diverse post-training data. Initialized from a pre-trained LLM, Kimi-Audio is continual pre-trained on both audio and text data with several carefully designed tasks, and then fine-tuned to support a diverse of audio-related tasks. Extensive evaluation shows that Kimi-Audio achieves state-of-the-art performance on a range of audio benchmarks including speech recognition, audio understanding, audio question answering, and speech conversation. We release the codes, model checkpoints, as well as the evaluation toolkits in https://github.com/MoonshotAI/Kimi-Audio.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents Kimi-Audio, an open-source audio foundation model for understanding, generation, and conversation. It describes a 12.5 Hz audio tokenizer, an LLM-based architecture taking continuous features as input and producing discrete tokens as output, and a chunk-wise streaming detokenizer using flow matching. The model is initialized from a pre-trained LLM, continually pre-trained on >13 million hours of audio (speech, sound, music) plus text data with designed tasks, then fine-tuned; post-training uses a high-quality diverse data pipeline. Extensive evaluation reports SOTA results on speech recognition, audio understanding, audio QA, and speech conversation benchmarks. Codes, checkpoints, and evaluation toolkits are released.
Significance. If the benchmark results hold under the reported protocols, the work supplies a strong, reproducible open-source audio foundation model trained at large scale, along with detailed training practices and artifacts. This can serve as a practical baseline for the community, lowering barriers for research on audio understanding and conversational systems while enabling independent verification.
major comments (1)
- [Evaluation] Evaluation section: the SOTA claims rest on benchmark numbers, but the manuscript would be strengthened by explicit tables or appendices listing all baselines with their reported scores, the exact evaluation protocols (including any preprocessing or prompting details), and error bars or multiple-run statistics to allow direct assessment of the performance margins.
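As a sketch of what "error bars or multiple-run statistics" could look like in practice, the snippet below summarizes a list of per-run scores as mean and sample standard deviation. The function names are illustrative, and the scores in the usage example are made-up placeholders, not results from the paper.

```python
import statistics
from typing import Sequence, Tuple


def summarize_runs(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) over independent runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate a standard deviation")
    return statistics.mean(scores), statistics.stdev(scores)


def format_with_error_bar(scores: Sequence[float], digits: int = 2) -> str:
    """Format run scores as 'mean ± std' for a results table."""
    mean, std = summarize_runs(scores)
    return f"{mean:.{digits}f} ± {std:.{digits}f}"


if __name__ == "__main__":
    placeholder_scores = [1.0, 2.0, 3.0]  # hypothetical values for illustration only
    print(format_with_error_bar(placeholder_scores))  # 2.00 ± 1.00
```

Reporting scores in this form lets a reader judge whether a margin over a baseline is larger than the run-to-run variation.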
minor comments (3)
- [Abstract and Introduction] Abstract and §1: the phrase 'a diverse of audio-related tasks' contains a grammatical error and should be rephrased for clarity.
- [Model Architecture] Architecture description: the distinction between the continuous-to-discrete LLM design and prior discrete-token audio models could be highlighted with a short comparison paragraph to make the novelty more immediately apparent.
- [Data Curation] Data section: while the >13 M hour corpus size is stated, a breakdown by modality (speech/sound/music) and language distribution would help readers assess coverage and potential biases.
Simulated Author's Rebuttal
We thank the referee for the positive assessment and the recommendation for minor revision. The suggestion to strengthen the evaluation section is constructive, and we will incorporate additional details to improve reproducibility and transparency.
Point-by-point responses
-
Referee: [Evaluation] Evaluation section: the SOTA claims rest on benchmark numbers, but the manuscript would be strengthened by explicit tables or appendices listing all baselines with their reported scores, the exact evaluation protocols (including any preprocessing or prompting details), and error bars or multiple-run statistics to allow direct assessment of the performance margins.
Authors: We agree that explicit compilation of baselines and protocols would enhance the manuscript. In the revised version, we will add a dedicated appendix with a table listing all compared baselines and their originally reported scores. We will also expand the evaluation section with precise descriptions of protocols, including preprocessing, prompting templates, and any other implementation specifics for each benchmark. For error bars and multiple-run statistics, our results follow standard single-run evaluation protocols common in the field; performing multiple independent runs at this scale was not feasible due to computational cost. We will explicitly state this limitation and its implications for margin interpretation in the revision.
revision: partial
Circularity Check
No significant circularity detected
Full rationale
The paper is an empirical technical report describing an audio foundation model architecture (12.5 Hz tokenizer, LLM-based continuous-to-discrete design, flow-matching detokenizer), a >13M-hour pre-training corpus, post-training data pipeline, and benchmark results. No mathematical derivation chain, first-principles predictions, or equations exist that could reduce to inputs by construction. Central SOTA claims rest on reported experimental evaluations rather than self-referential logic, fitted parameters renamed as predictions, or load-bearing self-citations. The approach is self-contained with released code and checkpoints enabling external verification.
Axiom & Free-Parameter Ledger
free parameters (2)
- audio tokenizer frame rate
- flow matching detokenizer chunk size
axioms (2)
- domain assumption: LLM backbones initialized from text pre-training can be effectively adapted to audio via continuous feature input and discrete token output
- domain assumption: Large-scale curation of speech, sound, and music data yields generalizable representations for downstream audio tasks
Forward citations
Cited by 27 Pith papers
-
TraceAV-Bench: Benchmarking Multi-Hop Trajectory Reasoning over Long Audio-Visual Videos
TraceAV-Bench is the first benchmark for multi-hop trajectory reasoning over long audio-visual videos, showing top models reach only 51-68% accuracy with substantial room for improvement.
-
HalluAudio: A Comprehensive Benchmark for Hallucination Detection in Large Audio-Language Models
HalluAudio is the first large-scale benchmark spanning speech, environmental sound, and music that uses human-verified QA pairs, adversarial prompts, and mixed-audio tests to measure hallucinations in large audio-lang...
-
Benign Fine-Tuning Breaks Safety Alignment in Audio LLMs
Benign fine-tuning on audio data breaks safety alignment in Audio LLMs by raising jailbreak success rates up to 87%, with the dominant risk axis depending on model architecture and embedding proximity to harmful content.
-
VoxSafeBench: Not Just What Is Said, but Who, How, and Where
VoxSafeBench reveals that speech language models recognize social norms from text but fail to apply them when acoustic cues like speaker or scene determine the appropriate response.
-
AffectCodec: Emotion-Preserving Neural Speech Codec for Expressive Speech Modeling
AffectCodec is an emotion-guided neural speech codec that preserves emotional cues during quantization while maintaining semantic fidelity and prosodic naturalness.
-
How Should LLMs Listen While Speaking? A Study of User-Stream Routing in Full-Duplex Spoken Dialogue
Channel fusion gives better semantic grounding and QA performance in full-duplex LLM dialogue but is vulnerable to context corruption during interruptions, while cross-attention routing is more robust at the cost of w...
-
VITA-QinYu: Expressive Spoken Language Model for Role-Playing and Singing
VITA-QinYu is the first expressive end-to-end spoken language model supporting role-playing and singing alongside conversation, trained on 15.8K hours of data and outperforming prior models on expressiveness and conve...
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM minimizes the speech-text modality gap from the input side via a prosody-aware unified encoder, delivering the lowest gap and strong performance at 3B/7B scales with only ~1000 hours of audio.
-
SpeechParaling-Bench: A Comprehensive Benchmark for Paralinguistic-Aware Speech Generation
SpeechParaling-Bench is a new evaluation framework for paralinguistic-aware speech generation that reveals major limitations in current large audio-language models.
-
ICLAD: In-Context Learning with Comparison-Guidance for Audio Deepfake Detection
ICLAD combines in-context learning and comparison guidance in audio language models with a routing detector to boost generalization and explanations for audio deepfake detection, achieving up to 2x F1 gains on wild data.
-
From Reactive to Proactive: Assessing the Proactivity of Voice Agents via ProVoice-Bench
ProVoice-Bench is the first framework to evaluate proactive voice agents, revealing that state-of-the-art multimodal LLMs struggle with over-triggering and context-aware reasoning.
-
Hijacking Large Audio-Language Models via Context-Agnostic and Imperceptible Auditory Prompt Injection
AudioHijack generates imperceptible adversarial audio via gradient estimation, attention supervision, and reverberation blending to hijack 13 LALMs with 79-96% success on unseen contexts and real commercial agents.
-
HumDial-EIBench: A Human-Recorded Multi-Turn Emotional Intelligence Benchmark for Audio Language Models
HumDial-EIBench is a new benchmark using real human dialogues to evaluate audio language models on emotional intelligence tasks including multi-turn tracking, causal reasoning, empathy generation, and acoustic-semanti...
-
CapTalk: Unified Voice Design for Single-Utterance and Dialogue Speech Generation
CapTalk unifies single-utterance and dialogue voice design via utterance- and speaker-level captions plus a hierarchical variational module for stable timbre with adaptive expression.
-
Speaker-Reasoner: Scaling Interaction Turns and Reasoning Patterns for Timestamped Speaker-Attributed ASR
Speaker-Reasoner is an end-to-end speech LLM that iteratively analyzes audio structure, predicts temporal boundaries, and jointly models speaker identity, gender, timestamps, and transcription using a speaker-aware ca...
-
Towards Fine-Grained Multi-Dimensional Speech Understanding: Data Pipeline, Benchmark, and Model
A data pipeline, 14-dimension benchmark, and decoupled fine-tuning model are presented to advance fine-grained multi-dimensional speech understanding in LLMs.
-
VocalParse: Towards Unified and Scalable Singing Voice Transcription with Large Audio Language Models
VocalParse applies interleaved and Chain-of-Thought prompting to a Large Audio Language Model to jointly transcribe lyrics, melody and word-note alignments, achieving state-of-the-art results on multiple singing datasets.
-
MoVE: Translating Laughter and Tears via Mixture of Vocalization Experts in Speech-to-Speech Translation
MoVE uses specialized LoRA expert adapters and a soft router to translate non-verbal vocalizations in S2ST, reproducing them in 76% of cases versus at most 14% for baselines while scoring highest on naturalness and em...
-
Audio2Tool: Speak, Call, Act -- A Dataset for Benchmarking Speech Tool Use
Audio2Tool is a new benchmark dataset that shows speech models perform well on simple commands but degrade sharply on compositional tasks and realistic acoustic noise.
-
SpotSound: Enhancing Large Audio-Language Models with Fine-Grained Temporal Grounding
SpotSound adds a hallucination-suppressing objective and a needle-in-haystack benchmark to audio-language models, reaching state-of-the-art temporal grounding while keeping general task performance.
-
Bridging What the Model Thinks and How It Speaks: Self-Aware Speech Language Models for Expressive Speech Generation
SA-SLM uses variational information bottleneck for intent-aware bridging and self-criticism for realization-aware alignment to close the semantic-acoustic gap, outperforming open-source models and nearing GPT-4o-Audio...
-
Noise-Aware In-Context Learning for Hallucination Mitigation in ALLMs
NAICL reduces hallucination rates in ALLMs from 26.53% to 16.98% via noise priors in context and introduces the Clotho-1K benchmark with four hallucination types.
-
Rethinking Entropy Allocation in LLM-based ASR: Understanding the Dynamics between Speech Encoders and LLMs
A multi-stage training method for LLM-based ASR uses new entropy allocation metrics to achieve competitive benchmark performance with 2.3B parameters while mitigating hallucinations via better encoder-LLM decoupling.
-
Minimizing Modality Gap from the Input Side: Your Speech LLM Can Be a Prosody-Aware Text LLM
TextPro-SLM reduces the speech-text modality gap by feeding an LLM backbone with synchronized text tokens and prosody embeddings from WhisperPro, achieving lowest gap scores at 3B/7B scales with roughly 1,000 hours of audio.
-
Audio-DeepThinker: Progressive Reasoning-Aware Reinforcement Learning for High-Quality Chain-of-Thought Emergence in Audio Language Models
A hybrid-reward progressive RL curriculum enables high-quality chain-of-thought to emerge in audio language models without prior supervised CoT training, yielding SOTA results on MMAR, MMAU, and MMSU benchmarks.
-
Towards Fine-grained Temporal Perception: Post-Training Large Audio-Language Models with Audio-Side Time Prompt
TimePro-RL interleaves timestamp embeddings in audio sequences and applies RL post-SFT to boost temporal alignment in LALMs, yielding gains on grounding, event detection, and dense captioning.
-
Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Audio-Cogito is an open-source LALM using Cogito-pipe data curation and self-distillation to achieve leading open-source performance on audio reasoning benchmarks.
Reference graph
Works this paper leans on
-
[1]
Common voice: A massively-multilingual speech corpus
Rosana Ardila et al. “Common voice: A massively-multilingual speech corpus”. In: arXiv preprint arXiv:1912.06670 (2019)
-
[2]
Beijing Academy of Artificial Intelligence (BAAI). “Infinity Instruct”. In: arXiv preprint arXiv:2406.XXXX (2024)
-
[3]
Audiolm: a language modeling approach to audio generation
Zalán Borsos et al. “Audiolm: a language modeling approach to audio generation”. In: IEEE/ACM transactions on audio, speech, and language processing 31 (2023), pp. 2523–2533
-
[4]
Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline
Hui Bu et al. “Aishell-1: An open-source mandarin speech corpus and a speech recognition baseline”. In: 2017 20th conference of the oriental chapter of the international coordinating committee on speech databases and speech I/O systems and assessment (O-COCOSDA). IEEE. 2017, pp. 1–5
-
[5]
Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio
Guoguo Chen et al. “Gigaspeech: An evolving, multi-domain asr corpus with 10,000 hours of transcribed audio”. In: arXiv preprint arXiv:2106.06909 (2021)
-
[6]
Vggsound: A large-scale audio-visual dataset
Honglie Chen et al. “Vggsound: A large-scale audio-visual dataset”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 721–725
-
[7]
Minmo: A multimodal large language model for seamless voice interaction
Qian Chen et al. “Minmo: A multimodal large language model for seamless voice interaction”. In: arXiv preprint arXiv:2501.06282 (2025)
-
[8]
Yiming Chen et al. “VoiceBench: Benchmarking LLM-Based Voice Assistants”. In: arXiv preprint arXiv:2410.17196 (2024)
-
[9]
VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs
Zesen Cheng et al. “Videollama 2: Advancing spatial-temporal modeling and audio understanding in video-llms”. In: arXiv preprint arXiv:2406.07476 (2024)
-
[10]
Qwen-Audio: Advancing Universal Audio Understanding via Unified Large-Scale Audio-Language Models
Yunfei Chu et al. “Qwen-audio: Advancing universal audio understanding via unified large-scale audio-language models”. In: arXiv preprint arXiv:2311.07919 (2023)
-
[11]
Yunfei Chu et al. “Qwen2-audio technical report”. In: arXiv preprint arXiv:2407.10759 (2024)
-
[12]
Fleurs: Few-shot learning evaluation of universal representations of speech
Alexis Conneau et al. “Fleurs: Few-shot learning evaluation of universal representations of speech”. In: 2022 IEEE Spoken Language Technology Workshop (SLT). IEEE. 2023, pp. 798–805
-
[13]
DeepSeek-AI. DeepSeek-V3 Technical Report. 2024. arXiv: 2412.19437 [cs.CL]. URL: https://arxiv.org/abs/2412.19437
-
[14]
Moshi: a speech-text foundation model for real-time dialogue
Alexandre Défossez et al. “Moshi: a speech-text foundation model for real-time dialogue”. In: arXiv preprint arXiv:2410.00037 (2024)
-
[15]
OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data
Chandeepa Dissanayake et al. OpenBezoar: Small, Cost-Effective and Open Models Trained on Mixes of Instruction Data. 2024. arXiv: 2404.12195 [cs.CL]
-
[16]
Clotho: An audio captioning dataset
Konstantinos Drossos, Samuel Lipping, and Tuomas Virtanen. “Clotho: An audio captioning dataset”. In: ICASSP 2020-2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2020, pp. 736–740
-
[17]
Aishell-2: Transforming mandarin asr research into industrial scale
Jiayu Du et al. “Aishell-2: Transforming mandarin asr research into industrial scale”. In: arXiv preprint arXiv:1808.10583 (2018)
-
[18]
Qingkai Fang et al. “Llama-omni: Seamless speech interaction with large language models”. In: arXiv preprint arXiv:2409.06666 (2024)
-
[19]
Fsd50k: an open dataset of human-labeled sound events
Eduardo Fonseca et al. “Fsd50k: an open dataset of human-labeled sound events”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 30 (2021), pp. 829–852
-
[20]
Zhifu Gao et al. “Paraformer: Fast and accurate parallel transformer for non-autoregressive end-to-end speech recognition”. In: arXiv preprint arXiv:2206.08317 (2022)
-
[21]
OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia
Xuelong Geng et al. “OSUM: Advancing Open Speech Understanding Models with Limited Resources in Academia”. In: arXiv preprint arXiv:2501.13306 (2025)
-
[22]
Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities
Sreyan Ghosh et al. “Gama: A large audio-language model with advanced audio understanding and complex reasoning abilities”. In: arXiv preprint arXiv:2406.11768 (2024)
-
[23]
Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition
Yuan Gong, Jin Yu, and James Glass. “Vocalsound: A Dataset for Improving Human Vocal Sounds Recognition”. In: ICASSP 2022 - 2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). 2022, pp. 151–155. DOI: 10.1109/ICASSP43922.2022.9746828
-
[24]
Aaron Grattafiori et al. “The llama 3 herd of models”. In: arXiv preprint arXiv:2407.21783 (2024)
-
[25]
Emilia: A large-scale, extensive, multilingual, and diverse dataset for speech generation
Haorui He et al. “Emilia: A Large-Scale, Extensive, Multilingual, and Diverse Dataset for Speech Generation”. In: arXiv preprint arXiv:2501.15907 (2025)
-
[26]
Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation
Haorui He et al. “Emilia: An extensive, multilingual, and diverse speech dataset for large-scale speech generation”. In: 2024 IEEE Spoken Language Technology Workshop (SLT). IEEE. 2024, pp. 885–890
-
[27]
T. Heittola et al. TAU Urban Acoustic Scenes 2022 Mobile, Development dataset. Zenodo. Mar. 2022. DOI: 10.5281/zenodo.6337421
-
[28]
Step-audio: Unified understanding and generation in intelligent speech interaction
Ailin Huang et al. “Step-audio: Unified understanding and generation in intelligent speech interaction”. In: arXiv preprint arXiv:2502.11946 (2025)
-
[29]
Aaron Hurst et al. “Gpt-4o system card”. In: arXiv preprint arXiv:2410.21276 (2024)
-
[30]
Surrey Audio-Visual Expressed Emotion (SAVEE) database
Philip Jackson and Sana ul haq. Surrey Audio-Visual Expressed Emotion (SAVEE) database. Apr. 2011
-
[31]
Cochlscene: Acquisition of acoustic scene data using crowdsourcing
Il-Young Jeong and Jeongsoo Park. “Cochlscene: Acquisition of acoustic scene data using crowdsourcing”. In: 2022 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). IEEE. 2022, pp. 17–21
-
[32]
MoonCast: High-Quality Zero-Shot Podcast Generation
Zeqian Ju et al. “MoonCast: High-Quality Zero-Shot Podcast Generation”. In: arXiv preprint arXiv:2503.14345 (2025)
-
[33]
Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context
Wei Kang et al. “Libriheavy: A 50,000 hours ASR corpus with punctuation casing and context”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 10991–10995
-
[34]
Audiocaps: Generating captions for audios in the wild
Chris Dongjoo Kim et al. “Audiocaps: Generating captions for audios in the wild”. In: Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers). 2019, pp. 119–132
-
[35]
Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities
Zhifeng Kong et al. “Audio flamingo: A novel audio language model with few-shot learning and dialogue abilities”. In: arXiv preprint arXiv:2402.01831 (2024)
-
[36]
Tulu 3: Pushing Frontiers in Open Language Model Post-Training
Nathan Lambert et al. “Tülu 3: Pushing frontiers in open language model post-training”. In: arXiv preprint arXiv:2411.15124 (2024)
-
[37]
Voicebox: Text-guided multilingual universal speech generation at scale
Matthew Le et al. “Voicebox: Text-guided multilingual universal speech generation at scale”. In: Advances in neural information processing systems 36 (2023), pp. 14005–14034
-
[38]
Bigvgan: A universal neural vocoder with large-scale training
Sang-gil Lee et al. “Bigvgan: A universal neural vocoder with large-scale training”. In: arXiv preprint arXiv:2206.04658 (2022)
-
[39]
Learning to answer questions in dynamic audio-visual scenarios
Guangyao Li et al. “Learning to answer questions in dynamic audio-visual scenarios”. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022, pp. 19108–19118
-
[40]
Jia Li et al. “Numinamath: The largest public dataset in ai4maths with 860k pairs of competition math problems and solutions”. In: Hugging Face repository 13 (2024), p. 9
-
[41]
Baichuan-audio: A unified framework for end-to-end speech interaction
Tianpeng Li et al. “Baichuan-audio: A unified framework for end-to-end speech interaction”. In: arXiv preprint arXiv:2502.17239 (2025)
-
[42]
OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces
Wing Lian et al. OpenOrca: An Open Dataset of GPT Augmented FLAN Reasoning Traces. https://huggingface.co/datasets/Open-Orca/OpenOrca. 2023
-
[43]
Clotho-aqa: A crowdsourced dataset for audio question answering
Samuel Lipping et al. “Clotho-aqa: A crowdsourced dataset for audio question answering”. In: 2022 30th European Signal Processing Conference (EUSIPCO). IEEE. 2022, pp. 1140–1144
-
[44]
Muon is Scalable for LLM Training
Jingyuan Liu et al. Muon is Scalable for LLM Training. 2025. arXiv: 2502.16982 [cs.LG]. URL: https://arxiv.org/abs/2502.16982
-
[45]
Convincing Audio Generation Based on LLM and Speech Tokenization
Rui-Bo Liu et al. “Convincing Audio Generation Based on LLM and Speech Tokenization”. In: 2024 IEEE 14th International Symposium on Chinese Spoken Language Processing (ISCSLP). IEEE. 2024, pp. 591–595
-
[46]
Zero-shot voice conversion with diffusion transformers
Songting Liu. “Zero-shot Voice Conversion with Diffusion Transformers”. In: arXiv preprint arXiv:2411.09943 (2024)
-
[47]
Steven R Livingstone and Frank A Russo. “The Ryerson Audio-Visual Database of Emotional Speech and Song (RAVDESS): A dynamic, multimodal set of facial and vocal expressions in North American English”. In: PloS one 13.5 (2018), e0196391
-
[48]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. “Decoupled weight decay regularization”. In: arXiv preprint arXiv:1711.05101 (2017)
-
[49]
Music Source Separation With Band-Split RNN
Yi Luo and Jianwei Yu. “Music Source Separation With Band-Split RNN”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing 31 (2023), pp. 1893–1901. DOI: 10.1109/TASLP.2023.3271145
-
[50]
Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model benchmark
Linhan Ma et al. “Wenetspeech4tts: A 12,800-hour mandarin tts corpus for large speech generation model benchmark”. In: arXiv preprint arXiv:2406.05763 (2024)
-
[51]
What is the ground truth? reliability of multi-annotator data for audio tagging
Irene Martín-Morató and Annamaria Mesaros. “What is the ground truth? reliability of multi-annotator data for audio tagging”. In: 2021 29th European Signal Processing Conference (EUSIPCO). IEEE. 2021, pp. 76–80
-
[52]
Xinhao Mei et al. “Wavcaps: A chatgpt-assisted weakly-labelled audio captioning dataset for audio-language multimodal research”. In: IEEE/ACM Transactions on Audio, Speech, and Language Processing (2024)
-
[53]
TUT Database for Acoustic Scene Classification and Sound Event Detection
Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. “TUT Database for Acoustic Scene Classification and Sound Event Detection”. In: 24th European Signal Processing Conference 2016 (EUSIPCO 2016). Budapest, Hungary, 2016
-
[54]
Librispeech: an asr corpus based on public domain audio books
Vassil Panayotov et al. “Librispeech: an asr corpus based on public domain audio books”. In: 2015 IEEE international conference on acoustics, speech and signal processing (ICASSP). IEEE. 2015, pp. 5206–5210
-
[55]
Karol J. Piczak. “ESC: Dataset for Environmental Sound Classification”. In: Proceedings of the 23rd Annual ACM Conference on Multimedia. Brisbane, Australia: ACM Press, Oct. 13, 2015, pp. 1015–1018. ISBN: 978-1-4503-3459-4. DOI: 10.1145/2733373.2806390. URL: http://dl.acm.org/citation.cfm?doid=2733373.2806390
-
[56]
Meld: A multimodal multi-party dataset for emotion recognition in conversations
Soujanya Poria et al. “Meld: A multimodal multi-party dataset for emotion recognition in conversations”. In: arXiv preprint arXiv:1810.02508 (2018)
-
[57]
MLS: A Large-Scale Multilingual Dataset for Speech Research
Vineel Pratap et al. “MLS: A Large-Scale Multilingual Dataset for Speech Research”. In: ArXiv abs/2012.03411 (2020)
-
[58]
Robust speech recognition via large-scale weak supervision
Alec Radford et al. “Robust speech recognition via large-scale weak supervision”. In: International conference on machine learning. PMLR. 2023, pp. 28492–28518
-
[59]
Nonspeech7k dataset: Classification and analysis of human non-speech sound
Muhammad Mamunur Rashid, Guiqing Li, and Chengrui Du. “Nonspeech7k dataset: Classification and analysis of human non-speech sound”. In: IET Signal Processing 17.6 (2023), e12233
-
[60]
MMAU: A Massive Multi-Task Audio Understanding and Reasoning Benchmark
S Sakshi et al. “Mmau: A massive multi-task audio understanding and reasoning benchmark”. In: arXiv preprint arXiv:2410.19168 (2024)
-
[61]
A dataset and taxonomy for urban sound research
Justin Salamon, Christopher Jacoby, and Juan Pablo Bello. “A dataset and taxonomy for urban sound research”. In: Proceedings of the 22nd ACM international conference on Multimedia. 2014, pp. 1041–1044
-
[62]
Aishell-3: A multi-speaker mandarin tts corpus and the baselines
Yao Shi et al. “Aishell-3: A multi-speaker mandarin tts corpus and the baselines”. In: arXiv preprint arXiv:2010.11567 (2020)
-
[63]
Changli Tang et al. “Salmonn: Towards generic hearing abilities for large language models”. In: arXiv preprint arXiv:2310.13289 (2023)
-
[64]
Kespeech: An open source speech dataset of mandarin and its eight subdialects
Zhiyuan Tang et al. “Kespeech: An open source speech dataset of mandarin and its eight subdialects”. In: Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2). 2021
-
[65]
Kimi k1.5: Scaling Reinforcement Learning with LLMs
Kimi Team et al. Kimi k1.5: Scaling Reinforcement Learning with LLMs. 2025. arXiv: 2501.12599 [cs.AI]. URL: https://arxiv.org/abs/2501.12599
-
[66]
OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants
Teknium. OpenHermes 2.5: An Open Dataset of Synthetic Data for Generalist LLM Assistants. 2023. URL: https://huggingface.co/datasets/teknium/OpenHermes-2.5
-
[67]
Synthia-70b-v1.2: Synthetic intelligent agent
Migel Tissera. Synthia-70b-v1.2: Synthetic intelligent agent. Hugging Face. 2023. URL: https://huggingface.co/migtissera/Synthia-13B
-
[68]
Multi-modal emotion recognition on iemocap dataset using deep learning
Samarth Tripathi, Sarthak Tripathi, and Homayoon Beigi. “Multi-modal emotion recognition on iemocap dataset using deep learning”. In: arXiv preprint arXiv:1804.05788 (2018)
-
[69]
Changhan Wang et al. “VoxPopuli: A large-scale multilingual speech corpus for representation learning, semi-supervised learning and interpretation”. In: arXiv preprint arXiv:2101.00390 (2021)
-
[70]
Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers
Chengyi Wang et al. “Neural codec language models are zero-shot text to speech synthesizers”. In: arXiv preprint arXiv:2301.02111 (2023)
-
[71]
Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm
Xiong Wang et al. “Freeze-omni: A smart and low latency speech-to-speech dialogue model with frozen llm”. In: arXiv preprint arXiv:2411.00774 (2024)
-
[72]
Mini-omni: Language models can hear, talk while thinking in streaming
Zhifei Xie and Changqiao Wu. “Mini-omni: Language models can hear, talk while thinking in streaming”. In: arXiv preprint arXiv:2408.16725 (2024)
-
[74]
Jin Xu et al. Qwen2.5-Omni Technical Report. 2025. arXiv: 2503.20215 [cs.CL]. URL: https://arxiv.org/abs/2503.20215
-
[75]
Zhangchen Xu et al. “Magpie: Alignment data synthesis from scratch by prompting aligned llms with nothing”. In: arXiv preprint arXiv:2406.08464 (2024)
-
[76]
An Yang et al. “Qwen2.5 Technical Report”. In: arXiv preprint arXiv:2412.15115 (2024)
-
[77]
Uniaudio: An audio foundation model toward universal audio generation
Dongchao Yang et al. “Uniaudio: An audio foundation model toward universal audio generation”. In: arXiv preprint arXiv:2310.00704 (2023)
-
[78]
Avqa: A dataset for audio-visual question answering on videos
Pinci Yang et al. “Avqa: A dataset for audio-visual question answering on videos”. In: Proceedings of the 30th ACM international conference on multimedia. 2022, pp. 3480–3491
-
[79]
Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset
Zehui Yang et al. “Open source magicdata-ramc: A rich annotated mandarin conversational (ramc) speech dataset”. In: arXiv preprint arXiv:2203.16844 (2022)
-
[80]
Zhen Ye et al. “Llasa: Scaling Train-Time and Inference-Time Compute for Llama-based Speech Synthesis”. In: arXiv preprint arXiv:2502.04128 (2025).
-
[81]
Autoprep: An automatic preprocessing framework for in-the-wild speech data
Jianwei Yu et al. “Autoprep: An automatic preprocessing framework for in-the-wild speech data”. In: ICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE. 2024, pp. 1136–1140