Audio-Cogito: Towards Deep Audio Reasoning in Large Audio Language Models
Pith reviewed 2026-05-10 14:06 UTC · model grok-4.3
The pith
Audio-Cogito builds chain-of-thought reasoning into large audio language models by curating 545k samples and applying self-distillation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Audio-Cogito is a fully open-source large audio language model. It uses Cogito-pipe to curate 545k high-quality audio reasoning samples and applies self-distillation during fine-tuning to instill chain-of-thought capabilities. The paper reports the strongest results among open-source models on the MMAR benchmark, parity with or gains over certain closed-source models on specific metrics, and a top-tier placement in the Interspeech 2026 Audio Reasoning Challenge.
What carries the argument
Cogito-pipe, a pipeline that generates high-quality audio reasoning samples containing explicit reasoning chains, paired with a self-distillation training strategy that transfers those chains into the model's responses.
If this is right
- Explicit chain-of-thought outputs become more consistent and accurate for complex audio understanding tasks.
- Once released (the abstract says release follows review), the 545k reasoning samples can be reused to train or evaluate other audio models.
- Open-source models can reach parity with some closed-source systems on audio reasoning metrics.
- The same curation-plus-distillation recipe applies to additional audio benchmarks and challenges.
Where Pith is reading between the lines
- The curation approach could be extended to create reasoning datasets for video or multimodal inputs that combine audio with other signals.
- Wider release of the data pipeline would allow independent labs to replicate and scale audio reasoning without proprietary resources.
- Performance on real-world recordings with background noise or accents would test whether the learned reasoning chains remain robust outside clean benchmarks.
Load-bearing premise
The 545k curated samples contain high-quality, generalizable audio reasoning chains that self-distillation can effectively transfer to new audio tasks.
What would settle it
The central claim would be falsified if the fine-tuned model, run on a fresh audio reasoning benchmark whose tasks were never seen during data curation or distillation, showed no gain over a baseline model trained on uncurated audio data.
Original abstract
Recent advances in reasoning models have driven significant progress in text and multimodal domains, yet audio reasoning remains relatively limited. Only a few Large Audio Language Models (LALMs) incorporate explicit Chain-of-Thought (CoT) reasoning, and their capabilities are often inconsistent and insufficient for complex tasks. To bridge this gap, we introduce Audio-Cogito, a fully open-source solution for deep audio reasoning. We develop Cogito-pipe for high-quality audio reasoning data curation, producing 545k reasoning samples that will be released after review. Based on this dataset, we adopt a self-distillation strategy for model fine-tuning. Experiments on the MMAR benchmark, the only audio benchmark evaluating the CoT process, show that our model achieves the best performance among open-source models and matches or surpasses certain closed-source models in specific metrics. Our approach also ranks among the top-tier systems in the Interspeech 2026 Audio Reasoning Challenge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Audio-Cogito, an open-source Large Audio Language Model (LALM) for explicit Chain-of-Thought (CoT) audio reasoning. It describes Cogito-pipe, a curation pipeline that generates 545k audio reasoning samples, which are then used in a self-distillation fine-tuning procedure. On the MMAR benchmark (the only audio benchmark that evaluates the CoT process), the resulting model is reported to achieve the highest scores among open-source LALMs and to match or exceed selected closed-source models on specific metrics; it also places among the top entries in the Interspeech 2026 Audio Reasoning Challenge.
Significance. If the performance claims are confirmed with complete experimental controls, the work would constitute a useful contribution by supplying both an open dataset of audio reasoning chains and a practical self-distillation recipe for improving CoT capabilities in the audio domain. The planned public release of the 545k samples would be a concrete community resource that parallels the role of CoT datasets in text LLMs.
Major comments (3)
- [§4] §4 (Experiments): The central claim that Audio-Cogito attains the best performance among open-source models on MMAR is only partially supported. The section provides no exhaustive list of baselines, their parameter counts, training regimes, or prompting formats, nor any statistical significance tests or confidence intervals on the reported scores. These omissions prevent assessment of whether the gains are robust or attributable to the proposed pipeline.
- [§3] §3 (Cogito-pipe): No analysis is given of possible overlap or leakage between the 545k curated samples and the MMAR test set. Because the pipeline draws from existing audio corpora, contamination remains a plausible alternative explanation for the observed benchmark gains and must be ruled out.
- [§4.3] §4.3 (Ablations): The manuscript contains no ablation studies that isolate the contribution of the self-distillation stage from the quality of the curated data alone, nor comparisons against standard supervised fine-tuning or other reasoning-enhancement techniques. Without these controls the attribution of improvements to the proposed method remains unverified.
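One way to supply the confidence intervals requested for §4 is a percentile bootstrap over per-item correctness. The sketch below is illustrative only: it assumes each model's MMAR outputs have already been reduced to a list of 0/1 scores, one per test item and in the same item order for both models. The function names, the resample count, and the synthetic scores are hypothetical and do not come from the paper.

```python
import random
from statistics import mean


def _percentile_interval(values, alpha):
    """Return the (alpha/2, 1 - alpha/2) percentile bounds of a list."""
    ordered = sorted(values)
    n = len(ordered)
    low = ordered[int((alpha / 2) * n)]
    high = ordered[max(int((1 - alpha / 2) * n) - 1, 0)]
    return low, high


def bootstrap_accuracy_ci(correct, n_resamples=2000, alpha=0.05, seed=0):
    """Accuracy plus a percentile bootstrap CI from per-item 0/1 scores."""
    rng = random.Random(seed)
    n = len(correct)
    resampled = [mean(rng.choices(correct, k=n)) for _ in range(n_resamples)]
    low, high = _percentile_interval(resampled, alpha)
    return mean(correct), low, high


def paired_bootstrap_diff(correct_a, correct_b, n_resamples=2000,
                          alpha=0.05, seed=0):
    """CI for accuracy(A) - accuracy(B), resampling the same items for both."""
    if len(correct_a) != len(correct_b):
        raise ValueError("both models must be scored on the same items")
    rng = random.Random(seed)
    # Per-item difference keeps the pairing between the two models.
    per_item = [a - b for a, b in zip(correct_a, correct_b)]
    n = len(per_item)
    resampled = [mean(rng.choices(per_item, k=n)) for _ in range(n_resamples)]
    low, high = _percentile_interval(resampled, alpha)
    return mean(per_item), low, high


# Synthetic scores for illustration only (not results from the paper).
model_a = [1] * 80 + [0] * 20
model_b = [1] * 60 + [0] * 40
print(bootstrap_accuracy_ci(model_a))
print(paired_bootstrap_diff(model_a, model_b))
```

If the interval returned by paired_bootstrap_diff excludes zero, the gap between the two models is unlikely to be a resampling artifact. This does not replace multi-seed reruns, which capture training variance rather than test-set variance.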
Minor comments (2)
- [Abstract] Abstract: The phrase 'matches or surpasses certain closed-source models in specific metrics' is too vague; naming the models and metrics would improve clarity.
- Throughout: Some citations to prior LALM and multimodal reasoning papers could be expanded to better situate the novelty of explicit audio CoT.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. The comments highlight important aspects for strengthening the experimental rigor and methodological validation of our work. We address each point below and will incorporate the necessary revisions.
Point-by-point responses
- Referee: [§4] §4 (Experiments): The central claim that Audio-Cogito attains the best performance among open-source models on MMAR is only partially supported. The section provides no exhaustive list of baselines, their parameter counts, training regimes, or prompting formats, nor any statistical significance tests or confidence intervals on the reported scores. These omissions prevent assessment of whether the gains are robust or attributable to the proposed pipeline.
  Authors: We agree that the current presentation of results is incomplete. In the revised manuscript we will expand §4 with a new comprehensive table that enumerates every open-source and closed-source baseline, including exact parameter counts, training data regimes, and the precise prompting formats employed. We will also rerun the MMAR evaluations across multiple random seeds to report means and standard deviations, and include paired statistical significance tests (e.g., t-tests with p-values) to substantiate the robustness of the reported gains.
  Revision: yes
- Referee: [§3] §3 (Cogito-pipe): No analysis is given of possible overlap or leakage between the 545k curated samples and the MMAR test set. Because the pipeline draws from existing audio corpora, contamination remains a plausible alternative explanation for the observed benchmark gains and must be ruled out.
  Authors: We recognize this as a critical issue. We will add a new subsection to §3 that quantifies potential leakage using both embedding cosine similarity (with a threshold) and n-gram overlap detection between the 545k Cogito-pipe samples and the MMAR test set. Any detected overlap will be reported with exact percentages, and contaminated samples will be removed prior to final training. The revised paper will include these results and the mitigation steps taken.
  Revision: yes
- Referee: [§4.3] §4.3 (Ablations): The manuscript contains no ablation studies that isolate the contribution of the self-distillation stage from the quality of the curated data alone, nor comparisons against standard supervised fine-tuning or other reasoning-enhancement techniques. Without these controls the attribution of improvements to the proposed method remains unverified.
  Authors: We concur that additional controls are required. The revised §4.3 will present new ablation experiments comparing (1) the base model, (2) supervised fine-tuning on the curated data without self-distillation, (3) the full self-distillation pipeline, and (4) alternative techniques such as standard CoT prompting and direct preference optimization. These results will isolate the incremental benefit of each stage and allow direct attribution of gains to the proposed method.
  Revision: yes
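The leakage analysis promised for §3 could take roughly the following shape. This is a minimal sketch under stated assumptions, not the authors' procedure: the record fields (id, text, emb), the 5-gram window, and both thresholds are hypothetical, and the embeddings are assumed to be precomputed by some external encoder that the sketch does not specify.

```python
import math
import re


def word_ngrams(text, n=5):
    """Set of lowercase word n-grams in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_containment(candidate, reference, n=5):
    """Fraction of the reference's n-grams that also occur in the candidate."""
    ref = word_ngrams(reference, n)
    if not ref:
        return 0.0
    return len(ref & word_ngrams(candidate, n)) / len(ref)


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def flag_leaks(train, test, n=5, ngram_thr=0.5, cos_thr=0.95):
    """Return (train_id, test_id, reason) for every suspicious pair.

    Each record is a dict with hypothetical keys 'id', 'text', and 'emb'.
    The thresholds are placeholders; real values would need calibration.
    """
    flagged = []
    for tr in train:
        for te in test:
            if ngram_containment(tr["text"], te["text"], n) >= ngram_thr:
                flagged.append((tr["id"], te["id"], "ngram"))
            elif cosine(tr["emb"], te["emb"]) >= cos_thr:
                flagged.append((tr["id"], te["id"], "embedding"))
    return flagged


def contamination_rate(flagged, test):
    """Share of test items matched by at least one training sample."""
    hit_ids = {test_id for _, test_id, _ in flagged}
    return len(hit_ids) / len(test) if test else 0.0
```

The nested loop is quadratic in the number of samples, so at the 545k scale a real check would need an inverted n-gram index or an approximate nearest-neighbour search instead. Text overlap also says nothing about reused audio clips, which would require a separate audio-level comparison.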
Circularity Check
No significant circularity in derivation chain
The paper describes an empirical pipeline: curation of 545k audio reasoning samples via Cogito-pipe followed by self-distillation fine-tuning, with performance measured on the external MMAR benchmark. No equations, fitted parameters, or derivations are present that could reduce to inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim (improved benchmark scores) rests on comparison to independent external models and data, making the argument self-contained without internal circularity.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Self-distillation from improved outputs reliably enhances reasoning capabilities in audio language models.
Invented entities (1)
- Cogito-pipe: no independent evidence