PlanRAG-Audio: Planning and Retrieval Augmented Generation for Long-form Audio Understanding

arxiv: 2605.20414 · v1 · pith:GTMPJVJLnew · submitted 2026-05-19 · 📡 eess.AS

Masao Someki, Chien-yu Huang, Siddhant Arora, Samuele Cornell, Markus Müller, Nathan Susanj, Rupak V Swaminathan, Grant P Strimel, Jing Liu, Shinji Watanabe

Pith reviewed 2026-05-21 06:51 UTC · model grok-4.3

classification 📡 eess.AS

keywords: long-form audio · retrieval-augmented generation · planning · audio reasoning · temporal spans · modality selection · large audio models

The pith

PlanRAG-Audio plans required modalities and time spans then retrieves only relevant segments to improve accuracy on long audio.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to show that large audio language models can reason more reliably over long recordings if they first decide which types of information and which time periods matter for a query, then pull only those pieces from a prepared database instead of ingesting the full audio. This matters because current models lose accuracy or become too costly once audio grows long and the needed clues such as speech content, speaker identity, emotion, or sound events are scattered across time. If the claim holds, models would maintain steady performance on complex queries while the amount of data sent to them stays bounded by the query rather than the recording length. A reader interested in practical audio analysis would see this as a way to handle meetings, broadcasts, or surveillance audio without the current length barriers.

Core claim

PlanRAG-Audio is a planning-based retrieval-augmented generation framework that explicitly plans which modalities and temporal spans are required for a given query and retrieves only query-relevant information from a structured text and audio database rather than processing entire recordings directly. This retrieval planning enables effective reasoning over complex, cross-domain audio queries while substantially reducing the input length passed to the large language models. Experiments show improved reasoning accuracy and stabilized performance as audio duration increases by decoupling inference cost from raw audio length.

What carries the argument

The explicit planning of modalities and temporal spans followed by targeted retrieval from a structured text and audio database.
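
To make the plan-then-retrieve idea concrete, here is a minimal Python sketch. It is an illustration under stated assumptions, not the authors' implementation: the `Record` and `Plan` schemas, the modality names, the sample data, and the hand-written plan are all hypothetical, and the real planner is presumably a language model rather than a fixed object.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Record:
    """One time-aligned database entry (hypothetical schema)."""
    modality: str   # e.g. "transcript", "emotion", "sound_event"
    start: float    # seconds
    end: float      # seconds
    text: str       # textual description of the cue


@dataclass(frozen=True)
class Plan:
    """Assumed planner output: modalities and time spans to fetch."""
    modalities: frozenset[str]
    spans: tuple[tuple[float, float], ...]


def retrieve(database: list[Record], plan: Plan) -> list[Record]:
    """Return only records in a planned modality that overlap a planned span."""
    def overlaps(rec: Record) -> bool:
        return any(rec.start < hi and rec.end > lo for lo, hi in plan.spans)

    hits = [r for r in database if r.modality in plan.modalities and overlaps(r)]
    return sorted(hits, key=lambda r: (r.start, r.modality))


def build_context(records: list[Record]) -> str:
    """Serialize retrieved records into the text passed to the language model."""
    return "\n".join(
        f"[{r.start:.1f}-{r.end:.1f}s] {r.modality}: {r.text}" for r in records
    )


# Made-up database for an hour-long recording.
database = [
    Record("transcript", 0.0, 12.0, "Alice opens the meeting."),
    Record("emotion", 0.0, 12.0, "neutral"),
    Record("sound_event", 3540.0, 3542.0, "door slam"),
    Record("transcript", 3538.0, 3550.0, "Bob: I am done with this."),
    Record("emotion", 3538.0, 3550.0, "angry"),
]

# Hand-written stand-in for the planner's decision on a query such as
# "How did Bob sound near the end of the meeting?"
plan = Plan(frozenset({"transcript", "emotion"}), ((3500.0, 3600.0),))
context = build_context(retrieve(database, plan))
print(context)
```

In this toy run the door-slam record is never retrieved, because the plan did not request the `sound_event` stream. The context size depends on the plan rather than on the recording length, and anything the plan omits is invisible to the downstream model.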

Load-bearing premise

The structured database must contain every query-relevant acoustic cue and the planning step must correctly select the needed modalities and time spans without leaving out critical details.

What would settle it

A set of queries where the planner omits a time span containing a decisive sound event or emotion and the model then gives wrong answers that a full-audio baseline would have answered correctly.

Figures

Figures reproduced from arXiv: 2605.20414 by Masao Someki, Chien-yu Huang, Siddhant Arora, Samuele Cornell, Markus Müller, Nathan Susanj, Rupak V Swaminathan, Grant P Strimel, Jing Liu, and Shinji Watanabe.

**Figure 2.** Audio database construction. Raw audio is processed by task-specific modules to construct a structured, time-aligned audio database D(a) consisting of modality-specific metadata streams
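
The Figure 2 caption can be read as describing one sorted stream of timestamped entries per modality. The sketch below shows roughly what that might look like in Python. The module outputs are stubbed as literal lists, and the stream names, the `(start, end, label)` tuple layout, and the `labels_during` helper are assumptions made for illustration; the paper's actual schema is not given in the text above.

```python
Segment = tuple[float, float, str]  # (start_s, end_s, label); assumed layout


def build_database(module_outputs: dict[str, list[Segment]]) -> dict[str, list[Segment]]:
    """Keep one stream per modality, each sorted by start time."""
    return {modality: sorted(segments) for modality, segments in module_outputs.items()}


def labels_during(stream: list[Segment], lo: float, hi: float) -> list[str]:
    """Labels of all segments in `stream` that overlap the interval [lo, hi)."""
    return [label for start, end, label in stream if start < hi and end > lo]


# Stand-ins for task-specific modules (e.g. ASR, diarization, sound event detection).
module_outputs = {
    "transcript": [(12.0, 20.0, "second utterance"), (0.0, 12.0, "first utterance")],
    "speaker": [(0.0, 12.0, "spk_A"), (12.0, 20.0, "spk_B")],
    "sound_event": [(15.0, 16.0, "phone ring")],
}
database = build_database(module_outputs)

# Shared timestamps allow cross-modality lookups:
# who was speaking during each detected sound event?
for start, end, event in database["sound_event"]:
    print(event, "->", labels_during(database["speaker"], start, end))
```

Keeping the streams separate but on a common time axis is what lets a query combine cues, for example matching a sound event to the speaker active at that moment.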

**Figure 3.** Relative performance of base tasks under long-form audio inputs. Results are for QA-1, MCQA, Summ,

Original abstract

Long-form audio understanding poses significant challenges for large audio language models (LALMs) due to the extreme length of audio sequences and the need to reason over heterogeneous acoustic cues distributed over time, such as speech content, speaker identity, emotion, and sound events. To address these challenges, we propose PlanRAG-Audio, a planning-based retrieval-augmented generation framework for scalable long-form audio understanding. Rather than having audio LALMs process entire recordings directly, PlanRAG-Audio explicitly plans which modalities and temporal spans are required for a given query, and retrieves only query-relevant information from a structured text and audio database. This retrieval planning enables effective reasoning over complex, cross-domain audio queries while substantially reducing the input length passed to the large language models. Experiments across a wide range of speech/audio retrieval demonstrate that PlanRAG-Audio improves reasoning accuracy and stabilizes performance as audio duration increases by decoupling inference cost from raw audio length.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PlanRAG-Audio adds a planning layer to RAG for long audio but the gains hinge on unverified planner recall of distributed cues.

The main takeaway is that this paper puts forward PlanRAG-Audio, a framework that plans which audio modalities and time windows to retrieve from a structured database before passing anything to the language model. The goal is to keep compute from growing with raw audio length while still handling queries that mix speech content, speaker identity, emotion, and sound events spread across long recordings. That framing directly targets a real bottleneck in current large audio language models. The approach is new in its explicit separation of planning from retrieval and reasoning, and it is presented as a procedural method rather than another end-to-end fine-tune. If the planning step works reliably, the reported stabilization of accuracy with increasing duration would be a practical win for applications like meeting analysis.

The paper earns credit for naming the heterogeneous-cue problem clearly and for showing that retrieval can be scoped to modalities and spans instead of uniform chunking. The experiments are described as covering a range of retrieval tasks and showing accuracy gains, which at least gives a starting point for comparison.

The soft spot is exactly the one the stress test flags: everything rests on the planner correctly surfacing all query-relevant segments. The abstract gives no implementation details for the planner, no recall metrics against ground-truth relevant spans, and no ablation that measures what happens when a critical cue is missed. Without those checks, it is difficult to know whether the stability comes from better context selection or from other factors in the setup. Minor gaps include the lack of baseline descriptions in the summary, but those can be filled in review.

This paper is for researchers working on scalable audio understanding and retrieval-augmented methods. A reader who needs concrete ideas for reducing input length on long recordings will find usable pieces even if the planner evaluation needs tightening. It is coherent enough on its own terms to merit a serious referee who can ask for planner diagnostics and fuller experimental controls.

Referee Report

2 major / 2 minor

Summary. The paper proposes PlanRAG-Audio, a planning-based retrieval-augmented generation framework for long-form audio understanding. Instead of feeding entire recordings to large audio language models (LALMs), the method explicitly plans which modalities and temporal spans are needed for a query, retrieves only the relevant information from a structured text and audio database, and performs reasoning on the reduced context. The central claim is that this improves reasoning accuracy and stabilizes performance as audio duration grows by decoupling inference cost from raw audio length.

Significance. If the planning step reliably recovers all query-relevant acoustic cues (speech content, speaker identity, emotion, sound events) distributed over time, the framework would provide a practical route to scalable long-form audio reasoning in which the context passed to the model, and hence inference cost, no longer grows with recording length. The approach is procedurally novel in its explicit modality-and-span planning layer, but its impact hinges on empirical verification of that layer.

major comments (2)

[Abstract] Abstract: the claim that experiments 'demonstrate accuracy gains and stabilized performance' is unsupported by any reported baselines, datasets, metrics, controls, or statistical tests. Without these details it is impossible to determine whether the observed improvements are attributable to the planning-retrieval mechanism or to other factors.
[Method / Experiments] Method and Experiments sections: the headline accuracy and stability improvements rest on the planner correctly identifying all query-relevant modalities and temporal spans. The manuscript provides no implementation details for the planning module, no error analysis of planner omissions, and no ablation that measures recall of ground-truth relevant segments. If even one critical cue (e.g., a brief overlapping sound event) is missed, the subsequent retrieval supplies incomplete context that the LALM reasoning step cannot recover.
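
One way the requested recall ablation could be computed is sketched below in Python. This is an assumption about how such a metric might be defined, not something the manuscript reports: the duration-weighted definition, the function names, and the example spans are all hypothetical.

```python
Span = tuple[float, float]  # (start_s, end_s)


def merge_spans(spans: list[Span]) -> list[Span]:
    """Merge overlapping or touching spans so no second is counted twice."""
    merged: list[Span] = []
    for lo, hi in sorted(spans):
        if merged and lo <= merged[-1][1]:
            merged[-1] = (merged[-1][0], max(merged[-1][1], hi))
        else:
            merged.append((lo, hi))
    return merged


def overlap_seconds(target: Span, planned: list[Span]) -> float:
    """Seconds of `target` covered by `planned` (assumed already merged)."""
    lo, hi = target
    return sum(max(0.0, min(hi, p_hi) - max(lo, p_lo)) for p_lo, p_hi in planned)


def span_recall(ground_truth: list[Span], planned: list[Span]) -> float:
    """Duration-weighted recall of ground-truth relevant time."""
    truth, plan = merge_spans(ground_truth), merge_spans(planned)
    total = sum(hi - lo for lo, hi in truth)
    if total == 0:
        return 1.0  # nothing to recover
    return sum(overlap_seconds(t, plan) for t in truth) / total


def missed_spans(ground_truth: list[Span], planned: list[Span]) -> list[Span]:
    """Ground-truth spans that the plan does not touch at all."""
    plan = merge_spans(planned)
    return [t for t in merge_spans(ground_truth) if overlap_seconds(t, plan) == 0.0]


ground_truth = [(10.0, 20.0), (3540.0, 3542.0)]  # made-up annotation
planned = [(0.0, 30.0)]                           # made-up planner output
print(round(span_recall(ground_truth, planned), 3))
print(missed_spans(ground_truth, planned))
```

In this made-up case the duration-weighted recall is still about 0.83 even though the two-second event is missed entirely, which is why a per-span miss list is worth reporting alongside the aggregate: brief cues barely move a duration-weighted score.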

minor comments (2)

[Method] Clarify how the structured database is constructed and indexed, including the exact representation of acoustic cues and the retrieval mechanism (e.g., embedding model, similarity metric).
[Experiments] Figure captions and experimental tables should explicitly state the audio durations tested and the number of queries per duration bin to allow readers to assess the 'stabilization with duration' claim.
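
The per-duration reporting asked for in the last comment could be tabulated along these lines. This Python sketch assumes results are available as `(audio_duration_minutes, is_correct)` pairs; the bin edges and the sample results are invented for illustration and do not come from the paper.

```python
from bisect import bisect_right
from collections import defaultdict


def bin_label(duration_min: float, edges: list[float]) -> str:
    """Map a duration to the label of its half-open bin [edge_i, edge_{i+1})."""
    idx = bisect_right(edges, duration_min)
    if idx == 0:
        return f"<{edges[0]:g} min"
    if idx == len(edges):
        return f">={edges[-1]:g} min"
    return f"{edges[idx - 1]:g}-{edges[idx]:g} min"


def accuracy_by_duration(
    results: list[tuple[float, bool]], edges: list[float]
) -> dict[str, tuple[int, float]]:
    """Return {bin label: (number of queries, accuracy)}."""
    counts = defaultdict(lambda: [0, 0])  # label -> [n_correct, n_queries]
    for duration_min, correct in results:
        entry = counts[bin_label(duration_min, edges)]
        entry[0] += int(correct)
        entry[1] += 1
    return {label: (n, n_correct / n) for label, (n_correct, n) in counts.items()}


# Invented results, for illustration only.
results = [(5, True), (8, True), (25, True), (25, False),
           (45, True), (90, False), (120, True)]
edges = [10, 30, 60]  # hypothetical bin edges in minutes

for label, (n, acc) in accuracy_by_duration(results, edges).items():
    print(f"{label:>10}: n={n}, accuracy={acc:.2f}")
```

Reporting the query count next to each accuracy makes it visible when a bin is too small to support a stability claim.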

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review of our manuscript on PlanRAG-Audio. We have carefully considered the major comments and provide point-by-point responses below. Where the comments identify areas needing clarification or additional analysis, we outline specific revisions that will be incorporated into the next version of the paper.

Referee: [Abstract] Abstract: the claim that experiments 'demonstrate accuracy gains and stabilized performance' is unsupported by any reported baselines, datasets, metrics, controls, or statistical tests. Without these details it is impossible to determine whether the observed improvements are attributable to the planning-retrieval mechanism or to other factors.

Authors: We agree that the abstract, due to its length constraints, does not explicitly list the experimental details. In the revised manuscript we will update the abstract to briefly reference the evaluation datasets, the primary metrics (accuracy and stability measures), the baseline methods compared against, and note that improvements are statistically significant. The full experimental protocol, controls, and results remain in the Experiments section; the abstract revision will better anchor the high-level claim to the reported evidence. (Revision: yes)
Referee: [Method / Experiments] Method and Experiments sections: the headline accuracy and stability improvements rest on the planner correctly identifying all query-relevant modalities and temporal spans. The manuscript provides no implementation details for the planning module, no error analysis of planner omissions, and no ablation that measures recall of ground-truth relevant segments. If even one critical cue (e.g., a brief overlapping sound event) is missed, the subsequent retrieval supplies incomplete context that the LALM reasoning step cannot recover.

Authors: This is a fair and important observation. The effectiveness of PlanRAG-Audio indeed depends on the planner's ability to surface all necessary cues. In the revised version we will expand the Method section with concrete implementation details of the planning module (including the prompting strategy and decision criteria for modality and span selection). We will also add a dedicated error analysis of planner omissions together with an ablation that reports recall of ground-truth relevant segments. These additions will directly quantify the planner's reliability and address the risk of missing critical acoustic cues. (Revision: yes)

Circularity Check

0 steps flagged

No circularity: procedural framework with experimental claims

The paper introduces PlanRAG-Audio as an explicit planning-plus-retrieval procedure that selects modalities and temporal spans before querying a structured database, then feeds the results to an LALM. No equations, fitted parameters, or self-referential definitions appear in the provided text. Performance improvements are asserted via experiments rather than derived from prior fits or self-citations. The central claim therefore remains independent of its own inputs and does not reduce by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The framework rests on the existence of a well-structured database and on the planning module's ability to select relevant information; no free parameters or invented physical entities are mentioned.

axioms (1)

domain assumption: A structured text and audio database exists that stores query-relevant information extractable without loss of critical cues.
The retrieval step depends on this database being available and complete for the queries tested.

invented entities (1)

PlanRAG-Audio planning module (no independent evidence)
purpose: To decide which modalities and temporal spans are required for a given query before retrieval.
New component introduced to orchestrate retrieval for long audio.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · relation: unclear
Paper passage: PlanRAG-Audio explicitly plans which modalities and temporal spans are required for a given query, and retrieves only query-relevant information from a structured text and audio database.

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · relation: unclear
Paper passage: the system first plans which modalities (e.g., spoken content, speaker information, emotional cues, and non-verbal acoustic events), temporal spans, and constraints are required

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.


work page 2024

[20] [20]

Yifan Peng and Muhammad Shakeel and Yui Sudo and William Chen and Jinchuan Tian and Chyi-Jiunn Lin and Shinji Watanabe , year =

work page

[21] [21]

HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units , year=

Hsu, Wei-Ning and Bolte, Benjamin and Tsai, Yao-Hung Hubert and Lakhotia, Kushal and Salakhutdinov, Ruslan and Mohamed, Abdelrahman , journal=. HuBERT: Self-Supervised Speech Representation Learning by Masked Prediction of Hidden Units , year=

work page

[22] [22]

WavLM: Large-Scale Self-Supervised Pre-Training for Full Stack Speech Processing , year=

Chen, Sanyuan and Wang, Chengyi and Chen, Zhengyang and Wu, Yu and Liu, Shujie and Chen, Zhuo and Li, Jinyu and Kanda, Naoyuki and Yoshioka, Takuya and Xiao, Xiong and Wu, Jian and Zhou, Long and Ren, Shuo and Qian, Yanmin and Qian, Yao and Wu, Jian and Zeng, Michael and Yu, Xiangzhan and Wei, Furu , journal=. WavLM: Large-Scale Self-Supervised Pre-Traini...

work page

[23] [23]

2023 , publisher =

Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya , title =. 2023 , publisher =

work page 2023

[24] [24]

2023 , series =

Chen, Sanyuan and Wu, Yu and Wang, Chengyi and Liu, Shujie and Tompkins, Daniel and Chen, Zhuo and Che, Wanxiang and Yu, Xiangzhan and Wei, Furu , booktitle =. 2023 , series =

work page 2023

[25] [25]

ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year=

Clap learning audio concepts from natural language supervision , author=. ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP) , year=

work page 2023

[26] [26]

Lawrence and Girshick, Ross , booktitle=

Johnson, Justin and Hariharan, Bharath and van der Maaten, Laurens and Fei-Fei, Li and Zitnick, C. Lawrence and Girshick, Ross , booktitle=. CLEVR: A Diagnostic Dataset for Compositional Language and Elementary Visual Reasoning , year=

work page

[27] [27]

CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning , year =

Jerome Abdelnour and Giampiero Salvi and Jean Rouat , publisher =. CLEAR: A Dataset for Compositional Language and Elementary Acoustic Reasoning , year =

work page

[28] [28]

Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering , year=

Lipping, Samuel and Sudarsanam, Parthasaarathy and Drossos, Konstantinos and Virtanen, Tuomas , booktitle=. Clotho-AQA: A Crowdsourced Dataset for Audio Question Answering , year=

work page

[29] [29]

Audiopedia: Audio QA with Knowledge , year=

Penamakuri, Abhirama Subramanyam and Chhatre, Kiran and Jain, Akshat , booktitle=. Audiopedia: Audio QA with Knowledge , year=

work page

[30] [30]

AudioBERT: Audio Knowledge Augmented Language Model , year=

Ok, Hyunjong and Yoo, Suho and Lee, Jaeho , booktitle=. AudioBERT: Audio Knowledge Augmented Language Model , year=

work page

[31] [31]

2018 , organization=

Sanabria, Ramon and Caglayan, Ozan and Palaskar, Shruti and Elliott, Desmond and Barrault, Lo\"ic and Specia, Lucia and Metze, Florian , booktitle =. 2018 , organization=

work page 2018

[32] [32]

Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , year =

MeetingQA: Extractive Question-Answering on Meeting Transcripts , author =. Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL) , year =

work page

[33] [33]

and Auzanne, Cedric G

Garofolo, John S. and Auzanne, Cedric G. P. and Voorhees, Ellen M. , title =. 2000 , booktitle =

work page 2000

[34] [34]

2025 , eprint=

Stream RAG: Instant and Accurate Spoken Dialogue Systems with Streaming Tool Usage , author=. 2025 , eprint=

work page 2025

[35] [35]

, journal=

Kong, Qiuqiang and Cao, Yin and Iqbal, Turab and Wang, Yuxuan and Wang, Wenwu and Plumbley, Mark D. , journal=. PANNs: Large-Scale Pretrained Audio Neural Networks for Audio Pattern Recognition , year=

work page

[36] [36]

SpeechDPR: End-To-End Spoken Passage Retrieval For Open-Domain Spoken Question Answering , year=

Lin, Chyi-Jiunn and Lin, Guan-Ting and Chuang, Yung-Sung and Wu, Wei-Lun and Li, Shang-Wen and Mohamed, Abdelrahman and Lee, Hung-Yi and Lee, Lin-Shan , booktitle=. SpeechDPR: End-To-End Spoken Passage Retrieval For Open-Domain Spoken Question Answering , year=

work page

[37] [37]

Speech Retrieval-Augmented Generation without Automatic Speech Recognition , year=

Min, Do June and Mundnich, Karel and Lapastora, Andy and Soltanmohammadi, Erfan and Ronanki, Srikanth and Han, Kyu , booktitle=. Speech Retrieval-Augmented Generation without Automatic Speech Recognition , year=

work page

[38] [38]

2025 , eprint=

AURA: Agent for Understanding, Reasoning, and Automated Tool Use in Voice-Driven Tasks , author=. 2025 , eprint=

work page 2025

[39] [39]

W iki C hat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on W ikipedia

Semnani, Sina and Yao, Violet and Zhang, Heidi and Lam, Monica. W iki C hat: Stopping the Hallucination of Large Language Model Chatbots by Few-Shot Grounding on W ikipedia. Findings of the Association for Computational Linguistics: EMNLP 2023. 2023

work page 2023

[40] [40]

Akari Asai and Zeqiu Wu and Yizhong Wang and Avirup Sil and Hannaneh Hajishirzi , booktitle=. Self-

work page

[41] [41]

RA - ISF : Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback

Liu, Yanming and Peng, Xinyue and Zhang, Xuhong and Liu, Weihao and Yin, Jianwei and Cao, Jiannan and Du, Tianyu. RA - ISF : Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback. Findings of the Association for Computational Linguistics: ACL 2024. 2024

work page 2024

[42] [42]

PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers

Myeonghwa Lee and Seonho An and Kim, \ Min Soo\. PlanRAG: A Plan-then-Retrieval Augmented Generation for Generative Large Language Models as Decision Makers. 2024

work page 2024

[43] [43]

Prakhar Verma and Sukruta Prakash Midigeshi and Gaurav Sinha and Arno Solin and Nagarajan Natarajan and Amit Sharma , booktitle=. Plan\

work page

[44] [44]

CadenceRAG: Context-Aware and Dependency-Enhanced Retrieval Augmented Generation for Holistic Video Understanding , year=

Liu, Heng and Jiang, Siru and Duan, Fangyun and Lyu, Yongzhe and Wang, Xiusong and Ge, Hanlin and Liang, Chao , booktitle=. CadenceRAG: Context-Aware and Dependency-Enhanced Retrieval Augmented Generation for Holistic Video Understanding , year=

work page

[45] [45]

Librispeech: An ASR corpus based on public domain audio books , year=

Panayotov, Vassil and Chen, Guoguo and Povey, Daniel and Khudanpur, Sanjeev , booktitle=. Librispeech: An ASR corpus based on public domain audio books , year=

work page

[46] [46]

LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models , year=

Zhao, Zihan and Jiang, Yiyang and Liu, Heyang and Wang, Yu and Wang, Yanfeng , journal=. LibriSQA: A Novel Dataset and Framework for Spoken Question Answering with Large Language Models , year=

work page

[47] [47]

2025 , eprint=

The MSP-Podcast Corpus , author=. 2025 , eprint=

work page 2025

[48] [48]

V ox P opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation

Wang, Changhan and Riviere, Morgane and Lee, Ann and Wu, Anne and Talnikar, Chaitanya and Haziza, Daniel and Williamson, Mary and Pino, Juan and Dupoux, Emmanuel. V ox P opuli: A Large-Scale Multilingual Speech Corpus for Representation Learning, Semi-Supervised Learning and Interpretation. Proceedings of the 59th Annual Meeting of the Association for Com...

work page 2021

[49] [49]

and Ellis, Daniel P

Gemmeke, Jort F. and Ellis, Daniel P. W. and Freedman, Dylan and Jansen, Aren and Lawrence, Wade and Moore, R. Channing and Plakal, Manoj and Ritter, Marvin , booktitle=. Audio Set: An ontology and human-labeled dataset for audio events , year=

work page

[50] [50]

Retrieval-augmented generation for knowledge-intensive NLP tasks , year =

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and K\". Retrieval-augmented generation for knowledge-intensive NLP tasks , year =. Proceedings of the 34th International Conference on Neural Information Processing Systems , series =

work page

[51] [51]

McCowan and J

I. McCowan and J. Carletta and W. Kraaij and S. Ashby and S. Bourban and M. Flynn and M. Guillemot and T. Hain and J. Kadlec and V. Karaiskos and M. Kronenthal and G. Lathoud and M. Lincoln and A. Lisowska and W. Post and Dennis Reidsma and P. Wellner. The AMI meeting corpus. Proceedings of Measuring Behavior 2005, 5th International Conference on Methods ...

work page 2005

[52] [52]

2024 , booktitle =

Odyssey 2024 - Speech Emotion Recognition Challenge: Dataset, Baseline Framework, and Results , author =. 2024 , booktitle =

work page 2024

[53] [53]

Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya , title =

work page

[54] [54]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[55] [55]

2024 , eprint=

Text Embeddings by Weakly-Supervised Contrastive Pre-training , author=. 2024 , eprint=

work page 2024

[56] [56]

E mo2 V ec: Learning Generalized Emotion Representation by Multi-task Training

Xu, Peng and Madotto, Andrea and Wu, Chien-Sheng and Park, Ji Ho and Fung, Pascale. E mo2 V ec: Learning Generalized Emotion Representation by Multi-task Training. Proceedings of the 9th Workshop on Computational Approaches to Subjectivity, Sentiment and Social Media Analysis. 2018

work page 2018

[57] [57]

Metrics for Polyphonic Sound Event Detection

Mesaros, Annamaria and Heittola, Toni and Virtanen, Tuomas. Metrics for Polyphonic Sound Event Detection. Applied Sciences. 2016

work page 2016

[58] [58]

Alexis Plaquet and Hervé Bredin , title=

work page

[59] [59]

OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder , year=

Bharadwaj, Shikhar and Cornell, Samuele and Choi, Kwanghee and Fukayama, Satoru and Shim, Hye-Jin and Deshmukh, Soham and Watanabe, Shinji , booktitle=. OpenBEATs: A Fully Open-Source General-Purpose Audio Encoder , year=

work page

[60] [60]

2025 , eprint=

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities , author=. 2025 , eprint=

work page 2025

[61] [61]

2025 , eprint=

Voxtral , author=. 2025 , eprint=

work page 2025