pith. machine review for the scientific record.

arxiv: 2512.20136 · v3 · submitted 2025-12-23 · 💻 cs.CL · cs.AI

M³KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

Pith reviewed 2026-05-16 20:22 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords multimodal knowledge graph · retrieval-augmented generation · multimodal large language models · multi-hop reasoning · audio-visual knowledge · knowledge pruning · grounded retrieval

The pith

M³KG-RAG builds multi-hop multimodal knowledge graphs and prunes them with GRASP to deliver more accurate audio-visual retrieval for MLLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that existing multimodal RAG systems suffer from sparse MMKGs and unfiltered similarity retrieval, so it introduces a lightweight multi-agent pipeline to create connected multi-hop graphs plus a GRASP module that grounds entities in the query and prunes off-topic or redundant knowledge. Experiments on diverse benchmarks indicate this combination raises reasoning depth and answer faithfulness in multimodal large language models. A sympathetic reader would care because many real queries require chaining facts across vision and sound, and current models still hallucinate or drift when external knowledge is noisy or disconnected.

Core claim

M³KG-RAG retrieves query-aligned audio-visual knowledge from MMKGs by first building multi-hop context-enriched triplets via a lightweight multi-agent pipeline and then applying GRASP to ground entities, evaluate relevance, and prune redundant context, resulting in enhanced multimodal reasoning and answer faithfulness in MLLMs.

What carries the argument

M³KG, the multi-hop multimodal knowledge graph of context-enriched triplets, together with the GRASP module that performs grounded entity matching, relevance scoring, and selective pruning.
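
To make the pruning step concrete, here is a minimal Python sketch of a GRASP-style filter. It is not the authors' implementation: the `Triplet` class, the `grasp_prune` function, both threshold values, and every score below are hypothetical stand-ins for the outputs of the grounding and relevance models the paper describes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triplet:
    head: str
    relation: str
    tail: str


def grasp_prune(triplets, presence, relevance, presence_min=0.5, relevance_min=0.5):
    """Keep triplets that are grounded in the query, relevant, and not duplicated.

    presence:  entity -> score in [0, 1] from a hypothetical grounding model
    relevance: Triplet -> score in [0, 1] from a hypothetical relevance judge
    Both thresholds are illustrative defaults, not values from the paper.
    """
    kept, seen = [], set()
    for t in triplets:
        # A triplet counts as grounded if either endpoint is detected in the input.
        grounded = max(presence.get(t.head, 0.0), presence.get(t.tail, 0.0)) >= presence_min
        relevant = relevance.get(t, 0.0) >= relevance_min
        if grounded and relevant and t not in seen:
            kept.append(t)
            seen.add(t)
    return kept


# Toy inputs, invented for illustration only.
triplets = [
    Triplet("violin", "is_a", "string instrument"),
    Triplet("violin", "played_with", "bow"),
    Triplet("violin", "is_a", "string instrument"),  # redundant copy
    Triplet("trumpet", "is_a", "brass instrument"),  # entity absent from the clip
]
presence = {"violin": 0.9, "bow": 0.7, "trumpet": 0.1}
relevance = {triplets[0]: 0.8, triplets[1]: 0.3, triplets[3]: 0.9}

kept = grasp_prune(triplets, presence, relevance)
print(kept)
```

One simplification to note: requiring a grounded endpoint on every triplet would discard second-hop facts whose entities never appear in the audio or video, so a faithful implementation would presumably ground the seed entities and then judge the attached multi-hop context on relevance alone.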

If this is right

  • MLLMs gain improved depth in multi-hop reasoning across audio and visual modalities.
  • Retrieval becomes more precise by filtering off-topic and redundant knowledge.
  • Answer faithfulness increases as only essential supporting knowledge is retained for generation.
  • Existing MMKGs with limited coverage can be extended through the multi-agent construction method.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same construction-plus-pruning pattern could apply to text-only or video-centric tasks where long reasoning chains matter.
  • The approach suggests modality-specific retrieval paths can outperform a single shared embedding space on queries that cross modalities.
  • Real-time adaptation might be tested by updating only affected subgraphs instead of rebuilding the full MMKG.

Load-bearing premise

The multi-agent pipeline produces an MMKG with sufficient connectivity and modality coverage, and GRASP correctly identifies and keeps only answer-supporting knowledge without adding errors.

What would settle it

If the constructed MMKG shows low multi-hop connectivity or if GRASP pruning retains off-topic facts, performance on the same benchmarks would show no gain or a drop versus standard similarity-based RAG baselines.
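
The connectivity half of this test is cheap to run on any triplet dump. The sketch below is one plausible way to do it, assuming the graph is available as plain `(head, relation, tail)` tuples and treating edges as undirected; the function names and the toy triplets are hypothetical, and the paper does not say which statistics it would use.

```python
from collections import defaultdict


def build_adjacency(triplets):
    """Undirected adjacency sets from (head, relation, tail) tuples."""
    adj = defaultdict(set)
    for head, _relation, tail in triplets:
        adj[head].add(tail)
        adj[tail].add(head)
    return adj


def reachable_within(adj, start, max_hops):
    """Entities reachable from `start` in at most `max_hops` edges (excluding start)."""
    seen = {start}
    frontier = [start]
    for _ in range(max_hops):
        next_frontier = []
        for node in frontier:
            for neighbor in adj[node]:
                if neighbor not in seen:
                    seen.add(neighbor)
                    next_frontier.append(neighbor)
        frontier = next_frontier
    return seen - {start}


def connectivity_stats(triplets, max_hops=2):
    adj = build_adjacency(triplets)
    nodes = list(adj)
    if not nodes:
        return {"avg_degree": 0.0, "component_sizes": [], "mean_reach": 0.0}
    avg_degree = sum(len(adj[n]) for n in nodes) / len(nodes)
    # Connected components: expand each unassigned node without a hop limit.
    unassigned, sizes = set(nodes), []
    while unassigned:
        start = next(iter(unassigned))
        component = reachable_within(adj, start, len(nodes)) | {start}
        sizes.append(len(component))
        unassigned -= component
    # Mean number of entities within max_hops: a rough proxy for multi-hop reach.
    mean_reach = sum(len(reachable_within(adj, n, max_hops)) for n in nodes) / len(nodes)
    return {
        "avg_degree": avg_degree,
        "component_sizes": sorted(sizes, reverse=True),
        "mean_reach": mean_reach,
    }


# Toy graph, invented for illustration only.
toy_triplets = [
    ("violin", "played_with", "bow"),
    ("bow", "made_of", "horsehair"),
    ("drum", "is_a", "percussion"),
]
stats = connectivity_stats(toy_triplets)
print(stats)
```

A graph dominated by components of size two would have a `mean_reach` close to its `avg_degree`, which is the signature of single-hop facts that the paper says it wants to move beyond.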

Figures

Figures reproduced from arXiv: 2512.20136 by Hogun Park, Hyeongcheol Park, Hyeonsoo Im, Jaewon Mun, JeungSub Lee, Jiyoung Seo, Sangpil Kim, Sung June Kim, Wonmin Byeon.

Figure 1: Illustration of multimodal RAG scenarios. Incorrect answers are shown in red, correct answers in blue. (a) Shared embedding search misaligns with the audio-visual query. (b) Noisy, single-hop facts provide little answer support. (c) M³KG-RAG uses modality-wise multi-hop retrieval for answer-supporting context.
Figure 2: An overview of the M³KG construction pipeline. The pipeline consists of three steps: (i) Context-Enriched Triplet Extraction, which rewrites multimodal captions into knowledge-intensive text and extracts entity–relation triplets; (ii) Knowledge Grounding, linking normalized entities to open knowledge bases to obtain candidate descriptions; (iii) Context-Aware Description Refinement, selecting and rewriting…
Figure 3: Overview of the Multimodal RAG framework. The framework consists of two components: (a) Modality-Wise Retrieval, which retrieves multi-hop triplets aligned with the query from the M³KG; and (b) GRASP (Grounded Retrieval And Selective Pruning), which uses visual and/or audio grounding models to check entity presence and prunes triplets that are off-topic or non-informative. The resulting subgraph is then pr…
Figure 4: Qualitative results on various Question Answering tasks. Incorrect and insufficient model responses are highlighted in red, while correct and sufficient responses are highlighted in blue.
Figure 5: Sensitivity analysis. M.J. score on VALOR versus modality-wise distance threshold τ (top) and GRASP presence threshold η_av (bottom). The accompanying text reports a sensitivity study on the VALOR benchmark that varies τ ∈ {1.5, 3.0, 4.5, 6.0, 7.5} under the retrieval condition d(q_m, x_m) ≤ τ while holding η_av fixed.
Figure 6: Qualitative Comparison on Audio QA. Comparing VAT-KG and M³KG-RAG with Qwen2.5-Omni, including retrieved knowledge and win-rate judge preferences.
Figure 7: Qualitative Comparison on Video QA. Comparing VAT-KG and M³KG-RAG with Qwen2.5-Omni, including retrieved knowledge and win-rate judge preferences.
Figure 8: Qualitative Comparison on Audio-Visual QA. Comparing VAT-KG and M³KG-RAG with Qwen2.5-Omni, including retrieved knowledge and win-rate judge preferences.
Original abstract

Retrieval-Augmented Generation (RAG) has recently been extended to multimodal settings, connecting multimodal large language models (MLLMs) with vast corpora of external knowledge such as multimodal knowledge graphs (MMKGs). Despite their recent success, multimodal RAG in the audio-visual domain remains challenging due to 1) limited modality coverage and multi-hop connectivity of existing MMKGs, and 2) retrieval based solely on similarity in a shared multimodal embedding space, which fails to filter out off-topic or redundant knowledge. To address these limitations, we propose M$^3$KG-RAG, a Multi-hop Multimodal Knowledge Graph-enhanced RAG that retrieves query-aligned audio-visual knowledge from MMKGs, improving reasoning depth and answer faithfulness in MLLMs. Specifically, we devise a lightweight multi-agent pipeline to construct multi-hop MMKG (M$^3$KG), which contains context-enriched triplets of multimodal entities, enabling modality-wise retrieval based on input queries. Furthermore, we introduce GRASP (Grounded Retrieval And Selective Pruning), which ensures precise entity grounding to the query, evaluates answer-supporting relevance, and prunes redundant context to retain only knowledge essential for response generation. Extensive experiments across diverse multimodal benchmarks demonstrate that M$^3$KG-RAG significantly enhances MLLMs' multimodal reasoning and grounding over existing approaches. Project website: https://kuai-lab.github.io/cvpr2026m3kgrag/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims to introduce M³KG-RAG, a multimodal RAG framework that constructs a multi-hop multimodal knowledge graph (M³KG) via a lightweight multi-agent pipeline and applies GRASP (Grounded Retrieval And Selective Pruning) for query-aligned entity grounding and relevance-based pruning of redundant knowledge, thereby improving MLLMs' audio-visual reasoning depth and answer faithfulness over standard similarity-based retrieval from existing MMKGs. Extensive experiments on diverse multimodal benchmarks are said to demonstrate significant gains.

Significance. If the core pipeline and pruning mechanism are shown to deliver the claimed coverage and precision, the work would address two persistent bottlenecks in multimodal RAG—limited MMKG connectivity and noisy retrieval—potentially enabling more reliable multi-hop reasoning in MLLMs. The engineering focus on a practical, modality-aware construction and pruning pipeline is a concrete contribution that could be adopted or extended by the community.

major comments (3)
  1. [Abstract / §3] Abstract and §3 (MMKG construction): The central claim requires that the multi-agent pipeline produces an M³KG with sufficient modality coverage and multi-hop connectivity for the target queries, yet no quantitative metrics (triplet counts per modality, average path length, connectivity statistics, or coverage against gold knowledge) are reported to validate this assumption.
  2. [Abstract / §4] Abstract and §4 (GRASP): GRASP is presented as performing precise entity grounding and answer-supporting relevance evaluation while pruning redundant context without introducing errors, but the manuscript provides no ablation on the relevance threshold, no precision/recall against human judgments, and no error-injection study to confirm that pruning improves rather than degrades downstream performance.
  3. [Experiments] Experiments section: The assertion of 'significant gains' over existing approaches is unsupported by reported baseline details, error bars, statistical significance tests, or component ablations (e.g., M³KG vs. GRASP vs. full pipeline), making it impossible to attribute observed improvements to the proposed mechanisms rather than dataset artifacts or base MLLM capabilities.
minor comments (2)
  1. [Abstract] Notation: The acronym M³KG is used before its expansion is fully clarified in the abstract; a parenthetical definition on first use would improve readability.
  2. [Abstract] The project website is referenced but no link to released code, constructed graphs, or evaluation scripts is provided in the manuscript, which would aid reproducibility.
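
The precision/recall check requested in major comment 2 could be as simple as the following sketch. It assumes that, for a sample of queries, human annotators have marked which retrieved triplets actually support the answer; the function name, the triplet identifiers, and the labels are all hypothetical and do not come from the paper.

```python
def pruning_precision_recall(kept_ids, supporting_ids):
    """Score a pruning decision against human 'answer-supporting' labels.

    kept_ids:       triplet ids the pruning step retained
    supporting_ids: triplet ids annotators judged as supporting the answer
    """
    kept, gold = set(kept_ids), set(supporting_ids)
    true_positives = len(kept & gold)
    precision = true_positives / len(kept) if kept else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Invented example: three triplets kept, four judged supporting, two in common.
precision, recall = pruning_precision_recall(
    kept_ids=["t1", "t2", "t5"],
    supporting_ids=["t1", "t5", "t7", "t8"],
)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

Low recall here would mean the pruning step throws away useful knowledge, while low precision would mean it fails at its stated job of removing off-topic context; reporting both per benchmark would address the referee's concern directly.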

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. The points raised highlight important areas for strengthening the validation of M³KG construction, GRASP, and experimental rigor. We will incorporate all suggested additions and clarifications in the revised manuscript. We respond to each major comment below.

Point-by-point responses
  1. Referee: [Abstract / §3] Abstract and §3 (MMKG construction): The central claim requires that the multi-agent pipeline produces an M³KG with sufficient modality coverage and multi-hop connectivity for the target queries, yet no quantitative metrics (triplet counts per modality, average path length, connectivity statistics, or coverage against gold knowledge) are reported to validate this assumption.

    Authors: We agree that quantitative metrics are needed to substantiate the claims about M³KG coverage and connectivity. The original manuscript focused on the pipeline description without reporting these statistics. In the revised version, we will add a dedicated subsection in §3 with triplet counts per modality, average path lengths, connectivity statistics (e.g., average degree and component sizes), and coverage metrics against available gold knowledge sources, including direct comparisons to existing MMKGs. revision: yes

  2. Referee: [Abstract / §4] Abstract and §4 (GRASP): GRASP is presented as performing precise entity grounding and answer-supporting relevance evaluation while pruning redundant context without introducing errors, but the manuscript provides no ablation on the relevance threshold, no precision/recall against human judgments, and no error-injection study to confirm that pruning improves rather than degrades downstream performance.

    Authors: We acknowledge the value of these additional validations for GRASP. The current manuscript describes the mechanism but lacks the requested ablations and studies. In the revision, we will add to §4 and the experiments: an ablation varying the relevance threshold, precision/recall results against human annotations on sampled queries, and an error-injection analysis showing that selective pruning improves downstream accuracy without introducing new errors. revision: yes

  3. Referee: [Experiments] Experiments section: The assertion of 'significant gains' over existing approaches is unsupported by reported baseline details, error bars, statistical significance tests, or component ablations (e.g., M³KG vs. GRASP vs. full pipeline), making it impossible to attribute observed improvements to the proposed mechanisms rather than dataset artifacts or base MLLM capabilities.

    Authors: We agree that the experimental reporting requires expansion for proper attribution. The original experiments section presented overall results without the requested details. In the revised manuscript, we will expand the Experiments section to include complete baseline specifications, error bars from multiple runs, statistical significance tests (e.g., paired t-tests with p-values), and component ablations isolating M³KG construction, GRASP, and the full pipeline. revision: yes
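
For readers unfamiliar with the paired test named in the third response, the sketch below computes the paired t statistic using only the Python standard library. The per-run scores are placeholders invented for illustration, not results from the paper, and `paired_t_statistic` is a hypothetical helper name.

```python
import math
import statistics


def paired_t_statistic(baseline, system):
    """Return (t, degrees_of_freedom) for paired scores from the same runs or items."""
    if len(baseline) != len(system) or len(baseline) < 2:
        raise ValueError("need two equally sized samples with at least two pairs")
    diffs = [s - b for b, s in zip(baseline, system)]
    mean_diff = statistics.mean(diffs)
    std_err = statistics.stdev(diffs) / math.sqrt(len(diffs))  # sample std / sqrt(n)
    if std_err == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean_diff / std_err, len(diffs) - 1


# Placeholder scores for five matched runs (NOT taken from the paper).
baseline_scores = [0.60, 0.55, 0.62, 0.58, 0.61]
system_scores = [0.66, 0.59, 0.65, 0.63, 0.64]

t_stat, dof = paired_t_statistic(baseline_scores, system_scores)
print(f"t={t_stat:.3f}, dof={dof}")
```

Turning the statistic into a p-value requires the Student t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` performs the same test and returns the p-value directly.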

Circularity Check

0 steps flagged

No circularity: independent engineering pipeline with external benchmark validation

full rationale

The paper presents M³KG-RAG as a descriptive pipeline (lightweight multi-agent MMKG construction plus GRASP pruning) without equations, derivations, fitted parameters, or uniqueness theorems. No step reduces by construction to its own inputs, self-citations, or renamed empirical patterns. Central claims rest on reported benchmark gains, which are external to the method definition itself. This matches the default expectation of a non-circular engineering contribution.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 2 invented entities

The central claim rests on the assumption that existing MMKGs are limited and that the new construction plus pruning steps add value; no free parameters are named in the abstract, but thresholds for relevance and pruning are implicitly required.

free parameters (1)
  • relevance threshold in GRASP
    A cutoff for deciding whether retrieved knowledge supports the answer must exist to perform selective pruning, though its value is not stated.
axioms (1)
  • domain assumption: Existing multimodal knowledge graphs have limited modality coverage and insufficient multi-hop connectivity.
    Directly stated as the first limitation motivating the work.
invented entities (2)
  • M³KG (no independent evidence)
    purpose: Multi-hop multimodal knowledge graph with context-enriched triplets
    Newly constructed component introduced by the multi-agent pipeline.
  • GRASP (no independent evidence)
    purpose: Grounded retrieval and selective pruning mechanism
    New method for entity grounding, relevance evaluation, and redundancy removal.

pith-pipeline@v0.9.0 · 5593 in / 1346 out tokens · 30112 ms · 2026-05-16T20:22:37.731625+00:00 · methodology


Reference graph

Works this paper leans on

64 extracted references · 64 canonical work pages · 17 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report.arXiv preprint arXiv:2303.08774, 2023. 1, 2

  2. [2]

    Flamingo: a visual language model for few-shot learning

    Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. InAdvances in Neural Information Processing Systems, 2022. 2

  3. [3]

    PaLM 2 Technical Report

    Rohan Anil, Andrew M Dai, Orhan Firat, Melvin John- son, Dmitry Lepikhin, Alexandre Passos, Siamak Shakeri, Emanuel Taropa, Paige Bailey, Zhifeng Chen, et al. Palm 2 technical report.arXiv preprint arXiv:2305.10403, 2023. 2

  4. [4]

    Reflective multi- agent collaboration based on large language models.Ad- vances in Neural Information Processing Systems, 37:138595– 138631, 2024

    Xiaohe Bo, Zeyu Zhang, Quanyu Dai, Xueyang Feng, Lei Wang, Rui Li, Xu Chen, and Ji-Rong Wen. Reflective multi- agent collaboration based on large language models.Ad- vances in Neural Information Processing Systems, 37:138595– 138631, 2024. 3

  5. [5]

    Improving language models by retriev- ing from trillions of tokens

    Sebastian Borgeaud, Arthur Mensch, Jordan Hoffmann, Trevor Cai, Eliza Rutherford, Katie Millican, George Bm Van Den Driessche, Jean-Baptiste Lespiau, Bogdan Damoc, Aidan Clark, et al. Improving language models by retriev- ing from trillions of tokens. InInternational Conference on Machine Learning, pages 2206–2240. PMLR, 2022. 2

  6. [6]

    Activitynet: A large-scale video bench- mark for human activity understanding

    Fabian Caba Heilbron, Victor Escorcia, Bernard Ghanem, and Juan Carlos Niebles. Activitynet: A large-scale video bench- mark for human activity understanding. InProceedings of the ieee conference on computer vision and pattern recognition, pages 961–970, 2015. 5, 2

  7. [7]

    Vggsound: A large-scale audio-visual dataset

    Honglie Chen, Weidi Xie, Andrea Vedaldi, and Andrew Zis- serman. Vggsound: A large-scale audio-visual dataset. In ICASSP 2020-2020 IEEE International Conference on Acous- tics, Speech and Signal Processing (ICASSP), pages 721–725. IEEE, 2020. 3

  8. [8]

    VideoLLaMA 2: Advancing Spatial-Temporal Modeling and Audio Understanding in Video-LLMs

    Zesen Cheng, Sicong Leng, Hang Zhang, Yifei Xin, Xin Li, Guanzheng Chen, Yongxin Zhu, Wenqi Zhang, Ziyang Luo, Deli Zhao, et al. Videollama 2: Advancing spatial- temporal modeling and audio understanding in video-llms. arXiv preprint arXiv:2406.07476, 2024. 2, 5

  9. [9]

    Kimi-Audio Technical Report

    Ding Ding, Zeqian Ju, Yichong Leng, Songxiang Liu, Tong Liu, Zeyu Shang, Kai Shen, Wei Song, Xu Tan, Heyi Tang, et al. Kimi-audio technical report.arXiv preprint arXiv:2504.18425, 2025. 2

  10. [10]

    The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Ab- hishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models.arXiv e-prints, pages arXiv–2407,

  11. [11]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropoli- tansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused sum- marization.arXiv preprint arXiv:2404.16130, 2024. 1, 2, 6, 4

  12. [12]

    A sur- vey on rag meeting llms: Towards retrieval-augmented large language models

    Wenqi Fan, Yujuan Ding, Liangbo Ning, Shijie Wang, Hengyun Li, Dawei Yin, Tat-Seng Chua, and Qing Li. A sur- vey on rag meeting llms: Towards retrieval-augmented large language models. InProceedings of the 30th ACM SIGKDD conference on knowledge discovery and data mining, pages 6491–6501, 2024. 2

  13. [13]

    Precise zero-shot dense retrieval without relevance labels

    Luyu Gao, Xueguang Ma, Jimmy Lin, and Jamie Callan. Precise zero-shot dense retrieval without relevance labels. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 1762–1777, 2023. 2

  14. [14]

    Gemmeke, Daniel P

    Jort F. Gemmeke, Daniel P. W. Ellis, Dylan Freedman, Aren Jansen, Wade Lawrence, R. Channing Moore, Manoj Plakal, and Marvin Ritter. Audio set: An ontology and human-labeled dataset for audio events. In2017 IEEE International Confer- ence on Acoustics, Speech and Signal Processing (ICASSP), pages 776–780, 2017. 2

  15. [15]

    Recap: Retrieval- augmented audio captioning

    Sreyan Ghosh, Sonal Kumar, Chandra Kiran Reddy Evuru, Ramani Duraiswami, and Dinesh Manocha. Recap: Retrieval- augmented audio captioning. InICASSP 2024-2024 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1161–1165. IEEE, 2024. 2

  16. [16]

    The Llama 3 Herd of Models

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Ab- hinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. The llama 3 herd of models.arXiv preprint arXiv:2407.21783,

  17. [17]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning.arXiv preprint arXiv:2501.12948, 2025. 2

  18. [18]

    LightRAG: Simple and fast retrieval-augmented gen- eration

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented gen- eration. InFindings of the Association for Computational Lin- guistics: EMNLP 2025, pages 10746–10761, Suzhou, China,

  19. [19]

    1, 2, 6, 4

    Association for Computational Linguistics. 1, 2, 6, 4

  20. [20]

    From RAG to memory: Non-parametric continual learning for large language models

    Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From RAG to memory: Non-parametric continual learning for large language models. InForty-second International Conference on Machine Learning, 2025. 2

  21. [21]

    Realm: retrieval-augmented language model pre-training

    Kelvin Guu, Kenton Lee, Zora Tung, Panupong Pasupat, and Ming-Wei Chang. Realm: retrieval-augmented language model pre-training. InProceedings of the 37th International Conference on Machine Learning, pages 3929–3938, 2020. 2

  22. [22]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card.arXiv preprint arXiv:2410.21276, 2024. 2, 5

  23. [23]

    Leveraging Passage Retrieval with Generative Models for Open Domain Question Answering

    Gautier Izacard and Edouard Grave. Leveraging passage retrieval with generative models for open domain question answering.arXiv preprint arXiv:2007.01282, 2020. 2

  24. [24]

    VideoRAG: Retrieval-augmented generation over video corpus

    Soyeong Jeong, Kangsan Kim, Jinheon Baek, and Sung Ju Hwang. VideoRAG: Retrieval-augmented generation over video corpus. InFindings of the Association for Computa- tional Linguistics: ACL 2025, pages 21278–21298, Vienna, Austria, 2025. Association for Computational Linguistics. 2

  25. [25]

    Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2019

    Jeff Johnson, Matthijs Douze, and Hervé Jégou. Billion-scale similarity search with gpus.IEEE Transactions on Big Data, 7(3):535–547, 2019. 4 9

  26. [26]

    Audiocaps: Generating captions for audios in the wild

    Chris Dongjoo Kim, Byeongchang Kim, Hyunmin Lee, and Gunhee Kim. Audiocaps: Generating captions for audios in the wild. InProceedings of the 2019 Conference of the North American Chapter of the Association for Computational Lin- guistics: Human Language Technologies, Volume 1 (Long and Short Papers), pages 119–132, 2019. 3, 5, 2

  27. [27]

    Vista: Visual-textual knowledge graph representation learning

    Jaejun Lee, Chanyoung Chung, Hochang Lee, Sungho Jo, and Joyce Whang. Vista: Visual-textual knowledge graph representation learning. InFindings of the association for computational linguistics: EMNLP 2023, pages 7314–7328,

  28. [28]

    Multi- modal reasoning with multimodal knowledge graph.arXiv preprint arXiv:2406.02030, 2024

    Junlin Lee, Yequan Wang, Jing Li, and Min Zhang. Multi- modal reasoning with multimodal knowledge graph.arXiv preprint arXiv:2406.02030, 2024. 1, 3

  29. [29]

    Retrieval- augmented generation for knowledge-intensive nlp tasks.Ad- vances in Neural Information Processing Systems, 33:9459– 9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval- augmented generation for knowledge-intensive nlp tasks.Ad- vances in Neural Information Processing Systems, 33:9459– 9474, 2020. 1, 2

  30. [30]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Zi- wei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024. 2

  31. [31]

    LLaV A-neXT- interleave: Tackling multi-image, video, and 3d in large mul- timodal models

    Feng Li, Renrui Zhang, Hao Zhang, Yuanhan Zhang, Bo Li, Wei Li, Zejun MA, and Chunyuan Li. LLaV A-neXT- interleave: Tackling multi-image, video, and 3d in large mul- timodal models. InThe Thirteenth International Conference on Learning Representations, 2025. 2

  32. [32]

    Blip- 2: bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. Blip- 2: bootstrapping language-image pre-training with frozen image encoders and large language models. InProceedings of the 40th International Conference on Machine Learning, pages 19730–19742, 2023. 2

  33. [33]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InProceedings of the 37th International Conference on Neural Information Processing Systems, pages 34892–34916, 2023. 2

  34. [34]

    Valor: Vision-audio- language omni-perception pretraining model and dataset

    Jing Liu, Sihan Chen, Xingjian He, Longteng Guo, Xinxin Zhu, Weining Wang, and Jinhui Tang. Valor: Vision-audio- language omni-perception pretraining model and dataset. IEEE Transactions on Pattern Analysis and Machine Intelli- gence, 2024. 5, 2, 6

  35. [35]

    Aligning vision to language: Annotation-free multimodal knowledge graph construction for enhanced llms reasoning

    Junming Liu, Siyuan Meng, Yanting Gao, Song Mao, Pinlong Cai, Guohang Yan, Yirong Chen, Zilin Bian, Ding Wang, and Botian Shi. Aligning vision to language: Annotation-free multimodal knowledge graph construction for enhanced llms reasoning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 981–992, 2025. 1

  36. [36]

    Grounding dino: Marrying dino with grounded pre-training for open-set object detection

    Shilong Liu, Zhaoyang Zeng, Tianhe Ren, Feng Li, Hao Zhang, Jie Yang, Qing Jiang, Chunyuan Li, Jianwei Yang, Hang Su, et al. Grounding dino: Marrying dino with grounded pre-training for open-set object detection. InEuropean con- ference on computer vision, pages 38–55. Springer, 2024. 2, 4, 3, 5

  37. [37]

    Video-chatgpt: Towards detailed video un- derstanding via large vision and language models

    Muhammad Maaz, Hanoona Rasheed, Salman Khan, and Fahad Khan. Video-chatgpt: Towards detailed video un- derstanding via large vision and language models. InPro- ceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 12585–12602, 2024. 5, 6

  38. [38]

    Vat-kg: Knowledge-intensive multimodal knowl- edge graph dataset for retrieval-augmented generation.arXiv preprint arXiv:2506.21556, 2025

    Hyeongcheol Park, Jiyoung Seo, MinHyuk Jang, Hogun Park, Ha Dam Baek, Gyusam Chang, Hyeonsoo Im, and Sang- pil Kim. Vat-kg: Knowledge-intensive multimodal knowl- edge graph dataset for retrieval-augmented generation.arXiv preprint arXiv:2506.21556, 2025. 1, 2, 3, 6, 4

  39. [39]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. InProceedings of the 38th International Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 6

  40. [40]

    Accept the modality gap: An exploration in the hyperbolic space

    Sameera Ramasinghe, Violetta Shevchenko, Gil Avraham, and Ajanthan Thalaiyasingam. Accept the modality gap: An exploration in the hyperbolic space. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27263–27272, 2024. 1

  41. [41]

    Videorag: Retrieval-augmented gen- eration with extreme long-context videos.arXiv preprint arXiv:2502.01549, 2025

    Xubin Ren, Lingrui Xu, Long Xia, Shuaiqiang Wang, Dawei Yin, and Chao Huang. Videorag: Retrieval-augmented gen- eration with extreme long-context videos.arXiv preprint arXiv:2502.01549, 2025. 1, 6, 4

  42. [42]

    Retrieval-augmented generation: A comprehensive survey of architectures, en- hancements, and robustness frontiers.arXiv preprint arXiv:2506.00054,

    Chaitanya Sharma. Retrieval-augmented generation: A com- prehensive survey of architectures, enhancements, and ro- bustness frontiers.arXiv preprint arXiv:2506.00054, 2025. 1

  43. [43]

    SALMONN: Towards Generic Hearing Abilities for Large Language Models

    Changli Tang, Wenyi Yu, Guangzhi Sun, Xianzhao Chen, Tian Tan, Wei Li, Lu Lu, Zejun Ma, and Chao Zhang. Salmonn: Towards generic hearing abilities for large language models. arXiv preprint arXiv:2310.13289, 2023. 2

  44. [44]

    Qwen2 Technical Report

    Qwen Team et al. Qwen2 technical report.arXiv preprint arXiv:2407.10671, 2(3), 2024. 2

  45. [45]

    Multi-Agent Collaboration Mechanisms: A Survey of LLMs

    Khanh-Tung Tran, Dung Dao, Minh-Duong Nguyen, Quoc- Viet Pham, Barry O’Sullivan, and Hoang D Nguyen. Multi- agent collaboration mechanisms: A survey of llms.arXiv preprint arXiv:2501.06322, 2025. 3

  46. [46]

    Audiobench: A universal benchmark for audio large language models

    Bin Wang, Xunlong Zou, Geyu Lin, Shuo Sun, Zhuohan Liu, Wenyu Zhang, Zhengyuan Liu, AiTi Aw, and Nancy Chen. Audiobench: A universal benchmark for audio large language models. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pa...

  47. [47]

    Kepler: A unified model for knowledge embedding and pre-trained lan- guage representation.Transactions of the Association for Computational Linguistics, 9:176–194, 2021

    Xiaozhi Wang, Tianyu Gao, Zhaocheng Zhu, Zhengyan Zhang, Zhiyuan Liu, Juanzi Li, and Jian Tang. Kepler: A unified model for knowledge embedding and pre-trained lan- guage representation.Transactions of the Association for Computational Linguistics, 9:176–194, 2021. 6, 3

  48. [48]

    Internvideo2: Scaling foundation models for multimodal video understanding

    Yi Wang, Kunchang Li, Xinhao Li, Jiashuo Yu, Yinan He, Guo Chen, Baoqi Pei, Rongkun Zheng, Zun Wang, Yansong Shi, et al. Internvideo2: Scaling foundation models for multimodal video understanding. In European Conference on Computer Vision, pages 396–416. Springer, 2024. 4, 6, 7

  49. [49]

    Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation

    Yusong Wu, Ke Chen, Tianyu Zhang, Yuchen Hui, Taylor Berg-Kirkpatrick, and Shlomo Dubnov. Large-scale contrastive language-audio pretraining with feature fusion and keyword-to-caption augmentation. In ICASSP 2023-2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 1–5. IEEE, 2023. 4, 6, 7

  50. [50]

    Jin Xu, Zhifang Guo, Jinzheng He, Hangrui Hu, Ting He, Shuai Bai, Keqin Chen, Jialin Wang, Yang Fan, Kai Dang, et al. Qwen2.5-omni technical report. arXiv preprint arXiv:2503.20215, 2025. 2, 5, 7, 6

  51. [51]

    Qwen3-Omni Technical Report

    Jin Xu, Zhifang Guo, Hangrui Hu, Yunfei Chu, Xiong Wang, Jinzheng He, Yuxuan Wang, Xian Shi, Ting He, Xinfa Zhu, et al. Qwen3-omni technical report. arXiv preprint arXiv:2509.17765, 2025. 5, 6

  52. [52]

    Towards weakly supervised text-to-audio grounding

    Xuenan Xu, Ziyang Ma, Mengyue Wu, and Kai Yu. Towards weakly supervised text-to-audio grounding. IEEE Transactions on Multimedia, 2024. 2, 5, 3

  53. [53]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388, 2025. 2, 1, 3

  54. [54]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. Minicpm-v: A gpt-4v level mllm on your phone. arXiv preprint arXiv:2408.01800, 2024. 1

  55. [55]

    Universalrag: Retrieval-augmented generation over multiple corpora with diverse modalities and granularities

    Woongyeong Yeo, Kangsan Kim, Soyeong Jeong, Jinheon Baek, and Sung Ju Hwang. Universalrag: Retrieval-augmented generation over corpora of diverse modalities and granularities. arXiv preprint arXiv:2504.20734, 2025. 1, 2, 4

  56. [56]

    M2conceptbase: A fine-grained aligned concept-centric multimodal knowledge base

    Zhiwei Zha, Jiaan Wang, Zhixu Li, Xiangru Zhu, Wei Song, and Yanghua Xiao. M2conceptbase: A fine-grained aligned concept-centric multimodal knowledge base. In Proceedings of the 33rd ACM International Conference on Information and Knowledge Management, pages 3113–3123, 2024. 1, 2, 3, 6, 4

  57. [57]

    Extract, define, canonicalize: An llm-based framework for knowledge graph construction

    Bowen Zhang and Harold Soh. Extract, define, canonicalize: An llm-based framework for knowledge graph construction. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 9820–9836, 2024. 3

  58. [58]

    A survey of graph retrieval-augmented generation for customized large language models

    Qinggang Zhang, Shengyuan Chen, Yuanchen Bei, Zheng Yuan, Huachi Zhou, Zijin Hong, Hao Chen, Yilin Xiao, Chuang Zhou, Junnan Dong, et al. A survey of graph retrieval-augmented generation for customized large language models. arXiv preprint arXiv:2501.13958, 2025. 1

  59. [59]

    A survey of generative information extraction

    Zikang Zhang, Wangjie You, Tianci Wu, Xinrui Wang, Juntao Li, and Min Zhang. A survey of generative information extraction. In Proceedings of the 31st International Conference on Computational Linguistics, pages 4840–4870, 2025. 3

    M3KG-RAG: Multi-hop Multimodal Knowledge Graph-enhanced Retrieval-Augmented Generation

    Supplementary Material

    Overview

    Thi...

  60. [60]

    Detect whether any triple’s context appears in the input (entities, attributes, actions, time/place cues)

  61. [61]

    If matched, integrate the FULL triple (head, relation, tail) into the answer, and enrich with head_desc/tail_desc.
    – Do NOT contradict the primary evidence; if conflict exists, ignore the triple

  62. [62]

    If no triple is confidently matched, answer from the primary evidence only.
    Query : {QUERY}
    Retrieved Triples : {TRIPLES_BLOCK}
    Triple Format : [i] head={h} | relation={r} | tail={t} || head_description={hd} | tail_description={td}
    Answer :

    Table 13. Graph-augmented generation template used to instantiate Eq. (6) in our multimodal RAG framework.

    over M3KG...
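    As a rough illustration of how the placeholders in the Table 13 template might be filled, the Python sketch below renders retrieved triples in the stated `[i] head=... | relation=... | tail=...` format and substitutes them for {QUERY} and {TRIPLES_BLOCK}. It covers only the portion of the template visible here. The `Triple` class, the function names, the 1-based index, and the line break after "Retrieved Triples :" are assumptions made for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Triple:
    """Hypothetical container for one retrieved triple and its descriptions."""
    head: str
    relation: str
    tail: str
    head_desc: str = ""
    tail_desc: str = ""


# Only the template fragment visible in the text; earlier instructions are omitted.
# Assumption: the triples block starts on its own line for readability.
PROMPT_TAIL = (
    "If no triple is confidently matched, answer from the primary evidence only.\n"
    "Query : {QUERY}\n"
    "Retrieved Triples :\n{TRIPLES_BLOCK}\n"
    "Answer :"
)


def format_triple(index: int, t: Triple) -> str:
    """Render one triple in the '[i] head=... | relation=... | tail=...' format."""
    return (
        f"[{index}] head={t.head} | relation={t.relation} | tail={t.tail} || "
        f"head_description={t.head_desc} | tail_description={t.tail_desc}"
    )


def build_prompt(query: str, triples: list[Triple]) -> str:
    """Fill the {QUERY} and {TRIPLES_BLOCK} placeholders (indices start at 1 by assumption)."""
    block = "\n".join(format_triple(i, t) for i, t in enumerate(triples, start=1))
    return PROMPT_TAIL.format(QUERY=query, TRIPLES_BLOCK=block)
```

    Calling `build_prompt(query, triples)` with an empty list leaves the triples block blank, which matches the fallback instruction to answer from the primary evidence only.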

  63. [63]

    [Figure 5, top panel: model-as-judge score versus distance threshold τ (with η_av = 1.2) for M³KG-RAG, VAT-KG, and None.]

  64. [64]

    The audio contains the sound of a bird chirping

    [Figure 5, bottom panel: model-as-judge score versus presence score threshold η_av (with τ = 4.5).]

    Figure 5. Sensitivity analysis. M.J. score on VALOR versus modality-wise distance threshold τ (top) and GRASP presence threshold η_av (bottom).

    d(q_m, x_m) ≤ τ. The remaining retrieved items are lifted into the graph via Eq. (3) in the main paper. To understand how the c...