
arxiv: 2509.00798 · v7 · submitted 2025-08-31 · 💻 cs.CV · cs.AI

Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering

Pith reviewed 2026-05-18 19:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords progressive multimodal search · knowledge-intensive VQA · dual-scope queries · compositional reasoning · retrieval-augmented generation · reasoning trajectory · visual question answering

The pith

PMSR builds progressive reasoning trajectories with dual-scope queries and compositional synthesis to improve knowledge acquisition in visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PMSR, a framework designed to overcome the limits of single-pass retrieval in knowledge-intensive visual question answering. It constructs a structured reasoning trajectory step by step rather than attempting to gather and integrate all needed knowledge in one shot. Dual-scope queries draw on both the most recent record and the full prior trajectory to pull diverse evidence from multiple knowledge bases. Compositional reasoning then condenses that evidence into compact, stable records that support further refinement. Experiments across six benchmarks show gains in retrieval recall and final answer accuracy.

Core claim

PMSR progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. Dual-scope queries conditioned on the latest record and the full trajectory retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is synthesized into compact records via compositional reasoning. This design enables controlled iterative refinement that produces more stable reasoning trajectories with reduced error propagation.

What carries the argument

The argument rests on the PMSR framework, which builds a structured reasoning trajectory step by step: at each step it issues dual-scope queries for retrieval, then applies compositional reasoning to condense the retrieved evidence into a compact synthesis record.
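
The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `Record`, `dual_scope_queries`, `retrieve`, `synthesize`, and `run_trajectory`, the word-overlap retriever, the concatenating synthesizer, and the two toy knowledge bases are all hypothetical and introduced here only to show the control flow. In the paper, query generation and synthesis would presumably be carried out by a multimodal model.

```python
from dataclasses import dataclass


@dataclass
class Record:
    """Compact synthesis record for one trajectory step (hypothetical type)."""
    step: int
    summary: str


def dual_scope_queries(question, trajectory):
    """Build one query from the latest record and one from the full trajectory."""
    if not trajectory:
        return [question]
    latest = f"{question} {trajectory[-1].summary}"
    full = f"{question} " + " ".join(r.summary for r in trajectory)
    return [latest, full]


def retrieve(query, knowledge_bases, seen, k=1):
    """Toy retriever: rank unseen passages from every knowledge base by word overlap."""
    words = set(query.lower().split())
    scored = []
    for kb in knowledge_bases:
        for passage in kb:
            if passage in seen:
                continue
            overlap = len(words & set(passage.lower().split()))
            if overlap > 0:
                scored.append((overlap, passage))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [passage for _, passage in scored[:k]]


def synthesize(step, evidence):
    """Toy stand-in for compositional reasoning: condense evidence into one record."""
    return Record(step=step, summary=" ".join(evidence))


def run_trajectory(question, knowledge_bases, max_steps=3):
    """Alternate retrieval and synthesis until no new evidence is found."""
    trajectory, seen = [], set()
    for step in range(max_steps):
        evidence = []
        for query in dual_scope_queries(question, trajectory):
            for passage in retrieve(query, knowledge_bases, seen):
                seen.add(passage)
                evidence.append(passage)
        if not evidence:
            break
        trajectory.append(synthesize(step, evidence))
    return trajectory


# Toy knowledge bases (made up for illustration, not from the paper).
text_kb = ["the lighthouse stands on amrum island", "amrum island belongs to germany"]
web_kb = ["germany uses the euro"]
trajectory = run_trajectory(
    "which currency is used where the lighthouse stands", [text_kb, web_kb]
)
for record in trajectory:
    print(record.step, record.summary)
```

In this toy run the first step can only reach the passage that overlaps with the question itself. The later passages become reachable once the first record is folded into the queries, which is the behaviour the history-aware retrieval is meant to provide.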

If this is right

  • Retrieval recall improves across six benchmarks that span encyclopedic, real-world, and live visual questions.
  • End-to-end answer accuracy rises when the same progressive trajectory is used for final response generation.
  • Error propagation decreases because each synthesis step produces a compact, stable record for the next iteration.
  • Heterogeneous knowledge bases can be queried more effectively through repeated, history-aware retrieval passes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same trajectory-building pattern could be tested on other multimodal tasks that require external knowledge, such as long-form image captioning.
  • Iterative refinement may allow smaller retrieval budgets per step while still reaching higher overall recall than a single large retrieval pass.
  • If the synthesis records remain stable, the method might support longer reasoning chains without the usual accumulation of hallucinations.

Load-bearing premise

Conditioning dual-scope queries on both the latest record and the full trajectory will acquire sufficient diverse knowledge, and compositional synthesis will produce stable records that reduce error propagation.

What would settle it

A head-to-head comparison on Encyclopedic-VQA or InfoSeek showing that PMSR produces no gain in retrieval recall or end-to-end answer accuracy relative to single-pass baselines would falsify the claimed benefit.
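
A head-to-head comparison of this kind reduces to computing the same metrics for both systems on the same questions. The sketch below is a hedged illustration: the function names, the hit-based definition of recall@k (a question counts as a hit if any gold passage appears in the top k), the exact-match accuracy, and every identifier in the toy data are assumptions made here. None of the numbers are results from the paper.

```python
def recall_at_k(retrieved_per_question, gold_per_question, k):
    """Fraction of questions with at least one gold passage in the top-k results."""
    hits = 0
    for retrieved, gold in zip(retrieved_per_question, gold_per_question):
        if any(doc in gold for doc in retrieved[:k]):
            hits += 1
    return hits / len(gold_per_question)


def accuracy(predictions, answers):
    """Case-insensitive exact-match accuracy (an assumed scoring rule)."""
    correct = sum(
        p.strip().lower() == a.strip().lower() for p, a in zip(predictions, answers)
    )
    return correct / len(answers)


# Toy passage ids, invented for illustration only.
gold = [{"d1"}, {"d4"}, {"d7"}]
single_pass = [["d2", "d1"], ["d5", "d6"], ["d7", "d8"]]
progressive = [["d1", "d2"], ["d6", "d4"], ["d7", "d9"]]

single_recall = recall_at_k(single_pass, gold, k=2)
progressive_recall = recall_at_k(progressive, gold, k=2)
recall_gain = progressive_recall - single_recall

print(f"recall@2 single-pass: {single_recall:.3f}")
print(f"recall@2 progressive: {progressive_recall:.3f}")
print(f"gain: {recall_gain:.3f}")
```

Under the falsification test stated above, a gain of zero or less on both recall and accuracy would count against the claimed benefit.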

Figures

Figures reproduced from arXiv: 2509.00798 by Changin Choi, Jungmin Ko, Wonjong Rhee, Wonseok Lee.

Figure 1: An overview of conventional multimodal RAG and our MI-RAG framework. Conventional …
Figure 2: Accuracy and recall of MI-RAG on InfoSeek subset across 9 iterations. We analyze how iterative refinement impacts MI-RAG's performance by measuring accuracy and recall across iterations. As shown in Figure 2, performance improves consistently with each step. The initial iterations deliver significant gains. Although the rate of improvement moderates in later steps, the model continues to achieve substan…
Original abstract

Knowledge-intensive visual question answering (VQA) requires external knowledge beyond image content, demanding precise visual grounding and coherent integration of visual and textual information. Although multimodal retrieval-augmented generation has achieved notable advances by incorporating external knowledge bases, existing approaches largely adopt single-pass frameworks that often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning. We propose PMSR (Progressive Multimodal Search and Reasoning), a framework that progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. PMSR uses dual-scope queries conditioned on both the latest record and the trajectory to retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is then synthesized into compact records via compositional reasoning. This design facilitates controlled iterative refinement, which supports more stable reasoning trajectories with reduced error propagation. Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that PMSR, a progressive multimodal search and reasoning framework, improves knowledge-intensive VQA by constructing structured reasoning trajectories via dual-scope queries (conditioned on both the latest record and full trajectory) that retrieve from heterogeneous knowledge bases, followed by compositional synthesis into compact records. This iterative refinement is said to yield higher retrieval recall and end-to-end answer accuracy than single-pass approaches, with consistent gains shown across six benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, OK-VQA).

Significance. If the result holds after proper isolation of the progressive mechanism, the work would offer a concrete advance over single-pass multimodal RAG by demonstrating how trajectory-conditioned retrieval and compositional record synthesis can reduce error propagation in knowledge-intensive visual reasoning. The multi-benchmark evaluation scope is a strength, but the absence of ablations or matched-budget controls limits the ability to credit the specific design choices.

major comments (2)
  1. [Experimental evaluation (across the six benchmarks)] The central claim that dual-scope queries conditioned on the latest record plus full trajectory, together with compositional synthesis into stable records, drive the reported gains (rather than simply executing more retrieval steps) is load-bearing yet untested. The abstract and experimental description provide no non-progressive multi-round baseline or control that matches total retrieval budget or round count, leaving open the possibility that improvements arise from extra retrieval effort alone.
  2. [Experiments and results] No ablation studies, error bars, or analysis of error propagation cases are described, which weakens support for the assertion that the progressive design produces more stable trajectories. Without these, the moderate soundness noted in the review cannot be elevated.
minor comments (1)
  1. [Abstract] The abstract would be clearer if it quantified the reported gains (e.g., absolute or relative improvements in recall and accuracy) rather than stating only that improvements are 'consistent'.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and describe the revisions we will make to strengthen the experimental support for the progressive mechanism.

  1. Referee: The central claim that dual-scope queries conditioned on the latest record plus full trajectory, together with compositional synthesis into stable records, drive the reported gains (rather than simply executing more retrieval steps) is load-bearing yet untested. The abstract and experimental description provide no non-progressive multi-round baseline or control that matches total retrieval budget or round count, leaving open the possibility that improvements arise from extra retrieval effort alone.

    Authors: We agree that a matched-budget multi-round non-progressive baseline is necessary to isolate the contribution of the progressive design. While the current experiments compare PMSR against single-pass multimodal RAG baselines and show consistent gains in retrieval recall and answer accuracy, we did not include a control that performs the same number of retrieval rounds without dual-scope conditioning or compositional synthesis. In the revised manuscript we will add this baseline, matching total retrieval steps and computational budget (e.g., same number of API calls or token budget per question). This addition will allow direct attribution of performance differences to the trajectory-conditioned queries and record synthesis rather than extra retrieval effort. revision: yes

  2. Referee: No ablation studies, error bars, or analysis of error propagation cases are described, which weakens support for the assertion that the progressive design produces more stable trajectories. Without these, the moderate soundness noted in the review cannot be elevated.

    Authors: We acknowledge that the absence of component ablations, statistical error bars, and targeted error-propagation analysis limits the strength of our claims about trajectory stability. In the revision we will add (1) ablations that remove dual-scope conditioning and compositional synthesis individually while keeping the iterative loop, (2) error bars computed over multiple random seeds for the main results on all six benchmarks, and (3) a qualitative case study that traces specific error-propagation examples, showing how the progressive record synthesis corrects early mistakes that persist in single-pass or non-compositional variants. These additions will provide clearer evidence that the observed improvements stem from reduced error accumulation. revision: yes
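
The two promised controls, a matched-budget multi-round baseline and error bars over seeds, can be made concrete with a small bookkeeping sketch. Everything here is an assumption for illustration: the `RunConfig` fields, the choice to measure budget as total passages retrieved, and the placeholder scores. The rebuttal itself mentions API calls or token budget as possible budget units and does not specify the number of seeds.

```python
from dataclasses import dataclass
from statistics import mean, stdev


@dataclass(frozen=True)
class RunConfig:
    """Retrieval budget of one system (hypothetical fields)."""
    name: str
    rounds: int
    queries_per_round: int
    passages_per_query: int

    @property
    def total_passages(self):
        return self.rounds * self.queries_per_round * self.passages_per_query


def budgets_match(a, b):
    """True when two systems use the same round count and total retrieval budget."""
    return a.rounds == b.rounds and a.total_passages == b.total_passages


def mean_and_std(scores):
    """Mean and sample standard deviation of per-seed scores."""
    return mean(scores), stdev(scores)


progressive = RunConfig("progressive", rounds=3, queries_per_round=2, passages_per_query=5)
control = RunConfig("multi-round control", rounds=3, queries_per_round=2, passages_per_query=5)
single_pass = RunConfig("single-pass", rounds=1, queries_per_round=1, passages_per_query=5)

print(budgets_match(progressive, control))
print(budgets_match(progressive, single_pass))

# Placeholder per-seed scores, not results from the paper.
placeholder_scores = [0.50, 0.52, 0.54]
avg, spread = mean_and_std(placeholder_scores)
print(f"{avg:.3f} +/- {spread:.3f}")
```

Checking the budget before comparing scores keeps the control honest: a gain over the single-pass configuration alone cannot separate the effect of the progressive design from the effect of retrieving more.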

Circularity Check

0 steps flagged

No circularity: PMSR is an independent architectural proposal validated empirically.


The paper introduces PMSR as a design framework for progressive multimodal search and reasoning, specifying dual-scope queries conditioned on latest record and trajectory plus compositional synthesis into records. These are presented as explicit methodological choices rather than quantities derived from equations, fitted parameters, or self-citations. No load-bearing steps reduce by construction to inputs; the abstract and description frame the approach as an external proposal whose value is assessed via experiments on six benchmarks. The central claims rest on empirical improvements in recall and accuracy, not on self-referential definitions or imported uniqueness results from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method appears to rest on standard assumptions of multimodal retrieval and iterative reasoning.

pith-pipeline@v0.9.0 · 5712 in / 977 out tokens · 42013 ms · 2026-05-18T19:59:49.727496+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering

    cs.CV 2026-04 unverdicted novelty 6.0

    WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.
