Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering

Changin Choi; Jungmin Ko; Wonjong Rhee; Wonseok Lee

arxiv: 2509.00798 · v7 · submitted 2025-08-31 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-18 19:59 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords progressive multimodal search · knowledge-intensive VQA · dual-scope queries · compositional reasoning · retrieval-augmented generation · reasoning trajectory · visual question answering

The pith

PMSR builds progressive reasoning trajectories with dual-scope queries and compositional synthesis to improve knowledge acquisition in visual question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PMSR, a framework designed to overcome the limits of single-pass retrieval in knowledge-intensive visual question answering. It constructs a structured reasoning trajectory step by step rather than attempting to gather and integrate all needed knowledge in one shot. Dual-scope queries draw on both the most recent record and the full prior trajectory to pull diverse evidence from multiple knowledge bases. Compositional reasoning then condenses that evidence into compact, stable records that support further refinement. Experiments across six benchmarks show gains in retrieval recall and final answer accuracy.

Core claim

PMSR progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. Dual-scope queries conditioned on the latest record and the full trajectory retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is synthesized into compact records via compositional reasoning. This design enables controlled iterative refinement that produces more stable reasoning trajectories with reduced error propagation.

What carries the argument

PMSR framework that progressively builds structured reasoning trajectories by issuing dual-scope queries for retrieval and applying compositional reasoning to create compact synthesis records.
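
The loop described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the toy knowledge bases, the word-overlap retriever, and the names `retrieve`, `dual_scope_queries`, `synthesize`, and `run` are all hypothetical, and the `synthesize` step (deduplicate and truncate) is only a stand-in for the model-driven compositional reasoning the paper describes.

```python
from __future__ import annotations

from dataclasses import dataclass, field

# Hypothetical toy knowledge bases (assumption: stand-ins for heterogeneous KBs).
TEXT_KB = [
    "The Eiffel Tower is in Paris.",
    "The Eiffel Tower was completed in 1889.",
    "Gustave Eiffel's company built the tower.",
]
IMAGE_KB = [
    "Photo caption: wrought-iron lattice tower on the Champ de Mars.",
]


@dataclass
class Trajectory:
    """The question plus the compact records synthesized so far."""
    question: str
    records: list[str] = field(default_factory=list)


def retrieve(query: str, kb: list[str], k: int = 2) -> list[str]:
    """Toy retriever: rank passages by word overlap with the query."""
    q_words = set(query.lower().split())
    ranked = sorted(kb, key=lambda p: -len(q_words & set(p.lower().split())))
    return ranked[:k]


def dual_scope_queries(traj: Trajectory) -> list[str]:
    """Build one query from the latest record and one from the full trajectory."""
    latest = traj.records[-1] if traj.records else traj.question
    local_query = f"{traj.question} {latest}"
    global_query = " ".join([traj.question, *traj.records])
    return [local_query, global_query]


def synthesize(evidence: list[str], max_items: int = 3) -> str:
    """Stand-in for compositional reasoning: dedupe, then keep a compact record."""
    unique = list(dict.fromkeys(evidence))  # preserves first-seen order
    return " | ".join(unique[:max_items])


def run(question: str, steps: int = 3) -> Trajectory:
    """Progressively extend the trajectory, one synthesized record per step."""
    traj = Trajectory(question)
    for _ in range(steps):
        evidence: list[str] = []
        for query in dual_scope_queries(traj):
            for kb in (TEXT_KB, IMAGE_KB):
                evidence.extend(retrieve(query, kb))
        traj.records.append(synthesize(evidence))
    return traj
```

Calling `run("When was the Eiffel Tower completed?")` returns a trajectory with one record per step. The point of the sketch is the control flow: each step issues two queries with different scopes, and only the compact record, not the raw evidence, is carried forward.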

If this is right

Retrieval recall improves across six benchmarks that span encyclopedic, real-world, and live visual questions.
End-to-end answer accuracy rises when the same progressive trajectory is used for final response generation.
Error propagation decreases because each synthesis step produces a compact, stable record for the next iteration.
Heterogeneous knowledge bases can be queried more effectively through repeated, history-aware retrieval passes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same trajectory-building pattern could be tested on other multimodal tasks that require external knowledge, such as long-form image captioning.
Iterative refinement may allow smaller retrieval budgets per step while still reaching higher overall recall than a single large retrieval pass.
If the synthesis records remain stable, the method might support longer reasoning chains without the usual accumulation of hallucinations.

Load-bearing premise

Conditioning dual-scope queries on both the latest record and the full trajectory will acquire sufficient diverse knowledge, and compositional synthesis will produce stable records that reduce error propagation.

What would settle it

A head-to-head comparison on Encyclopedic-VQA or InfoSeek showing that PMSR produces no gain in retrieval recall or end-to-end answer accuracy relative to single-pass baselines would falsify the claimed benefit.
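
A rough sketch of how such a comparison could be scored follows. It assumes recall is measured as the fraction of gold passages found in the top-k results; individual benchmarks may define recall differently. The function names `recall_at_k`, `mean_recall`, and `shows_no_gain` are hypothetical.

```python
from __future__ import annotations


def recall_at_k(retrieved: list[str], relevant: set[str], k: int) -> float:
    """Fraction of relevant items that appear in the top-k retrieved items."""
    if not relevant:
        return 0.0
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)


def mean_recall(runs: list[tuple[list[str], set[str]]], k: int) -> float:
    """Average recall@k over (retrieved, relevant) pairs, one pair per question."""
    if not runs:
        return 0.0
    return sum(recall_at_k(r, g, k) for r, g in runs) / len(runs)


def shows_no_gain(baseline: float, progressive: float, tol: float = 0.0) -> bool:
    """True when the progressive score does not exceed the baseline by more than tol."""
    return progressive <= baseline + tol
```

The tolerance `tol` is an assumption added for illustration; in practice it would be set from the run-to-run variance of the metric rather than left at zero.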

Figures

Figures reproduced from arXiv: 2509.00798 by Changin Choi, Jungmin Ko, Wonjong Rhee, Wonseok Lee.

**Figure 2.** Accuracy and recall of MI-RAG on InfoSeek subset across 9 iterations. We analyze how iterative refinement impacts MI-RAG’s performance by measuring accuracy and recall across iterations. As shown in Figure 2, performance improves consistently with each step. The initial iterations deliver significant gains. Although the rate of improvement moderates in later steps, the model continues to achieve substan…

Original abstract

Knowledge-intensive visual question answering (VQA) requires external knowledge beyond image content, demanding precise visual grounding and coherent integration of visual and textual information. Although multimodal retrieval-augmented generation has achieved notable advances by incorporating external knowledge bases, existing approaches largely adopt single-pass frameworks that often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning. We propose PMSR (Progressive Multimodal Search and Reasoning), a framework that progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. PMSR uses dual-scope queries conditioned on both the latest record and the trajectory to retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is then synthesized into compact records via compositional reasoning. This design facilitates controlled iterative refinement, which supports more stable reasoning trajectories with reduced error propagation. Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

PMSR adds iterative trajectory building with dual-scope queries to knowledge-intensive VQA and reports gains on six benchmarks, but the experiments do not isolate whether the progressive conditioning drives the results or whether extra retrieval rounds alone would suffice.

The main point is that this paper shifts from single-pass retrieval to a progressive framework that builds structured reasoning trajectories for multimodal VQA. Dual-scope queries pull knowledge conditioned on both the latest record and the full history, then compositional synthesis turns the evidence into compact records meant to limit error buildup. The abstract claims this produces more stable paths and better knowledge acquisition than prior single-pass methods.

Referee Report

2 major / 1 minor

Summary. The paper claims that PMSR, a progressive multimodal search and reasoning framework, improves knowledge-intensive VQA by constructing structured reasoning trajectories via dual-scope queries (conditioned on both the latest record and full trajectory) that retrieve from heterogeneous knowledge bases, followed by compositional synthesis into compact records. This iterative refinement is said to yield higher retrieval recall and end-to-end answer accuracy than single-pass approaches, with consistent gains shown across six benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, OK-VQA).

Significance. If the result holds after proper isolation of the progressive mechanism, the work would offer a concrete advance over single-pass multimodal RAG by demonstrating how trajectory-conditioned retrieval and compositional record synthesis can reduce error propagation in knowledge-intensive visual reasoning. The multi-benchmark evaluation scope is a strength, but the absence of ablations or matched-budget controls limits the ability to credit the specific design choices.

major comments (2)

[Experimental evaluation (across the six benchmarks)] The central claim that dual-scope queries conditioned on the latest record plus full trajectory, together with compositional synthesis into stable records, drive the reported gains (rather than simply executing more retrieval steps) is load-bearing yet untested. The abstract and experimental description provide no non-progressive multi-round baseline or control that matches total retrieval budget or round count, leaving open the possibility that improvements arise from extra retrieval effort alone.
[Experiments and results] No ablation studies, error bars, or analysis of error propagation cases are described, which weakens support for the assertion that the progressive design produces more stable trajectories. Without these, the moderate soundness noted in the review cannot be elevated.

minor comments (1)

[Abstract] The abstract would be clearer if it quantified the reported gains (e.g., absolute or relative improvements in recall and accuracy) rather than stating only that improvements are 'consistent'.
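
To make the distinction concrete, here is a small sketch of the two ways a gain can be reported. The numbers in the usage note are hypothetical and are not results from the paper; `absolute_gain` and `relative_gain` are illustrative names.

```python
def absolute_gain(baseline: float, system: float) -> float:
    """Difference in percentage points (or raw score units)."""
    return system - baseline


def relative_gain(baseline: float, system: float) -> float:
    """Gain as a fraction of the baseline score."""
    if baseline == 0:
        raise ValueError("relative gain is undefined for a zero baseline")
    return (system - baseline) / baseline
```

For a hypothetical baseline accuracy of 40.0 and a system accuracy of 50.0, the absolute gain is 10.0 points while the relative gain is 0.25, that is, 25%. Reporting both avoids ambiguity when baselines differ widely across benchmarks.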

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our work. We address each major comment below and describe the revisions we will make to strengthen the experimental support for the progressive mechanism.

Referee: The central claim that dual-scope queries conditioned on the latest record plus full trajectory, together with compositional synthesis into stable records, drive the reported gains (rather than simply executing more retrieval steps) is load-bearing yet untested. The abstract and experimental description provide no non-progressive multi-round baseline or control that matches total retrieval budget or round count, leaving open the possibility that improvements arise from extra retrieval effort alone.

Authors: We agree that a matched-budget multi-round non-progressive baseline is necessary to isolate the contribution of the progressive design. While the current experiments compare PMSR against single-pass multimodal RAG baselines and show consistent gains in retrieval recall and answer accuracy, we did not include a control that performs the same number of retrieval rounds without dual-scope conditioning or compositional synthesis. In the revised manuscript we will add this baseline, matching total retrieval steps and computational budget (e.g., same number of API calls or token budget per question). This addition will allow direct attribution of performance differences to the trajectory-conditioned queries and record synthesis rather than extra retrieval effort.
Revision: yes

Referee: No ablation studies, error bars, or analysis of error propagation cases are described, which weakens support for the assertion that the progressive design produces more stable trajectories. Without these, the moderate soundness noted in the review cannot be elevated.

Authors: We acknowledge that the absence of component ablations, statistical error bars, and targeted error-propagation analysis limits the strength of our claims about trajectory stability. In the revision we will add (1) ablations that remove dual-scope conditioning and compositional synthesis individually while keeping the iterative loop, (2) error bars computed over multiple random seeds for the main results on all six benchmarks, and (3) a qualitative case study that traces specific error-propagation examples, showing how the progressive record synthesis corrects early mistakes that persist in single-pass or non-compositional variants. These additions will provide clearer evidence that the observed improvements stem from reduced error accumulation.
Revision: yes
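
The two controls promised in these responses, a matched retrieval budget and error bars over seeds, can be sketched as follows. This is an illustration under assumptions, not the authors' protocol: budget is counted here as retrieval calls (rounds × queries per round × knowledge bases), `NUM_KBS` is a hypothetical constant, and the scores in the usage note are made up.

```python
from __future__ import annotations

from statistics import mean, stdev

NUM_KBS = 2  # hypothetical number of knowledge bases hit by each query


def retrieval_calls(rounds: int, queries_per_round: int, num_kbs: int = NUM_KBS) -> int:
    """Total retrieval calls spent on one question."""
    return rounds * queries_per_round * num_kbs


def matched_rounds(target_calls: int, queries_per_round: int, num_kbs: int = NUM_KBS) -> int:
    """Rounds a baseline needs to spend exactly target_calls retrieval calls."""
    per_round = queries_per_round * num_kbs
    if target_calls % per_round != 0:
        raise ValueError("budget cannot be matched exactly with this configuration")
    return target_calls // per_round


def mean_and_std(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across seeds (needs at least two scores)."""
    return mean(scores), stdev(scores)
```

As a usage example, a progressive run with 3 rounds and 2 queries per round spends `retrieval_calls(3, 2)` = 12 calls, so a single-query baseline would need `matched_rounds(12, 1)` = 6 rounds to match it. Hypothetical per-seed accuracies of 60.0, 62.0, and 64.0 give a mean of 62.0 with a sample standard deviation of 2.0.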

Circularity Check

0 steps flagged

No circularity: PMSR is an independent architectural proposal validated empirically.

The paper introduces PMSR as a design framework for progressive multimodal search and reasoning, specifying dual-scope queries conditioned on latest record and trajectory plus compositional synthesis into records. These are presented as explicit methodological choices rather than quantities derived from equations, fitted parameters, or self-citations. No load-bearing steps reduce by construction to inputs; the abstract and description frame the approach as an external proposal whose value is assessed via experiments on six benchmarks. The central claims rest on empirical improvements in recall and accuracy, not on self-referential definitions or imported uniqueness results from the authors' prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the method appears to rest on standard assumptions of multimodal retrieval and iterative reasoning.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
cs.CV 2026-04 unverdicted novelty 6.0

WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · cited by 1 Pith paper · 10 internal anchors

[1]

How (not) to ensemble lvlms for vqa

Lisa Alazraki, Lluis Castrejon, Mostafa Dehghani, Fantine Huot, Jasper Uijlings, and Thomas Mensink. How (not) to ensemble lvlms for vqa. In Proceedings on, pp.\ 1--20. PMLR, 2023

work page 2023
[2]

The distracting effect: Understanding irrelevant passages in rag

Chen Amiraz, Florin Cuconasu, Simone Filice, and Zohar Karnin. The distracting effect: Understanding irrelevant passages in rag. arXiv preprint arXiv:2505.06914, 2025

work page arXiv 2025
[3]

Tomayto, tomahto

Jannis Bulian, Christian Buck, Wojciech Gajewski, Benjamin B \"o rschinger, and Tal Schuster. Tomayto, tomahto. beyond token-level answer equivalence for question answering evaluation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp.\ 291--305, Abu Dhabi,...

work page doi:10.18653/v1/2022.emnlp-main.20 2022
[4]

Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms

Davide Caffagni, Federico Cocchi, Nicholas Moratelli, Sara Sarto, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Wiki-llava: Hierarchical retrieval-augmented generation for multimodal llms. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp.\ 1818--1826, 2024

work page 2024
[5]

Hammr: Hierarchical multimodal react agents for generic vqa

Lluis Castrejon, Thomas Mensink, Howard Zhou, Vittorio Ferrari, Andre Araujo, and Jasper Uijlings. Hammr: Hierarchical multimodal react agents for generic vqa. arXiv preprint arXiv:2404.05465, 2024

work page arXiv 2024
[6]

Enhancing retrieval-augmented audio captioning with generation-assisted multimodal querying and progressive learning, 2025

Choi Changin, Lim Sungjun, and Rhee Wonjong. Enhancing retrieval-augmented audio captioning with generation-assisted multimodal querying and progressive learning, 2025. URL https://arxiv.org/abs/2410.10913

work page arXiv 2025
[7]

Yang Chen, Hexiang Hu, Yi Luan, Haitian Sun, Soravit Changpinyo, Alan Ritter, and Ming-Wei Chang. Can pre-trained vision and language models answer visual information-seeking questions? In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 14948--14968, Singapore, ...

work page doi:10.18653/v1/2023.emnlp-main.925 2023
[8]

Mllm is a strong reranker: Advancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training

Zhanpeng Chen, Chengjin Xu, Yiyan Qi, and Jian Guo. Mllm is a strong reranker: Advancing multimodal retrieval-augmented generation via knowledge-enhanced reranking and noise-injected training. arXiv preprint arXiv:2407.21439, 2024

work page arXiv 2024
[9]

Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering

Federico Cocchi, Nicholas Moratelli, Marcella Cornia, Lorenzo Baraldi, and Rita Cucchiara. Augmenting Multimodal LLMs with Self-Reflective Tokens for Knowledge-based Visual Question Answering . In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[10]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[11]

Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri

Florin Cuconasu, Giovanni Trappolini, F. Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. The power of noise: Redefining retrieval for rag systems. In Annual International ACM SIGIR Conference on Research and Development in Information Retrieval, 2024. URL https://api.semanticscholar.org/CorpusID:267301416

work page 2024
[12]

Muka: Multimodal knowledge augmented visual information-seeking

Lianghao Deng, Yuchong Sun, Shizhe Chen, Ning Yang, Yunfeng Wang, and Ruihua Song. Muka: Multimodal knowledge augmented visual information-seeking. In Proceedings of the 31st International Conference on Computational Linguistics, pp.\ 9675--9686, 2025

work page 2025
[13]

Synergizing rag and reasoning: A systematic review

Yunfan Gao, Yun Xiong, Yijie Zhong, Yuxi Bi, Ming Xue, and Haofen Wang. Synergizing rag and reasoning: A systematic review. arXiv preprint arXiv:2504.15909, 2025

work page arXiv 2025
[14]

Masking in multi-hop qa: An analysis of how language models perform with context permutation

Wenyu Huang, Pavlos Vougiouklis, Mirella Lapata, and Jeff Z Pan. Masking in multi-hop qa: An analysis of how language models perform with context permutation. arXiv preprint arXiv:2505.11754, 2025

work page arXiv 2025
[15]

Large language models know what is key visual entity: An llm-assisted multimodal retrieval for vqa

Pu Jian, Donglei Yu, and Jiajun Zhang. Large language models know what is key visual entity: An llm-assisted multimodal retrieval for vqa. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pp.\ 10939--10956, 2024

work page 2024
[16]

Jiang, J

Jinhao Jiang, Jiayi Chen, Junyi Li, Ruiyang Ren, Shijie Wang, Wayne Xin Zhao, Yang Song, and Tao Zhang. Rag-star: Enhancing deliberative reasoning with retrieval augmented verification and refinement. arXiv preprint arXiv:2412.12881, 2024 a

work page arXiv 2024
[17]

E5-V: Universal Embeddings with Multimodal Large Language Models

Ting Jiang, Minghui Song, Zihan Zhang, Haizhen Huang, Weiwei Deng, Feng Sun, Qi Zhang, Deqing Wang, and Fuzhen Zhuang. E5-v: Universal embeddings with multimodal large language models. arXiv preprint arXiv:2407.12580, 2024 b

work page internal anchor Pith review Pith/arXiv arXiv 2024
[18]

Retrieve, summarize, plan: Advancing multi-hop question answering with an iterative approach

Zhouyu Jiang, Mengshu Sun, Lei Liang, and Zhiqiang Zhang. Retrieve, summarize, plan: Advancing multi-hop question answering with an iterative approach. arXiv preprint arXiv:2407.13101, 2024 c

work page arXiv 2024
[19]

VLM 2vec: Training vision-language models for massive multimodal embedding tasks

Ziyan Jiang, Rui Meng, Xinyi Yang, Semih Yavuz, Yingbo Zhou, and Wenhu Chen. VLM 2vec: Training vision-language models for massive multimodal embedding tasks. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=TE0KOzWYAF

work page 2025
[20]

Flashrag: A modular toolkit for efficient retrieval-augmented generation research

Jiajie Jin, Yutao Zhu, Guanting Dong, Yuyao Zhang, Xinyu Yang, Chenghao Zhang, Tong Zhao, Zhao Yang, Zhicheng Dou, and Ji-Rong Wen. Flashrag: A modular toolkit for efficient retrieval-augmented generation research. arXiv preprint arXiv:2405.13576, 2024

work page arXiv 2024
[21]

MM - EMBED : UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS

Sheng-Chieh Lin, Chankyu Lee, Mohammad Shoeybi, Jimmy Lin, Bryan Catanzaro, and Wei Ping. MM - EMBED : UNIVERSAL MULTIMODAL RETRIEVAL WITH MULTIMODAL LLMS . In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=i45NQb2iKO

work page 2025
[22]

Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering

Weizhe Lin, Jinghong Chen, Jingbiao Mei, Alexandru Coca, and Bill Byrne. Fine-grained late-interaction multi-modal retrieval for retrieval augmented visual question answering. In Thirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=IWWWulAX7g

work page 2023
[23]

P re FLMR : Scaling up fine-grained late-interaction multi-modal retrievers

Weizhe Lin, Jingbiao Mei, Jinghong Chen, and Bill Byrne. P re FLMR : Scaling up fine-grained late-interaction multi-modal retrievers. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 5294--5316, Bangkok, Thailand, August 2024. Asso...

work page 2024
[24]

MMKB-RAG: A multi-modal knowledge-based retrieval-augmented generation framework,

Zihan Ling, Zhiyao Guo, Yixuan Huang, Yi An, Shuai Xiao, Jinsong Lan, Xiaoyong Zhu, and Bo Zheng. Mmkb-rag: A multi-modal knowledge-based retrieval-augmented generation framework. arXiv preprint arXiv:2504.10074, 2025

work page arXiv 2025
[25]

RA - ISF : Learning to Answer and Understand from Retrieval Augmentation via Iterative Self-Feedback

Yanming Liu, Xinyue Peng, Xuhong Zhang, Weihao Liu, Jianwei Yin, Jiannan Cao, and Tianyu Du. RA - ISF : Learning to answer and understand from retrieval augmentation via iterative self-feedback. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Findings of the Association for Computational Linguistics: ACL 2024, pp.\ 4730--4749, Bangkok, Thailand, ...

work page doi:10.18653/v1/2024.findings-acl.281 2024
[26]

Lamra: Large multimodal model as your advanced retrieval assistant

Yikun Liu, Pingan Chen, Jiayin Cai, Xiaolong Jiang, Yao Hu, Jiangchao Yao, Yanfeng Wang, and Weidi Xie. Lamra: Large multimodal model as your advanced retrieval assistant. arXiv preprint arXiv:2412.01720, 2024 b

work page arXiv 2024
[27]

Generative multi-modal knowledge retrieval with large language models

Xinwei Long, Jiali Zeng, Fandong Meng, Zhiyuan Ma, Kaiyan Zhang, Bowen Zhou, and Jie Zhou. Generative multi-modal knowledge retrieval with large language models. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pp.\ 18733--18741, 2024

work page 2024
[28]

Retrieval-augmented visual question answering via built-in autoregressive search engines

Xinwei Long, Zhiyuan Ma, Ermo Hua, Kaiyan Zhang, Biqing Qi, and Bowen Zhou. Retrieval-augmented visual question answering via built-in autoregressive search engines. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pp.\ 24723--24731, 2025

work page 2025
[29]

Weakly-supervised visual-retriever-reader for knowledge-based question answering

Man Luo, Yankai Zeng, Pratyay Banerjee, and Chitta Baral. Weakly-supervised visual-retriever-reader for knowledge-based question answering. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp.\ 6417--6431, Online and Punta Cana, Domi...

work page doi:10.18653/v1/2021.emnlp-main.517 2021
[30]

End-to-end knowledge retrieval with multi-modal queries

Man Luo, Zhiyuan Fang, Tejas Gokhale, Yezhou Yang, and Chitta Baral. End-to-end knowledge retrieval with multi-modal queries. arXiv preprint arXiv:2306.00424, 2023

work page arXiv 2023
[31]

Ok-vqa: A visual question answering benchmark requiring external knowledge

Kenneth Marino, Mohammad Rastegari, Ali Farhadi, and Roozbeh Mottaghi. Ok-vqa: A visual question answering benchmark requiring external knowledge. In Proceedings of the IEEE/cvf conference on computer vision and pattern recognition, pp.\ 3195--3204, 2019

work page 2019
[32]

Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories

Thomas Mensink, Jasper Uijlings, Lluis Castrejon, Arushi Goel, Felipe Cadar, Howard Zhou, Fei Sha, Andr \'e Araujo, and Vittorio Ferrari. Encyclopedic vqa: Visual questions about detailed properties of fine-grained categories. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp.\ 3113--3124, 2023

work page 2023
[33]

Openrouter api, 2025

OpenRouter . Openrouter api, 2025. URL https://openrouter.ai/docs/api-reference. Accessed: 2025-05-21

work page 2025
[34]

Miller, and Sebastian Riedel

Fabio Petroni, Patrick Lewis, Aleksandra Piktus, Tim Rockt \"a schel, Yuxiang Wu, Alexander H. Miller, and Sebastian Riedel. How context affects language models' factual predictions. In Automated Knowledge Base Construction, 2020. URL https://openreview.net/forum?id=025X0zPfn

work page 2020
[35]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pp.\ 8748--8763. PmLR, 2021

work page 2021
[36]

Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy

Zhihong Shao, Yeyun Gong, Yelong Shen, Minlie Huang, Nan Duan, and Weizhu Chen. Enhancing retrieval-augmented large language models with iterative retrieval-generation synergy. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Findings of the Association for Computational Linguistics: EMNLP 2023, pp.\ 9248--9274, Singapore, December 2023. Association f...

work page doi:10.18653/v1/2023.findings-emnlp.620 2023
[37]

Large language models can be easily distracted by irrelevant context

Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H Chi, Nathanael Sch \"a rli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In International Conference on Machine Learning, pp.\ 31210--31227. PMLR, 2023

work page 2023
[38]

EVA-CLIP: Improved Training Techniques for CLIP at Scale

Quan Sun, Yuxin Fang, Ledell Wu, Xinlong Wang, and Yue Cao. Eva-clip: Improved training techniques for clip at scale. arXiv preprint arXiv:2303.15389, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[39]

Gemma 3 Technical Report

Gemma Team, Aishwarya Kamath, Johan Ferret, Shreya Pathak, Nino Vieillard, Ramona Merhej, Sarah Perrin, Tatiana Matejovicova, Alexandre Ram \'e , Morgane Rivi \`e re, et al. Gemma 3 technical report. arXiv preprint arXiv:2503.19786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[40]

Interleaving Retrieval with Chain-of-Thought Reasoning for Knowledge-Intensive Multi-Step Questions

Harsh Trivedi, Niranjan Balasubramanian, Tushar Khot, and Ashish Sabharwal. Interleaving retrieval with chain-of-thought reasoning for knowledge-intensive multi-step questions. arXiv preprint arXiv:2212.10509, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[41]

SigLIP 2: Multilingual Vision-Language Encoders with Improved Semantic Understanding, Localization, and Dense Features

Michael Tschannen, Alexey Gritsenko, Xiao Wang, Muhammad Ferjad Naeem, Ibrahim Alabdulmohsin, Nikhil Parthasarathy, Talfan Evans, Lucas Beyer, Ye Xia, Basil Mustafa, et al. Siglip 2: Multilingual vision-language encoders with improved semantic understanding, localization, and dense features. arXiv preprint arXiv:2502.14786, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[42]

Fei Wang, Xingchen Wan, Ruoxi Sun, Jiefeng Chen, and Sercan Ö. Arık. Astute rag: Overcoming imperfect retrieval augmentation and knowledge conflicts for large language models, 2025. URL https://arxiv.org/abs/2410.07176

work page arXiv 2025
[43]

Text Embeddings by Weakly-Supervised Contrastive Pre-training

Liang Wang, Nan Yang, Xiaolong Huang, Binxing Jiao, Linjun Yang, Daxin Jiang, Rangan Majumder, and Furu Wei. Text embeddings by weakly-supervised contrastive pre-training. arXiv preprint arXiv:2212.03533, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[44]

Rat: Retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation

Zihao Wang, Anji Liu, Haowei Lin, Jiaqi Li, Xiaojian Ma, and Yitao Liang. Rat: Retrieval augmented thoughts elicit context-aware reasoning in long-horizon generation. arXiv preprint arXiv:2403.05313, 2024

work page arXiv 2024
[45]

Smarter, Better, Faster, Longer: A Modern Bidirectional Encoder for Fast, Memory Efficient, and Long Context Finetuning and Inference

Benjamin Warner, Antoine Chaffin, Benjamin Clavi \'e , Orion Weller, Oskar Hallstr \"o m, Said Taghadouini, Alexis Gallagher, Raja Biswas, Faisal Ladhak, Tom Aarsen, et al. Smarter, better, faster, longer: A modern bidirectional encoder for fast, memory efficient, and long context finetuning and inference. arXiv preprint arXiv:2412.13663, 2024

work page internal anchor Pith review arXiv 2024
[46]

Uniir: Training and benchmarking universal multimodal information retrievers

Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers. In European Conference on Computer Vision, pp.\ 387--404. Springer, 2024

work page 2024
[47]

Longmemeval: Benchmarking chat assistants on long-term interactive memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=pZiyCaVuti

work page 2025
[48]

How easily do irrelevant inputs skew the responses of large language models?arXiv preprint arXiv:2404.03302, 2024

Siye Wu, Jian Xie, Jiangjie Chen, Tinghui Zhu, Kai Zhang, and Yanghua Xiao. How easily do irrelevant inputs skew the responses of large language models? ArXiv, abs/2404.03302, 2024. URL https://api.semanticscholar.org/CorpusID:268889623

work page arXiv 2024
[49]

Improving retrieval-augmented generation in medicine with iterative follow-up questions

Guangzhi Xiong, Qiao Jin, Xiao Wang, Minjia Zhang, Zhiyong Lu, and Aidong Zhang. Improving retrieval-augmented generation in medicine with iterative follow-up questions. In Biocomputing 2025: Proceedings of the Pacific Symposium, pp.\ 199--214. World Scientific, 2024

work page 2025
[50]

Findings of the

Yibin Yan and Weidi Xie. E cho S ight: Advancing visual-language models with W iki knowledge. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of the Association for Computational Linguistics: EMNLP 2024, pp.\ 1538--1551, Miami, Florida, USA, November 2024. Association for Computational Linguistics. doi:10.18653/v1/2024.findings-emnlp...

work page doi:10.18653/v1/2024.findings-emnlp.83 2024
[51]

Omgm: Orchestrate multiple granularities and modalities for efficient multimodal retrieval

Wei Yang, Jingjing Fu, Rui Wang, Jinyu Wang, Lei Song, and Jiang Bian. Omgm: Orchestrate multiple granularities and modalities for efficient multimodal retrieval. arXiv preprint arXiv:2505.07879, 2025

[52]

Retrieval-augmented multimodal language modeling

Michihiro Yasunaga, Armen Aghajanyan, Weijia Shi, Rich James, Jure Leskovec, Percy Liang, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. Retrieval-augmented multimodal language modeling. In Proceedings of the 40th International Conference on Machine Learning, ICML'23. JMLR.org, 2023

[53]

Auto-rag: Autonomous retrieval-augmented generation for large language models

Tian Yu, Shaolei Zhang, and Yang Feng. Auto-rag: Autonomous retrieval-augmented generation for large language models. arXiv preprint arXiv:2411.19443, 2024

[54]

Inference scaling for long-context retrieval augmented generation

Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. Inference scaling for long-context retrieval augmented generation. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=FSjIrOm1vz

[55]

mR2AG: Multimodal Retrieval-Reflection-Augmented Generation for Knowledge-Based VQA

Tao Zhang, Ziqi Zhang, Zongyang Ma, Yuxin Chen, Zhongang Qi, Chunfeng Yuan, Bing Li, Junfu Pu, Yuxuan Zhao, Zehua Xie, Jin Ma, Ying Shan, and Weiming Hu. mR^2AG: Multimodal retrieval-reflection-augmented generation for knowledge-based vqa. arXiv preprint arXiv:2411.15041, 2024a. URL https://arxiv.org/abs/2411.15041

[56]

GME: Improving Universal Multimodal Retrieval by Multimodal LLMs

Xin Zhang, Yanzhao Zhang, Wen Xie, Mingxin Li, Ziqi Dai, Dingkun Long, Pengjun Xie, Meishan Zhang, Wenjie Li, and Min Zhang. Gme: Improving universal multimodal retrieval by multimodal llms. arXiv preprint arXiv:2412.16855, 2024b

[57]

Siren's Song in the AI Ocean: A Survey on Hallucination in Large Language Models

Yue Zhang, Yafu Li, Leyang Cui, Deng Cai, Lemao Liu, Tingchen Fu, Xinting Huang, Enbo Zhao, Yu Zhang, Yulong Chen, et al. Siren's song in the ai ocean: a survey on hallucination in large language models. arXiv preprint arXiv:2309.01219, 2023

[58]

Levelrag: Enhancing retrieval-augmented generation with multi-hop logic planning over rewriting augmented searchers

Zhuocheng Zhang, Yang Feng, and Min Zhang. Levelrag: Enhancing retrieval-augmented generation with multi-hop logic planning over rewriting augmented searchers. arXiv preprint arXiv:2502.18139, 2025


