Progressive Multimodal Search and Reasoning for Knowledge-Intensive Visual Question Answering
Pith reviewed 2026-05-18 19:59 UTC · model grok-4.3
The pith
PMSR builds progressive reasoning trajectories with dual-scope queries and compositional synthesis to improve knowledge acquisition in visual question answering.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PMSR progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. Dual-scope queries conditioned on the latest record and the full trajectory retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is synthesized into compact records via compositional reasoning. This design enables controlled iterative refinement that produces more stable reasoning trajectories with reduced error propagation.
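The loop described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the `Trajectory` class, the query templates, the fixed `max_steps` stopping rule, and the toy `retrieve` and `synthesize` callables are all hypothetical stand-ins chosen so the control flow is easy to follow.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Trajectory:
    """Hypothetical container: the question plus its ordered synthesis records."""
    question: str
    records: List[str] = field(default_factory=list)


def progressive_search(
    question: str,
    retrieve: Callable[[str], List[str]],
    synthesize: Callable[[str, List[str], List[str]], str],
    max_steps: int = 3,
) -> Trajectory:
    """Run a fixed number of retrieve-then-synthesize steps (assumed stopping rule)."""
    traj = Trajectory(question)
    for _ in range(max_steps):
        latest = traj.records[-1] if traj.records else question
        # Dual-scope queries: one conditioned on the latest record,
        # one conditioned on the full trajectory so far.
        local_query = f"{question} | latest: {latest}"
        global_query = f"{question} | history: {' ; '.join(traj.records)}"
        evidence: List[str] = []
        for query in (local_query, global_query):
            for passage in retrieve(query):
                if passage not in evidence:  # drop duplicates, keep order
                    evidence.append(passage)
        # Compress the retrieved evidence into one compact record.
        traj.records.append(synthesize(question, traj.records, evidence))
    return traj


def toy_retrieve(query: str) -> List[str]:
    """Stand-in retriever: echoes the query as a single passage."""
    return [f"passage for: {query}"]


def toy_synthesize(question: str, records: List[str], evidence: List[str]) -> str:
    """Stand-in synthesizer: reports the step index and evidence count."""
    return f"record {len(records) + 1} from {len(evidence)} passages"


trajectory = progressive_search("Who designed this tower?", toy_retrieve, toy_synthesize)
print(trajectory.records)
```

In a real system the two callables would presumably wrap a multimodal retriever and a vision-language model. The sketch only shows that each step reads the trajectory, queries at two scopes, and appends one compact record.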
What carries the argument
The argument rests on the PMSR framework, which progressively builds structured reasoning trajectories by issuing dual-scope queries for retrieval and applying compositional reasoning to create compact synthesis records.
If this is right
- Retrieval recall improves across six benchmarks that span encyclopedic, real-world, and live visual questions.
- End-to-end answer accuracy rises when the same progressive trajectory is used for final response generation.
- Error propagation decreases because each synthesis step produces a compact, stable record for the next iteration.
- Heterogeneous knowledge bases can be queried more effectively through repeated, history-aware retrieval passes.
Where Pith is reading between the lines
- The same trajectory-building pattern could be tested on other multimodal tasks that require external knowledge, such as long-form image captioning.
- Iterative refinement may allow smaller retrieval budgets per step while still reaching higher overall recall than a single large retrieval pass.
- If the synthesis records remain stable, the method might support longer reasoning chains without the usual accumulation of hallucinations.
Load-bearing premise
Conditioning dual-scope queries on both the latest record and the full trajectory will acquire sufficient diverse knowledge, and compositional synthesis will produce stable records that reduce error propagation.
What would settle it
A head-to-head comparison on Encyclopedic-VQA or InfoSeek showing that PMSR produces no gain in retrieval recall or end-to-end answer accuracy relative to single-pass baselines would falsify the claimed benefit.
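A head-to-head recall comparison of this kind could be scored as follows. This is a hedged sketch: it assumes a hit-style recall@k (a question counts as recalled if any gold item appears in the top k), which may differ from the metric definition used in the paper. The function names and the tiny example data are invented for illustration and are not results from any benchmark.

```python
from typing import Dict, List, Set


def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Return 1.0 if any relevant item is in the top-k list, else 0.0 (assumed definition)."""
    return 1.0 if relevant.intersection(retrieved[:k]) else 0.0


def mean_recall(runs: Dict[str, List[str]], gold: Dict[str, Set[str]], k: int) -> float:
    """Average hit-style recall@k over all questions that have gold labels."""
    scores = [recall_at_k(runs[qid], gold[qid], k) for qid in gold]
    return sum(scores) / len(scores)


# Made-up toy data: two questions, one gold document each.
gold = {"q1": {"d1"}, "q2": {"d5"}}
single_pass = {"q1": ["d1", "d2"], "q2": ["d3", "d4"]}
progressive = {"q1": ["d2", "d1"], "q2": ["d5", "d3"]}

baseline_recall = mean_recall(single_pass, gold, k=2)
progressive_recall = mean_recall(progressive, gold, k=2)
gain = progressive_recall - baseline_recall
print(baseline_recall, progressive_recall, gain)
```

Under the falsification test stated above, a `gain` at or below zero on the real benchmarks would count against the claimed benefit.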
Original abstract
Knowledge-intensive visual question answering (VQA) requires external knowledge beyond image content, demanding precise visual grounding and coherent integration of visual and textual information. Although multimodal retrieval-augmented generation has achieved notable advances by incorporating external knowledge bases, existing approaches largely adopt single-pass frameworks that often fail to acquire sufficient knowledge and lack mechanisms to revise misdirected reasoning. We propose PMSR (Progressive Multimodal Search and Reasoning), a framework that progressively constructs a structured reasoning trajectory to enhance both knowledge acquisition and synthesis. PMSR uses dual-scope queries conditioned on both the latest record and the trajectory to retrieve diverse knowledge from heterogeneous knowledge bases. The retrieved evidence is then synthesized into compact records via compositional reasoning. This design facilitates controlled iterative refinement, which supports more stable reasoning trajectories with reduced error propagation. Extensive experiments across six diverse benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, and OK-VQA) demonstrate that PMSR consistently improves both retrieval recall and end-to-end answer accuracy.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that PMSR, a progressive multimodal search and reasoning framework, improves knowledge-intensive VQA by constructing structured reasoning trajectories via dual-scope queries (conditioned on both the latest record and full trajectory) that retrieve from heterogeneous knowledge bases, followed by compositional synthesis into compact records. This iterative refinement is said to yield higher retrieval recall and end-to-end answer accuracy than single-pass approaches, with consistent gains shown across six benchmarks (Encyclopedic-VQA, InfoSeek, MMSearch, LiveVQA, FVQA, OK-VQA).
Significance. If the result holds after proper isolation of the progressive mechanism, the work would offer a concrete advance over single-pass multimodal RAG by demonstrating how trajectory-conditioned retrieval and compositional record synthesis can reduce error propagation in knowledge-intensive visual reasoning. The multi-benchmark evaluation scope is a strength, but the absence of ablations or matched-budget controls limits the ability to credit the specific design choices.
major comments (2)
- [Experimental evaluation (across the six benchmarks)] The central claim that dual-scope queries conditioned on the latest record plus full trajectory, together with compositional synthesis into stable records, drive the reported gains (rather than simply executing more retrieval steps) is load-bearing yet untested. The abstract and experimental description provide no non-progressive multi-round baseline or control that matches total retrieval budget or round count, leaving open the possibility that improvements arise from extra retrieval effort alone.
- [Experiments and results] No ablation studies, error bars, or analysis of error propagation cases are described, which weakens support for the assertion that the progressive design produces more stable trajectories. Without these, the moderate soundness noted in the review cannot be elevated.
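The matched-budget control requested in the first major comment can be illustrated with a small sketch. Everything here is hypothetical: `BudgetedRetriever`, `non_progressive_baseline`, and the query variants are illustrative names, and the budget is counted in retrieval calls only (a token or API-cost budget would need a different counter).

```python
from typing import Callable, List


class BudgetedRetriever:
    """Wrap a retriever and enforce a hard cap on the number of calls."""

    def __init__(self, retrieve: Callable[[str], List[str]], max_calls: int) -> None:
        self._retrieve = retrieve
        self.max_calls = max_calls
        self.calls = 0

    def __call__(self, query: str) -> List[str]:
        if self.calls >= self.max_calls:
            raise RuntimeError("retrieval budget exhausted")
        self.calls += 1
        return self._retrieve(query)


def non_progressive_baseline(
    question: str,
    retriever: Callable[[str], List[str]],
    rounds: int,
    queries_per_round: int = 2,
) -> List[str]:
    """Multi-round retrieval with history-free queries and no synthesis records."""
    evidence: List[str] = []
    for _ in range(rounds):
        for variant in range(queries_per_round):
            # The query never sees earlier evidence, unlike a progressive loop.
            evidence.extend(retriever(f"{question} (variant {variant})"))
    return evidence


# Three rounds of two queries each: the same six calls a three-step,
# two-query-per-step progressive run would consume.
retriever = BudgetedRetriever(lambda query: [f"passage for: {query}"], max_calls=6)
evidence = non_progressive_baseline("Who designed this tower?", retriever, rounds=3)
print(retriever.calls, len(evidence))
```

Running the progressive method through the same wrapper with the same `max_calls` would make the two conditions comparable on retrieval effort, so any remaining gap could be credited to the query conditioning and synthesis rather than to extra calls.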
minor comments (1)
- [Abstract] The abstract would be clearer if it quantified the reported gains (e.g., absolute or relative improvements in recall and accuracy) rather than stating only that improvements are 'consistent'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our work. We address each major comment below and describe the revisions we will make to strengthen the experimental support for the progressive mechanism.
- Referee: The central claim that dual-scope queries conditioned on the latest record plus full trajectory, together with compositional synthesis into stable records, drive the reported gains (rather than simply executing more retrieval steps) is load-bearing yet untested. The abstract and experimental description provide no non-progressive multi-round baseline or control that matches total retrieval budget or round count, leaving open the possibility that improvements arise from extra retrieval effort alone.
  Authors: We agree that a matched-budget multi-round non-progressive baseline is necessary to isolate the contribution of the progressive design. While the current experiments compare PMSR against single-pass multimodal RAG baselines and show consistent gains in retrieval recall and answer accuracy, we did not include a control that performs the same number of retrieval rounds without dual-scope conditioning or compositional synthesis. In the revised manuscript we will add this baseline, matching total retrieval steps and computational budget (e.g., same number of API calls or token budget per question). This addition will allow direct attribution of performance differences to the trajectory-conditioned queries and record synthesis rather than extra retrieval effort.
  Revision: yes
- Referee: No ablation studies, error bars, or analysis of error propagation cases are described, which weakens support for the assertion that the progressive design produces more stable trajectories. Without these, the moderate soundness noted in the review cannot be elevated.
  Authors: We acknowledge that the absence of component ablations, statistical error bars, and targeted error-propagation analysis limits the strength of our claims about trajectory stability. In the revision we will add (1) ablations that remove dual-scope conditioning and compositional synthesis individually while keeping the iterative loop, (2) error bars computed over multiple random seeds for the main results on all six benchmarks, and (3) a qualitative case study that traces specific error-propagation examples, showing how the progressive record synthesis corrects early mistakes that persist in single-pass or non-compositional variants. These additions will provide clearer evidence that the observed improvements stem from reduced error accumulation.
  Revision: yes
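The promised error bars over random seeds could be reported along these lines. This is a sketch only: the variant names mirror the ablations listed in the rebuttal, but the scores are obviously synthetic placeholders (not results from the paper), and the sample standard deviation is one assumed choice of error bar among several.

```python
import statistics
from typing import Dict, List, Tuple


def summarize(scores: List[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) for one variant's per-seed scores."""
    mean = statistics.mean(scores)
    # stdev needs at least two points; report 0.0 for a single seed.
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std


# Placeholder per-seed scores, invented purely to exercise the code.
per_seed_scores: Dict[str, List[float]] = {
    "full": [1.0, 2.0, 3.0],
    "no_dual_scope": [2.0, 2.0, 2.0],
    "no_compositional_synthesis": [1.0, 3.0],
}

summary = {name: summarize(scores) for name, scores in per_seed_scores.items()}
for name, (mean, std) in summary.items():
    print(f"{name}: {mean:.2f} +/- {std:.2f}")
```

Reporting each ablation with the same seeds as the full system would make it possible to judge whether a gap between variants exceeds seed-to-seed variation.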
Circularity Check
No circularity: PMSR is an independent architectural proposal validated empirically.
The paper introduces PMSR as a design framework for progressive multimodal search and reasoning, specifying dual-scope queries conditioned on latest record and trajectory plus compositional synthesis into records. These are presented as explicit methodological choices rather than quantities derived from equations, fitted parameters, or self-citations. No load-bearing steps reduce by construction to inputs; the abstract and description frame the approach as an external proposal whose value is assessed via experiments on six benchmarks. The central claims rest on empirical improvements in recall and accuracy, not on self-referential definitions or imported uniqueness results from the authors' prior work.
Forward citations
Cited by 1 Pith paper
- WikiSeeker: Rethinking the Role of Vision-Language Models in Knowledge-Based Visual Question Answering
  WikiSeeker boosts KB-VQA performance by using VLMs to rewrite image-informed queries for better retrieval and to decide when to route to external LLM or rely on internal VLM knowledge.