pith. machine review for the scientific record.

arxiv: 2410.10594 · v2 · submitted 2024-10-14 · 💻 cs.IR · cs.AI · cs.CL · cs.CV

Recognition: 1 theorem link

· Lean Theorem

VisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 15:33 UTC · model grok-4.3

classification 💻 cs.IR · cs.AI · cs.CL · cs.CV
keywords retrieval-augmented generation · vision-language models · multi-modal documents · image embedding · document retrieval · RAG pipeline · layout preservation

The pith

VisRAG retrieves and generates from multi-modal documents by embedding them directly as images rather than parsing to text.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

VisRAG builds a retrieval-augmented generation pipeline that feeds document pages straight into vision-language models as images. Traditional systems first convert pages to plain text, which drops layout details, tables, figures, and other visual cues. By skipping that conversion step, VisRAG keeps the full visual structure available for both retrieval and answer generation. The authors train the retriever on a combination of public and synthetic image data and test several ways to condition generation on the retrieved images. End-to-end results improve by 20 to 40 percent over standard text-only RAG pipelines while using training data efficiently and generalizing to new documents.

Core claim

VisRAG replaces the text-parsing stage of RAG with direct image embedding: a vision-language model encodes entire document pages as images, a retriever selects relevant pages by visual similarity, and a VLM generator produces answers conditioned on those image chunks. This pipeline retains layout and visual information that text extraction discards, yielding higher retrieval precision and stronger generation quality with 20-40 percent overall gains.

What carries the argument

A vision-language model retriever that embeds document pages as whole images and ranks them by visual similarity, followed by VLM generation that conditions directly on the selected image regions.
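The retrieval half of that pipeline reduces to embedding pages and ranking by similarity. A minimal sketch, with toy fixed vectors standing in for the paper's trained VLM embedder (every name and vector below is illustrative, not the authors' code):

```python
import math

def cosine(u, v):
    # Cosine similarity between two equal-length vectors.
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def retrieve(query_vec, page_vecs, k=3):
    # Rank document-page embeddings by similarity to the query embedding.
    # In VisRAG these vectors would come from a VLM encoding whole page
    # images; here they are placeholder 2-d vectors.
    ranked = sorted(page_vecs, key=lambda p: cosine(query_vec, page_vecs[p]),
                    reverse=True)
    return ranked[:k]

# Toy index: two "pages" embedded as stand-in vectors.
pages = {"page_1": [0.9, 0.1], "page_2": [0.1, 0.9]}
top = retrieve([1.0, 0.0], pages, k=1)
# The retrieved page images (not parsed text) would then be handed to a
# VLM generator as conditioning context.
```

The point of the design is that nothing in this loop ever calls an OCR or layout parser; the page image itself is the retrieval unit.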

If this is right

  • Document pipelines no longer require separate OCR or layout-analysis steps for many visual-heavy files.
  • Retrieval can exploit spatial cues such as table structure and figure placement that text loses.
  • Generation quality rises because the model sees the original visuals instead of reconstructed text.
  • Training data needs remain modest because the same image embeddings support both retrieval and generation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same direct-image approach could be tested on video frames or slide decks where temporal layout matters.
  • Hybrid systems might combine image retrieval for visual sections with text retrieval for dense prose.
  • Indexing costs shift from text tokenization to image feature extraction, which may favor different hardware choices.
  • If generalization holds, VisRAG-style pipelines could replace parsing-heavy stacks in legal, scientific, and technical document tools.

Load-bearing premise

Vision-language models can extract and match the relevant content from document images at least as well as text parsers without losing critical layout or visual details.

What would settle it

An experiment on a held-out set of real multi-modal documents in which the image-embedding retriever and generator produce lower end-to-end accuracy than a strong text-parsing baseline.
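Such a settling experiment is just end-to-end accuracy measured twice on the same held-out questions. A minimal harness, with hypothetical stand-ins for the two pipelines (the "Club Jetty" case is borrowed from the paper's own DocVQA example, where a decorative font defeats text extraction):

```python
def exact_match_accuracy(system, qa_pairs):
    # Fraction of questions the system answers with the exact gold string,
    # after simple whitespace/case normalization.
    norm = lambda s: " ".join(s.lower().split())
    hits = sum(norm(system(q)) == norm(gold) for q, gold in qa_pairs)
    return hits / len(qa_pairs)

# Hypothetical stand-ins for the two end-to-end pipelines under comparison.
vis_rag = lambda q: {"Q1": "Club Jetty", "Q2": "1994"}.get(q, "")
text_rag = lambda q: {"Q1": "", "Q2": "1994"}.get(q, "")  # parser dropped the name

held_out = [("Q1", "Club Jetty"), ("Q2", "1994")]
# If text_rag's accuracy exceeded vis_rag's on real held-out documents,
# the paper's central claim would be in trouble.
```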

read the original abstract

Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20--40% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is efficient in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces VisRAG, a vision-language model (VLM)-based retrieval-augmented generation pipeline for multi-modality documents. Documents are embedded directly as images using a VLM for retrieval, then used to augment generation by another VLM, avoiding text parsing losses from layout and images. The retriever is trained on collected open-source and synthetic data; experiments claim 20-40% end-to-end gains over traditional text-based RAG in both retrieval and generation stages, with analysis of data efficiency and generalization.

Significance. If the performance gains hold under fair, fully specified baselines, VisRAG could meaningfully advance RAG for real-world documents by preserving visual and layout information that text pipelines discard. The open release of code and data at the cited GitHub repository is a clear strength for reproducibility and extension in the IR community.

major comments (2)
  1. [Abstract and Experiments] The central claim of 20--40% end-to-end gains over traditional text-based RAG is load-bearing but unsupported without any description of the text extraction method (OCR engine, layout parser, or tool) used to create the baseline pipeline. If the baseline parser is lossy on multi-modal documents, reported gains may reflect avoidance of parsing artifacts rather than superior vision-based retrieval.
  2. [Experiments] No details are provided on the specific retrieval and generation metrics, data splits, error bars, statistical significance tests, or controls for post-hoc choices in synthetic data creation. These omissions leave the outperformance claim only partially supported and difficult to reproduce or compare.
minor comments (1)
  1. [Method] The description of VLM backbone choices and training hyperparameters could be expanded with a table for clarity and reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We agree that greater specificity on the text-based baseline and experimental protocol is required for reproducibility and fair comparison. We have revised the manuscript to address both major comments and provide point-by-point responses below.

read point-by-point responses
  1. Referee: [Abstract and Experiments] The central claim of 20--40% end-to-end gains over traditional text-based RAG is load-bearing but unsupported without any description of the text extraction method (OCR engine, layout parser, or tool) used to create the baseline pipeline. If the baseline parser is lossy on multi-modal documents, reported gains may reflect avoidance of parsing artifacts rather than superior vision-based retrieval.

    Authors: We agree that the original manuscript lacked sufficient detail on the text extraction pipeline used for the traditional RAG baseline, which is necessary to substantiate the performance claims. The baseline employed pdfplumber for layout-aware text extraction combined with Tesseract OCR (v5.0) for embedded images, with no additional post-processing beyond standard cleaning. We have added a dedicated paragraph in the Experiments section describing this pipeline, including the exact tools, versions, and parameters. While we maintain that the observed gains arise primarily from direct visual embedding and retrieval (supported by our ablation studies comparing parsed vs. image inputs), we acknowledge that explicit baseline specification strengthens the claim and have incorporated the requested description. revision: yes

  2. Referee: [Experiments] No details are provided on the specific retrieval and generation metrics, data splits, error bars, statistical significance tests, or controls for post-hoc choices in synthetic data creation. These omissions leave the outperformance claim only partially supported and difficult to reproduce or compare.

    Authors: We concur that the original Experiments section omitted key reproducibility details. In the revised version we now report: retrieval metrics (Recall@5, nDCG@10), generation metrics (Exact Match, F1, ROUGE-L), data splits (70/15/15 train/validation/test per dataset), error bars as standard deviation over three independent runs, and paired t-tests for statistical significance (p < 0.05 reported). For synthetic data creation we have added the exact prompt templates, sampling parameters, and filtering criteria to the appendix. These additions directly address the concerns and make the experimental protocol fully specified. revision: yes
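For reference, the retrieval metrics named in that response reduce to a few lines under binary relevance. This is a generic sketch of Recall@k and nDCG@k, not the authors' evaluation code:

```python
import math

def recall_at_k(ranked_ids, relevant_ids, k=5):
    # Fraction of relevant documents that appear in the top-k ranking.
    return len(set(ranked_ids[:k]) & set(relevant_ids)) / len(relevant_ids)

def ndcg_at_k(ranked_ids, relevant_ids, k=10):
    # Binary-relevance nDCG: discounted gain of the actual ranking divided
    # by the gain of an ideal ranking (all relevant documents first).
    rel = [1.0 if d in relevant_ids else 0.0 for d in ranked_ids[:k]]
    dcg = sum(r / math.log2(i + 2) for i, r in enumerate(rel))
    ideal = sum(1.0 / math.log2(i + 2)
                for i in range(min(len(relevant_ids), k)))
    return dcg / ideal if ideal else 0.0
```

A relevant page ranked second instead of first scores 1/log2(3) ≈ 0.63 rather than 1.0, which is why nDCG separates retrievers that Recall@k alone cannot.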

Circularity Check

0 steps flagged

No significant circularity; empirical system with external training and evaluation

full rationale

The paper introduces VisRAG as a VLM-based pipeline that embeds multi-modal documents directly as images for retrieval and generation, trained on collected open-source plus synthetic data. All central claims (20-40% end-to-end gains) are presented as experimental outcomes rather than derived from equations or parameters internal to the paper. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the abstract or described method; the approach relies on external VLM capabilities and new data rather than reducing results to quantities defined by the method itself. This is the expected non-circular finding for an empirical systems paper.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method depends on the assumption that existing VLMs can serve as effective document embedders and on choices made when mixing open-source and synthetic training data; no new physical entities are postulated.

free parameters (1)
  • Retriever training data composition
    Open-source and synthetic data are collected and mixed to train the retriever; the exact proportions and selection criteria are free choices that affect performance.
axioms (1)
  • domain assumption Vision-language models can embed document images in a way that supports effective retrieval for generation tasks.
    Invoked when the pipeline directly uses VLM image embeddings instead of parsed text.
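That free parameter can be made concrete as a single mixing knob. A sketch under the assumption that the mix is a simple sampled proportion (the function, pool names, and API below are hypothetical; the paper does not publish its exact ratio):

```python
import random

def mix_training_data(open_source, synthetic, synthetic_fraction, n, seed=0):
    # synthetic_fraction is the free parameter flagged in the ledger:
    # the retriever is trained on open-source plus synthetic examples,
    # but the proportion is a design choice that affects performance.
    rng = random.Random(seed)
    n_syn = round(n * synthetic_fraction)
    picks = rng.sample(synthetic, n_syn) + rng.sample(open_source, n - n_syn)
    rng.shuffle(picks)
    return picks

open_pool = [f"os_{i}" for i in range(100)]
syn_pool = [f"syn_{i}" for i in range(100)]
batch = mix_training_data(open_pool, syn_pool, synthetic_fraction=0.3, n=20)
```

Sweeping `synthetic_fraction` on a held-out retrieval set is the obvious way to show the reported gains are not an artifact of one particular mix.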

pith-pipeline@v0.9.0 · 5597 in / 1207 out tokens · 81655 ms · 2026-05-16T15:33:33.637224+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Cost.FunctionalEquation washburn_uniqueness_aczel · tag: unclear

    Relation between the paper passage and the cited Recognition theorem.

    In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. ... Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20–40% end-to-end performance gain over traditional text-based RAG pipeline.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 21 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence

    cs.CL 2026-05 accept novelty 8.0

    CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...

  2. DocPrune:Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning

    cs.CV 2026-04 unverdicted novelty 7.0

    DocPrune is a training-free token pruning method that removes background and irrelevant tokens from document images using question and comprehension signals, yielding 3x encoder and 3.3x decoder throughput gains plus ...

  3. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 unverdicted novelty 7.0

    SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.

  4. Bottleneck Tokens for Unified Multimodal Retrieval

    cs.LG 2026-04 unverdicted novelty 7.0

    Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.

  5. Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval

    cs.CV 2026-04 unverdicted novelty 7.0

    ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.

  6. VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning

    cs.CV 2026-04 unverdicted novelty 7.0

    VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...

  7. MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL

    cs.IR 2026-04 unverdicted novelty 7.0

    MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.

  8. PLUME: Latent Reasoning Based Universal Multimodal Embedding

    cs.CV 2026-04 unverdicted novelty 7.0

    PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.

  9. MMSearch-R1: Incentivizing LMMs to Search

    cs.CV 2025-06 unverdicted novelty 7.0

    MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...

  10. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 6.0

    VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.

  11. Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings

    cs.CV 2026-04 unverdicted novelty 6.0

    Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.

  12. POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch

    cs.CV 2026-04 unverdicted novelty 6.0

    POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.

  13. SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs

    cs.CV 2026-04 conditional novelty 6.0

    SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.

  14. HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval

    cs.IR 2026-04 unverdicted novelty 6.0

    HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...

  15. FileGram: Grounding Agent Personalization in File-System Behavioral Traces

    cs.CV 2026-04 unverdicted novelty 6.0

    FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.

  16. MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph

    cs.CV 2026-05 unverdicted novelty 5.0

    MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.

  17. VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving

    cs.CV 2026-05 unverdicted novelty 5.0

    VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.

  18. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.

  19. DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

    cs.AI 2026-04 unverdicted novelty 5.0

    DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...

  20. BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment

    cs.IR 2026-04 unverdicted novelty 5.0

    BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...

  21. Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding

    cs.AI 2026-05 unverdicted novelty 3.0

    Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 18 Pith papers · 10 internal anchors

  1. [1]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,

  2. [2]

    A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity

    Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. In Proceedings of AACL/IJCNLP 2023, pp. 675–718,

  3. [3]

    Allava: Harnessing gpt4v-synthesized data for a lite vision-language model

    Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhihong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024a. Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guiro...

  4. [4]

    Pp-OCR: A Practical Ultra Lightweight OCR System

    Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, and Haoshuang Wang. Pp-OCR: A Practical Ultra Lightweight OCR System. arXiv, abs/2009.09941,

  5. [5]

    ColPali: Efficient Document Retrieval with Vision Language Models

    Manuel Faysse, Hugues Sibille, Tony Wu, Gautier Viaud, Céline Hudelot, and Pierre Colombo. Colpali: Efficient document retrieval with vision language models. arXiv preprint arXiv:2407.01449,

  6. [6]

    Cogagent: A visual language model for gui agents

    Published as a conference paper at ICLR 2025. Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of CVPR, pp. 14281–14290,

  7. [7]

    mplug-docowl 1.5: Unified structure learning for ocr-free document understanding

    Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of EMNLP, pp. 3096–3120, 2024a. Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zh...

  8. [8]

    What matters when building vision-language models? arXiv preprint arXiv:2405.02246,

    Hugo Laurenc ¸on, L´eo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246,

  9. [9]

    NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models

    Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catanzaro, and Wei Ping. Nv-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models. arXiv, abs/2405.17428,

  10. [10]

    LLaVA-OneVision: Easy Visual Task Transfer

    Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for...

  11. [11]

    Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models

    Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575,

  12. [12]

    Retrieval- augmented multi-modal chain-of-thoughts reasoning for large language models

    Bingshuai Liu, Chenyang Lyu, Zijun Min, Zhanyu Wang, Jinsong Su, and Longyue Wang. Retrieval-augmented multi-modal chain-of-thoughts reasoning for large language models. arXiv preprint arXiv:2312.01714, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Proceedings of NeurIPS, volume 36, pp. 34892–34916, 2023b....

  13. [13]

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai

    URL https://github.com/jerryjliu/llama_index. Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document. arXiv preprint arXiv:2403.04473, 2024b. Man Luo, Yankai Zeng, Pratyay Banerjee, and Chitta Baral. Weakly-supervised visual-retriever-reader for knowledg...

  14. [14]

    Unifying multimodal retrieval via document screenshot embedding

    Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. arXiv preprint arXiv:2406.11251,

  15. [15]

    Sgpt: Gpt sentence embeddings for semantic search

    Niklas Muennighoff. Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904,

  16. [16]

    Mteb: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. Mteb: Massive text embedding benchmark. In Proceedings of EACL, pp. 2014–2037,

  17. [17]

    Hello, gpt-4o — openai,

    OpenAI. Hello, gpt-4o — openai,

  18. [18]

    Wikichat: A few-shot llm-based chatbot grounded with wikipedia

    Sina J Semnani, Violet Z Yao, Heidi C Zhang, and Monica S Lam. Wikichat: A few-shot llm-based chatbot grounded with wikipedia. arXiv preprint arXiv:2305.14292,

  19. [19]

    Unirag: Universal retrieval augmentation for multi-modal large language models

    Sahel Sharifymoghaddam, Shivani Upadhyay, Wenhu Chen, and Jimmy Lin. Unirag: Universal retrieval augmentation for multi-modal large language models. arXiv preprint arXiv:2405.10311,

  20. [20]

    Retrieval meets reasoning: Even high-school textbook knowledge benefits multimodal reasoning

    Cheng Tan, Jingxuan Wei, Linzhuang Sun, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, and Stan Z Li. Retrieval meets reasoning: Even high-school textbook knowledge benefits multimodal reasoning. arXiv preprint arXiv:2405.20834,

  21. [21]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295,

  22. [22]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288,

  23. [23]

    Freshllms: Refreshing large language models with search engine augmentation

    Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, et al. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214,

  24. [24]

    CogVLM: Visual Expert for Pretrained Language Models

    Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079,

  25. [25]

    Uniir: Training and benchmarking universal multimodal information retrievers

    Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers. arXiv preprint arXiv:2311.17136,

  26. [26]

    C-Pack: Packed Resources For General Chinese Embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597 ,

  27. [27]

    Chatqa 2: Bridging the gap to proprietary llms in long context and rag capabilities

    Peng Xu, Wei Ping, Xianchao Wu, Zihan Liu, Mohammad Shoeybi, and Bryan Catanzaro. Chatqa 2: Bridging the gap to proprietary llms in long context and rag capabilities. arXiv preprint arXiv:2407.14482, 2024a. Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. Llava-uhd: an lmm perceivi...

  28. [28]

    Qwen2 Technical Report

    An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671,

  29. [29]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm-V: A GPT-4v Level MLLM on Your Phone. arXiv, abs/2408.01800,

  30. [30]

    mplug-docowl: Modularized multimodal large language model for document understanding

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023a. Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Z...

  31. [31]

    Map-neo: Highly capable and transparent bilingual large language model series

    Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, et al. Map-neo: Highly capable and transparent bilingual large language model series. arXiv preprint arXiv:2405.19327,

  32. [32]

    MARVEL: unlocking the multi-modal capability of dense retrieval via visual module plugin

    Tianshuo Zhou, Sen Mei, Xinze Li, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu, Yu Gu, and Ge Yu. MARVEL: unlocking the multi-modal capability of dense retrieval via visual module plugin. In Proceedings of ACL, pp. 14608–14624,

  33. [33]

    Rageval: Scenario specific rag evaluation dataset generation framework

    Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, et al. Rageval: Scenario specific rag evaluation dataset generation framework. arXiv preprint arXiv:2408.01262,

  34. [34]

    We prompt GPT-4o to generate queries on these documents

    A DATA CONSTRUCTION DETAILS. A.1 SYNTHETIC DATA. Table 4: Statistics of crawled documents. We prompt GPT-4o to generate queries on these documents. Name Source Description # Pages Textbooks https://openstax.org/ College-level textbooks including various subjects 10,000 ICML Papers ICML 2023 ICML papers on ...

  35. [35]

    Although this filtering step reduces context-dependent queries, a small number may still remain

    using the instruction shown in Figure 7, which includes human-annotated samples from DocVQA. Although this filtering step reduces context-dependent queries, a small number may still remain. However, their presence is minimal and does not significantly impact the overall quality of our dataset. I have some QA...

  36. [36]

    (Captioner)

    Methods based on PPOCR demonstrate significantly better performance compared to pytesseract, with adjacent merging and layout preserving yielding similar results. Consequently, we opt to use the adjacent merging policy for our “(OCR)” runs. Table 5: Overall retrieval performance of different document parsing pipelines. ArxivQA ChartQA DocVQA InfoVQA PlotQ...

  37. [37]

    In this paper, we employ MiniCPM to construct the baseline text-based retriever (Table

    and Gemma-7B (Team et al., 2024). In this paper, we employ MiniCPM to construct the baseline text-based retriever (Table

  38. [38]

    It is built upon SigLIP-400M and Qwen2-7B (Yang et al.,

    is an upgrade of MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5 (Yao et al., 2024). It is built upon SigLIP-400M and Qwen2-7B (Yang et al.,

  39. [39]

    Different from previous models, MiniCPM-V 2.6 can accept multiple images as the input and perform multi-modal in-context learning

    with a total of 8.5B parameters, exhibiting a significant performance improvement over MiniCPM-Llama3-V 2.5 (Yao et al., 2024). Different from previous models, MiniCPM-V 2.6 can accept multiple images as the input and perform multi-modal in-context learning. It also demonstrates stronger OCR capabilities. We use MiniCPM-V 2.6 to build VisRAG-Gen (Table

  40. [40]

    Model # Para

    https://huggingface.co/HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit
    Table 6: Overall retrieval performance in Recall@10. Model # Para. ArxivQA ChartQA DocVQA InfoVQA PlotQA SlideVQA Average (a) Off-the-shelf Models BM25 (OCR) n.a. 54.29 79.37 86.80 82.59 76.01 91.64 78.45 bge-large (2023) (OCR) 335M 48.65 ...

  41. [41]

    End-to-end Performance

    In both instances, we compare VisRAG with TextRAG, maintaining the same setup as described in the “End-to-end Performance” paragraph in Sec. 5.1. In the first case from DocVQA, the user queries about “Club Jetty,” however, the term “Club Jetty” in the relevant document is not successfully extracted due to its decorative font. This leads to TextRAG failing...

  42. [42]

    It is a straightforward method to integrate textual information extracted from the page with its visual clues

    to combine the outputs of MiniCPM (OCR) and SigLIP. It is a straightforward method to integrate textual information extracted from the page with its visual clues. The results indicate that fusing text and image modalities provides a meaningful performance boost over individual modality baselines. However, this approach still falls short of the performan...

  43. [43]

    We report offline latencies per document, including document parsing and encoding latencies, as well as online latencies per query, including query encoding and search latencies

    As shown in the table, although VisRAG-Ret, a VLM-based model, requires more time for document encoding compared to MiniCPM (OCR), it bypasses the time-consuming parsing stage required by
    Table 12: Retrieval efficiency (ms). We report offline latencies per document, including document parsing and encoding la...