Recognition: 1 theorem link
· Lean TheoremVisRAG: Vision-based Retrieval-augmented Generation on Multi-modality Documents
Pith reviewed 2026-05-16 15:33 UTC · model grok-4.3
The pith
VisRAG retrieves and generates from multi-modal documents by embedding them directly as images rather than parsing to text.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
VisRAG replaces the text-parsing stage of RAG with direct image embedding: a vision-language model encodes entire document pages as images, a retriever selects relevant pages by visual similarity, and a VLM generator produces answers conditioned on those image chunks. This pipeline retains layout and visual information that text extraction discards, yielding higher retrieval precision and stronger generation quality with 20-40 percent overall gains.
What carries the argument
A vision-language model retriever that embeds document pages as whole images and ranks them by visual similarity, followed by VLM generation that conditions directly on the selected image regions.
If this is right
- Document pipelines no longer require separate OCR or layout-analysis steps for many visual-heavy files.
- Retrieval can exploit spatial cues such as table structure and figure placement that text loses.
- Generation quality rises because the model sees the original visuals instead of reconstructed text.
- Training data needs remain modest because the same image embeddings support both retrieval and generation.
Where Pith is reading between the lines
- The same direct-image approach could be tested on video frames or slide decks where temporal layout matters.
- Hybrid systems might combine image retrieval for visual sections with text retrieval for dense prose.
- Indexing costs shift from text tokenization to image feature extraction, which may favor different hardware choices.
- If generalization holds, VisRAG-style pipelines could replace parsing-heavy stacks in legal, scientific, and technical document tools.
Load-bearing premise
Vision-language models can extract and match the relevant content from document images at least as well as text parsers without losing critical layout or visual details.
What would settle it
An experiment on a held-out set of real multi-modal documents in which the image-embedding retriever and generator produce lower end-to-end accuracy than a strong text-parsing baseline.
read the original abstract
Retrieval-augmented generation (RAG) is an effective technique that enables large language models (LLMs) to utilize external knowledge sources for generation. However, current RAG systems are solely based on text, rendering it impossible to utilize vision information like layout and images that play crucial roles in real-world multi-modality documents. In this paper, we introduce VisRAG, which tackles this issue by establishing a vision-language model (VLM)-based RAG pipeline. In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. Compared to traditional text-based RAG, VisRAG maximizes the retention and utilization of the data information in the original documents, eliminating the information loss introduced during the parsing process. We collect both open-source and synthetic data to train the retriever in VisRAG and explore a variety of generation methods. Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20--40% end-to-end performance gain over traditional text-based RAG pipeline. Further analysis reveals that VisRAG is efficient in utilizing training data and demonstrates strong generalization capability, positioning it as a promising solution for RAG on multi-modality documents. Our code and data are available at https://github.com/openbmb/visrag.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces VisRAG, a vision-language model (VLM)-based retrieval-augmented generation pipeline for multi-modality documents. Documents are embedded directly as images using a VLM for retrieval, then used to augment generation by another VLM, avoiding text parsing losses from layout and images. The retriever is trained on collected open-source and synthetic data; experiments claim 20-40% end-to-end gains over traditional text-based RAG in both retrieval and generation stages, with analysis of data efficiency and generalization.
Significance. If the performance gains hold under fair, fully specified baselines, VisRAG could meaningfully advance RAG for real-world documents by preserving visual and layout information that text pipelines discard. The open release of code and data at the cited GitHub repository is a clear strength for reproducibility and extension in the IR community.
major comments (2)
- [Abstract and Experiments] Abstract and Experiments section: The central claim of 20--40% end-to-end gains over traditional text-based RAG is load-bearing but unsupported without any description of the text extraction method (OCR engine, layout parser, or tool) used to create the baseline pipeline. If the baseline parser is lossy on multi-modal documents, reported gains may reflect avoidance of parsing artifacts rather than superior vision-based retrieval.
- [Experiments] Experiments section: No details are provided on the specific retrieval and generation metrics, data splits, error bars, statistical significance tests, or controls for post-hoc choices in synthetic data creation. These omissions leave the outperformance claim only partially supported and difficult to reproduce or compare.
minor comments (1)
- [Method] The description of VLM backbone choices and training hyperparameters could be expanded with a table for clarity and reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We agree that greater specificity on the text-based baseline and experimental protocol is required for reproducibility and fair comparison. We have revised the manuscript to address both major comments and provide point-by-point responses below.
read point-by-point responses
-
Referee: [Abstract and Experiments] Abstract and Experiments section: The central claim of 20--40% end-to-end gains over traditional text-based RAG is load-bearing but unsupported without any description of the text extraction method (OCR engine, layout parser, or tool) used to create the baseline pipeline. If the baseline parser is lossy on multi-modal documents, reported gains may reflect avoidance of parsing artifacts rather than superior vision-based retrieval.
Authors: We agree that the original manuscript lacked sufficient detail on the text extraction pipeline used for the traditional RAG baseline, which is necessary to substantiate the performance claims. The baseline employed pdfplumber for layout-aware text extraction combined with Tesseract OCR (v5.0) for embedded images, with no additional post-processing beyond standard cleaning. We have added a dedicated paragraph in the Experiments section describing this pipeline, including the exact tools, versions, and parameters. While we maintain that the observed gains arise primarily from direct visual embedding and retrieval (supported by our ablation studies comparing parsed vs. image inputs), we acknowledge that explicit baseline specification strengthens the claim and have incorporated the requested description. revision: yes
-
Referee: [Experiments] Experiments section: No details are provided on the specific retrieval and generation metrics, data splits, error bars, statistical significance tests, or controls for post-hoc choices in synthetic data creation. These omissions leave the outperformance claim only partially supported and difficult to reproduce or compare.
Authors: We concur that the original Experiments section omitted key reproducibility details. In the revised version we now report: retrieval metrics (Recall@5, nDCG@10), generation metrics (Exact Match, F1, ROUGE-L), data splits (70/15/15 train/validation/test per dataset), error bars as standard deviation over three independent runs, and paired t-tests for statistical significance (p < 0.05 reported). For synthetic data creation we have added the exact prompt templates, sampling parameters, and filtering criteria to the appendix. These additions directly address the concerns and make the experimental protocol fully specified. revision: yes
Circularity Check
No significant circularity; empirical system with external training and evaluation
full rationale
The paper introduces VisRAG as a VLM-based pipeline that embeds multi-modal documents directly as images for retrieval and generation, trained on collected open-source plus synthetic data. All central claims (20-40% end-to-end gains) are presented as experimental outcomes rather than derived from equations or parameters internal to the paper. No self-definitional reductions, fitted inputs renamed as predictions, or load-bearing self-citation chains appear in the abstract or described method; the approach relies on external VLM capabilities and new data rather than reducing results to quantities defined by the method itself. This is the expected non-circular finding for an empirical systems paper.
Axiom & Free-Parameter Ledger
free parameters (1)
- Retriever training data composition
axioms (1)
- domain assumption Vision-language models can embed document images in a way that supports effective retrieval for generation tasks.
Lean theorems connected to this paper
-
Cost.FunctionalEquationwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
In this pipeline, instead of first parsing the document to obtain text, the document is directly embedded using a VLM as an image and then retrieved to enhance the generation of a VLM. ... Experiments demonstrate that VisRAG outperforms traditional RAG in both the retrieval and generation stages, achieving a 20–40% end-to-end performance gain over traditional text-based RAG pipeline.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 21 Pith papers
-
CiteVQA: Benchmarking Evidence Attribution for Trustworthy Document Intelligence
CiteVQA requires models to cite specific document regions with bounding boxes alongside answers and finds that even the strongest MLLMs frequently cite the wrong region, with top SAA scores of only 76.0 for closed mod...
-
DocPrune:Efficient Document Question Answering via Background, Question, and Comprehension-aware Token Pruning
DocPrune is a training-free token pruning method that removes background and irrelevant tokens from document images using question and comprehension signals, yielding 3x encoder and 3.3x decoder throughput gains plus ...
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ turns frozen MLLMs into retrievers via shared latent queries appended to inputs, outperforming fine-tuning on COCO and Flickr30K while introducing KARR-Bench for knowledge-aware evaluation.
-
Bottleneck Tokens for Unified Multimodal Retrieval
Bottleneck Tokens paired with a masked generative objective achieve state-of-the-art unified multimodal retrieval performance among 2B-scale models on the MMEB-V2 benchmark with 78 datasets.
-
Visual Late Chunking: An Empirical Study of Contextual Chunking for Efficient Visual Document Retrieval
ColChunk adaptively chunks visual document patches into contextual multi-vectors via clustering, cutting storage by over 90% while raising average nDCG@5 by 9 points.
-
VISOR: Agentic Visual Retrieval-Augmented Generation via Iterative Search and Over-horizon Reasoning
VISOR is a unified agentic VRAG framework with Evidence Space structuring, visual action evaluation/correction, and dynamic sliding-window trajectories trained via GRPO-based RL that achieves SOTA performance on long-...
-
MARVEL: Multimodal Adaptive Reasoning-intensiVe Expand-rerank and retrievaL
MARVEL reaches 37.9 nDCG@10 on the MM-BRIGHT benchmark by combining LLM query expansion, a reasoning-enhanced dense retriever, and GPT-4o CoT reranking, beating prior multimodal encoders by 10.3 points.
-
PLUME: Latent Reasoning Based Universal Multimodal Embedding
PLUME uses latent-state autoregressive rollouts and a progressive training curriculum to deliver efficient reasoning for universal multimodal embeddings without generating explicit rationales.
-
MMSearch-R1: Incentivizing LMMs to Search
MMSearch-R1 uses reinforcement learning to train multimodal models for on-demand multi-turn internet search with image and text tools, outperforming same-size RAG baselines and matching larger ones while cutting searc...
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG reaches a new state-of-the-art Driving Score of 89.12 on Bench2Drive by retrieving structure-aware historical knowledge through spatiotemporal semantic graphs and Graph-DTW alignment.
-
Beyond Chain-of-Thought: Rewrite as a Universal Interface for Generative Multimodal Embeddings
Rewrite-driven generation with alignment and RL produces shorter, more effective generative multimodal embeddings than CoT methods on retrieval benchmarks.
-
POINTS-Seeker: Towards Training a Multimodal Agentic Search Model from Scratch
POINTS-Seeker-8B is an 8B multimodal model trained from scratch for agentic search that uses seeding and visual-space history folding to outperform prior models on six visual reasoning benchmarks.
-
SLQ: Bridging Modalities via Shared Latent Queries for Retrieval with Frozen MLLMs
SLQ adapts frozen MLLMs for multimodal retrieval by appending shared latent queries to text and image tokens and introduces KARR-Bench to test knowledge-aware reasoning retrieval.
-
HIVE: Query, Hypothesize, Verify An LLM Framework for Multimodal Reasoning-Intensive Retrieval
HIVE raises multimodal retrieval nDCG@10 to 41.7 on the MM-BRIGHT benchmark by inserting LLM-driven hypothesis generation and verification between retrieval passes, delivering +9.5 over the best text-only baseline and...
-
FileGram: Grounding Agent Personalization in File-System Behavioral Traces
FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
-
MicroWorld: Empowering Multimodal Large Language Models to Bridge the Microscopic Domain Gap with Multimodal Attribute Graph
MicroWorld constructs a multimodal attributed property graph from scientific image-caption data and augments MLLM prompts via retrieval to raise Qwen3-VL-8B performance by 37.5% on MicroVQA and 6% on MicroBench.
-
VLADriver-RAG: Retrieval-Augmented Vision-Language-Action Models for Autonomous Driving
VLADriver-RAG achieves state-of-the-art performance on Bench2Drive by grounding VLA planning in structure-aware retrieved priors via spatiotemporal semantic graphs and Graph-DTW alignment.
-
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
DocSeeker uses supervised fine-tuning on distilled data followed by evidence-aware group relative policy optimization to improve long-document understanding and evidence grounding in MLLMs.
-
DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding
DocSeeker improves long-document understanding in MLLMs via a two-stage training process that combines supervised fine-tuning from distilled data with evidence-aware group relative policy optimization and memory-effic...
-
BRIDGE: Multimodal-to-Text Retrieval via Reinforcement-Learned Query Alignment
BRIDGE reaches 29.7 nDCG@10 on MM-BRIGHT by RL-aligning multimodal queries to text and using a reasoning retriever, beating multimodal encoders and, when combined with Nomic-Vision, exceeding the best text-only retrie...
-
Shaping Schema via Language Representation as the Next Frontier for LLM Intelligence Expanding
Advanced language representations shape LLMs' schemas to improve knowledge activation and problem-solving.
Reference graph
Works this paper leans on
-
[1]
Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Yejin Bang, Samuel Cahyawijaya, Nayeon Lee, Wenliang Dai, Dan Su, Bryan Wilie, Holy Lovenia, Ziwei Ji, Tiezheng Yu, Willy Chung, et al. A multitask, multilingual, multimodal evaluation of chatgpt on reasoning, hallucination, and interactivity. In Proceedings of AACL/IJCNLP 2023, pp. 675–718,
work page 2023
-
[3]
Allava: Harnessing gpt4v-synthesized data for a lite vision-language model
Guiming Hardy Chen, Shunian Chen, Ruifei Zhang, Junying Chen, Xiangbo Wu, Zhiyi Zhang, Zhi- hong Chen, Jianquan Li, Xiang Wan, and Benyou Wang. Allava: Harnessing gpt4v-synthesized data for a lite vision-language model. arXiv preprint arXiv:2402.11684, 2024a. Wentong Chen, Junbo Cui, Jinyi Hu, Yujia Qin, Junjie Fang, Yue Zhao, Chongyi Wang, Jun Liu, Guiro...
-
[4]
Pp-OCR: A Practical Ultra Lightweight OCR System
Yuning Du, Chenxia Li, Ruoyu Guo, Xiaoting Yin, Weiwei Liu, Jun Zhou, Yifan Bai, Zilin Yu, Yehua Yang, Qingqing Dang, and Haoshuang Wang. Pp-OCR: A Practical Ultra Lightweight OCR System. arXiv, abs/2009.09941,
-
[5]
ColPali: Efficient Document Retrieval with Vision Language Models
Manuel Faysse, Hugues Sibille, Tony Wu, Gautier Viaud, C´eline Hudelot, and Pierre Colombo. Col- pali: Efficient document retrieval with vision language models.arXiv preprint arXiv:2407.01449,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
Cogagent: A visual language model for gui agents
11 Published as a conference paper at ICLR 2025 Wenyi Hong, Weihan Wang, Qingsong Lv, Jiazheng Xu, Wenmeng Yu, Junhui Ji, Yan Wang, Zihan Wang, Yuxiao Dong, Ming Ding, et al. Cogagent: A visual language model for gui agents. In Proceedings of CVPR, pp. 14281–14290,
work page 2025
-
[7]
mplug-docowl 1.5: Unified structure learning for ocr-free document understanding
Anwen Hu, Haiyang Xu, Jiabo Ye, Ming Yan, Liang Zhang, Bo Zhang, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mplug-docowl 1.5: Unified structure learning for ocr-free document understanding. In Yaser Al-Onaizan, Mohit Bansal, and Yun-Nung Chen (eds.), Findings of EMNLP, pp. 3096–3120, 2024a. Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zh...
-
[8]
What matters when building vision-language models? arXiv preprint arXiv:2405.02246,
Hugo Laurenc ¸on, L´eo Tronchon, Matthieu Cord, and Victor Sanh. What matters when building vision-language models? arXiv preprint arXiv:2405.02246,
-
[9]
NV-Embed: Improved Techniques for Training LLMs as Generalist Embedding Models
Chankyu Lee, Rajarshi Roy, Mengyao Xu, Jonathan Raiman, Mohammad Shoeybi, Bryan Catan- zaro, and Wei Ping. Nv-Embed: Improved Techniques for Training LLMs as Generalist Embed- ding Models. arXiv, abs/2405.17428,
work page internal anchor Pith review Pith/arXiv arXiv
-
[10]
LLaVA-OneVision: Easy Visual Task Transfer
12 Published as a conference paper at ICLR 2025 Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Yanwei Li, Ziwei Liu, and Chunyuan Li. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024a. Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre- training for...
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[11]
Ziyi Lin, Chris Liu, Renrui Zhang, Peng Gao, Longtian Qiu, Han Xiao, Han Qiu, Chen Lin, Wenqi Shao, Keqin Chen, et al. Sphinx: The joint mixing of weights, tasks, and visual embeddings for multi-modal large language models. arXiv preprint arXiv:2311.07575,
-
[12]
Retrieval- augmented multi-modal chain-of-thoughts reasoning for large language models
Bingshuai Liu, Chenyang Lyu, Zijun Min, Zhanyu Wang, Jinsong Su, and Longyue Wang. Retrieval- augmented multi-modal chain-of-thoughts reasoning for large language models. arXiv preprint arXiv:2312.01714, 2023a. Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. InProceed- ings of NeurIPS, volume 36, pp. 34892–34916, 2023b....
-
[13]
Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai
URL https://github.com/jerryjliu/llama_index. Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. Textmonkey: An ocr-free large multimodal model for understanding document. arXiv preprint arXiv:2403.04473, 2024b. Man Luo, Yankai Zeng, Pratyay Banerjee, and Chitta Baral. Weakly-supervised visual-retriever- reader for knowledg...
-
[14]
Unifying multimodal retrieval via document screenshot embedding
Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. arXiv preprint arXiv:2406.11251,
-
[15]
Sgpt: Gpt sentence embeddings for semantic search
Niklas Muennighoff. Sgpt: Gpt sentence embeddings for semantic search. arXiv preprint arXiv:2202.08904,
-
[16]
Mteb: Massive text embed- ding benchmark
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. Mteb: Massive text embed- ding benchmark. In Proceedings of EACL, pp. 2014–2037,
work page 2014
-
[17]
13 Published as a conference paper at ICLR 2025 OpenAI. Hello, gpt-4o — openai,
work page 2025
-
[18]
Wikichat: A few-shot llm-based chatbot grounded with wikipedia
Sina J Semnani, Violet Z Yao, Heidi C Zhang, and Monica S Lam. Wikichat: A few-shot llm-based chatbot grounded with wikipedia. arXiv preprint arXiv:2305.14292,
-
[19]
Unirag: Universal retrieval augmentation for multi-modal large language models
Sahel Sharifymoghaddam, Shivani Upadhyay, Wenhu Chen, and Jimmy Lin. Unirag: Universal retrieval augmentation for multi-modal large language models. arXiv preprint arXiv:2405.10311,
-
[20]
Retrieval meets reasoning: Even high-school textbook knowledge benefits multimodal reasoning
Cheng Tan, Jingxuan Wei, Linzhuang Sun, Zhangyang Gao, Siyuan Li, Bihui Yu, Ruifeng Guo, and Stan Z Li. Retrieval meets reasoning: Even high-school textbook knowledge benefits multimodal reasoning. arXiv preprint arXiv:2405.20834,
-
[21]
Gemma: Open Models Based on Gemini Research and Technology
Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivi `ere, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295,
work page internal anchor Pith review Pith/arXiv arXiv
-
[22]
Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Niko- lay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open founda- tion and fine-tuned chat models. arXiv preprint arXiv:2307.09288,
work page internal anchor Pith review Pith/arXiv arXiv
-
[23]
Freshllms: Refreshing large language models with search engine augmentation
Tu Vu, Mohit Iyyer, Xuezhi Wang, Noah Constant, Jerry Wei, Jason Wei, Chris Tar, Yun-Hsuan Sung, Denny Zhou, Quoc Le, et al. Freshllms: Refreshing large language models with search engine augmentation. arXiv preprint arXiv:2310.03214,
-
[24]
CogVLM: Visual Expert for Pretrained Language Models
14 Published as a conference paper at ICLR 2025 Weihan Wang, Qingsong Lv, Wenmeng Yu, Wenyi Hong, Ji Qi, Yan Wang, Junhui Ji, Zhuoyi Yang, Lei Zhao, Xixuan Song, et al. Cogvlm: Visual expert for pretrained language models. arXiv preprint arXiv:2311.03079,
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[25]
Uniir: Training and benchmarking universal multimodal information retrievers
Cong Wei, Yang Chen, Haonan Chen, Hexiang Hu, Ge Zhang, Jie Fu, Alan Ritter, and Wenhu Chen. Uniir: Training and benchmarking universal multimodal information retrievers. arXiv preprint arXiv:2311.17136,
-
[26]
C-Pack: Packed Resources For General Chinese Embeddings
Shitao Xiao, Zheng Liu, Peitian Zhang, Niklas Muennighoff, Defu Lian, and Jian-Yun Nie. C-pack: Packaged resources to advance general chinese embedding. arXiv preprint arXiv:2309.07597 ,
work page internal anchor Pith review Pith/arXiv arXiv
-
[27]
Chatqa 2: Bridging the gap to proprietary llms in long context and rag capabilities
Peng Xu, Wei Ping, Xianchao Wu, Zihan Liu, Mohammad Shoeybi, and Bryan Catanzaro. Chatqa 2: Bridging the gap to proprietary llms in long context and rag capabilities. arXiv preprint arXiv:2407.14482, 2024a. Ruyi Xu, Yuan Yao, Zonghao Guo, Junbo Cui, Zanlin Ni, Chunjiang Ge, Tat-Seng Chua, Zhiyuan Liu, Maosong Sun, and Gao Huang. Llava-uhd: an lmm perceivi...
-
[28]
An Yang, Baosong Yang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Zhou, Chengpeng Li, Chengyuan Li, Dayiheng Liu, Fei Huang, et al. Qwen2 technical report. arXiv preprint arXiv:2407.10671,
work page internal anchor Pith review Pith/arXiv arXiv
-
[29]
MiniCPM-V: A GPT-4V Level MLLM on Your Phone
Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, Qianyu Chen, Huarong Zhou, Zhensheng Zou, Haoye Zhang, Shengding Hu, Zhi Zheng, Jie Zhou, Jie Cai, Xu Han, Guoyang Zeng, Dahai Li, Zhiyuan Liu, and Maosong Sun. Minicpm-V: A GPT-4v Level MLLM on Your Phone. arXiv, abs/2408.01800,
work page internal anchor Pith review Pith/arXiv arXiv
-
[30]
mplug-docowl: Modularized multimodal large language model for document understanding
Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mplug-docowl: Modularized multimodal large language model for document understanding. arXiv preprint arXiv:2307.02499, 2023a. Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Guohai Xu, Chenliang Li, Junfeng Tian, Qi Qian, Ji Z...
-
[31]
Map-neo: Highly capable and transparent bilingual large language model series
Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, et al. Map-neo: Highly capable and transparent bilingual large language model series. arXiv preprint arXiv:2405.19327,
-
[32]
MARVEL: unlocking the multi-modal capability of dense retrieval via visual module plugin
15 Published as a conference paper at ICLR 2025 Tianshuo Zhou, Sen Mei, Xinze Li, Zhenghao Liu, Chenyan Xiong, Zhiyuan Liu, Yu Gu, and Ge Yu. MARVEL: unlocking the multi-modal capability of dense retrieval via visual module plugin. In Proceedings of ACL, pp. 14608–14624,
work page 2025
-
[33]
Rageval: Scenario specific rag evaluation dataset generation framework
Kunlun Zhu, Yifan Luo, Dingling Xu, Ruobing Wang, Shi Yu, Shuo Wang, Yukun Yan, Zhenghao Liu, Xu Han, Zhiyuan Liu, et al. Rageval: Scenario specific rag evaluation dataset generation framework. arXiv preprint arXiv:2408.01262,
-
[34]
We prompt GPT-4o to generate queries on these docu- ments
16 Published as a conference paper at ICLR 2025 A D ATA CONSTRUCTION DETAILS A.1 S YNTHETIC DATA Table 4: Statistics of crawled documents. We prompt GPT-4o to generate queries on these docu- ments. Name Source Description # Pages Textbooks https://openstax.org/ College-level textbooks including various subjects 10,000 ICML Papers ICML 2023 ICML papers on ...
work page 2025
-
[35]
Although this filtering step reduces context-dependent queries, a small number may still remain
using the instruction shown in Figure 7, which includes human-annotated samples from DocVQA. Although this filtering step reduces context-dependent queries, a small number may still remain. However, their presence is minimal and does not significantly impact the overall quality of our dataset. 17 Published as a conference paper at ICLR 2025 I have some QA...
work page 2025
-
[36]
Methods based on PPOCR demonstrate significantly better performance compared to pytesseract, with adjacent merging and layout preserving yielding similar results. Consequently, we opt to use the adjacent merging policy for our “(OCR)” runs. Table 5: Overall retrieval performance of different document parsing pipelines. ArxivQA ChartQA DocVQA InfoVQA PlotQ...
work page 2019
-
[37]
In this paper, we employ MiniCPM to construct the baseline text-based retriever (Table
and Gemma-7B (Team et al., 2024). In this paper, we employ MiniCPM to construct the baseline text-based retriever (Table
work page 2024
-
[38]
It is built upon SigLIP-400M and Qwen2-7B (Yang et al.,
is an upgrade of MiniCPM-V 2.0 and MiniCPM-Llama3-V 2.5 (Yao et al., 2024). It is built upon SigLIP-400M and Qwen2-7B (Yang et al.,
work page 2024
-
[39]
with a total of 8.5B parameters, exihibiting a significant performance improvement over MiniCPM-Llama3-V 2.5 (Yao et al., 2024). Different from previous models, MiniCPM-V 2.6 can accept multiple images as the input and perform multi-modal in-context learning. It also demonstrates stronger OCR capabilities. We use MiniCPM-V 2.6 to build VisRAG-Gen (Table
work page 2024
-
[40]
2https://huggingface.co/HuggingFaceM4/siglip-so400m-14-980-flash-attn2-navit 20 Published as a conference paper at ICLR 2025 Table 6: Overall retrieval performance in Recall@10. Model # Para. ArxivQAChartQADocVQAInfoVQAPlotQASlideVQAAverage (a) Off-the-shelf Models BM25 (OCR) n.a. 54.29 79.37 86.80 82.59 76.01 91.64 78.45 bge-large (2023) (OCR)335M 48.65 ...
work page 2025
-
[41]
In both instances, we compare VisRAG with TextRAG, maintaining the same setup as described in the “End-to-end Performance” paragraph in Sec. 5.1. In the first case from DocVQA, the user queries about “Club Jetty,” however, the term “Club Jetty” in the relevant document is not successfully extracted due to its decorative font. This leads to TextRAG failing...
work page 2025
-
[42]
to combine the outputs of MiniCPM (OCR) and SigLIP. It is a straightforward method to integrate textual information extracted from the page with its visual clues. The results indicate that fusing text and image modalities provides a meaningful performance boost over in- dividual modality baselines. However, this approach still falls short of the performan...
work page 2024
-
[43]
As shown in the table, although VisRAG-Ret, a VLM-based model, requires more time for document encoding compared to MiniCPM (OCR), it bypasses the time-consuming parsing stage required by 24 Published as a conference paper at ICLR 2025 Table 12: Retrieval efficiency (ms). We report offline latencies per document, including document parsing and encoding la...
work page 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.