pith. machine review for the scientific record.

arxiv: 2604.12812 · v5 · submitted 2026-04-14 · 💻 cs.AI

Recognition: 2 theorem links


DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Hao Yan, Jihao Wu, Minghui Liao, Wei Chen, Xiang Bai, Xingchen Liu, Yuliang Liu, Yuyi Zhang

Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords long document understanding · multimodal large language models · evidence grounding · structured visual reasoning · knowledge distillation · policy optimization · retrieval-augmented generation

The pith

DocSeeker trains multimodal models to localize evidence pages before reasoning on long documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models lose accuracy on long documents because relevant evidence gets buried among many irrelevant pages and because training data provides only final answers without intermediate signals. The paper claims that forcing the model through an explicit Analysis-Localization-Reasoning sequence, taught first by supervised fine-tuning on distilled data and then by policy optimization that scores both correct evidence selection and correct answers, solves both problems. If the approach works, models trained on short documents can handle much longer ones, and the same models integrate directly with retrieval systems that supply candidate pages. The authors also add a resolution-allocation step to keep memory use manageable during training on multi-page inputs. Experiments reportedly show gains on both familiar and new document tasks.

Core claim

DocSeeker requires the model to first localize the specific evidence pages needed for a query and then reason from those pages. This behavior is trained in two stages: an initial supervised stage on knowledge-distilled examples, followed by Evidence-aware Group Relative Policy Optimization, which rewards both localization quality and answer accuracy. Together, the authors claim, these stages overcome the low signal-to-noise ratio and weak supervision that normally degrade performance as document length grows.
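
To make this output contract concrete, here is a minimal parsing sketch. The <think>/<answer> tags and the JSON payload carrying evidence_pages and answer fields are assumptions about the response format, echoed from the failure-case excerpts in the paper's appendix figures rather than taken from a confirmed specification.

    import json
    import re

    # A minimal sketch of parsing one Analysis-Localization-Reasoning (ALR)
    # response. The <think>/<answer> tags and the JSON payload with
    # "evidence_pages" and "answer" fields are assumed conventions, not a
    # confirmed specification of DocSeeker's output format.

    def parse_alr_response(text):
        """Split a response into its reasoning trace and structured answer."""
        think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
        answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
        if answer is None:
            return None  # malformed response: no parsable answer block
        payload = json.loads(answer.group(1))
        return {
            "analysis": think.group(1).strip() if think else "",
            "evidence_pages": payload.get("evidence_pages", []),
            "answer": payload.get("answer", ""),
        }

    demo = ('<think>Locate the section, then count entries per page.</think> '
            '<answer>{"evidence_pages": [3, 4, 5], "answer": "11"}</answer>')
    print(parse_alr_response(demo))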

What carries the argument

Evidence-aware Group Relative Policy Optimization (EGRPO), which jointly scores evidence localization and final-answer correctness inside the required Analysis-Localization-Reasoning workflow.
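
A hedged sketch of how such joint scoring could enter a group-relative update follows. The reward terms (page-set F1 for localization, exact match for the answer) and their equal weighting are illustrative assumptions; the paper's actual EGRPO reward is not specified in the material above.

    # A hedged sketch of an evidence-aware group-relative training signal.
    # The reward terms (page-set F1 for localization, exact match for the
    # answer) and their equal weighting are illustrative assumptions;
    # DocSeeker's actual EGRPO reward function is not detailed here.

    def page_f1(pred_pages, gold_pages):
        """F1 overlap between predicted and gold evidence page sets."""
        pred, gold = set(pred_pages), set(gold_pages)
        if not pred or not gold:
            return 0.0
        hit = len(pred & gold)
        precision, recall = hit / len(pred), hit / len(gold)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def egrpo_advantages(rollouts, gold_pages, gold_answer):
        """Score each sampled rollout on localization quality and answer
        correctness, then normalize rewards within the group (GRPO-style)."""
        rewards = [
            page_f1(r["evidence_pages"], gold_pages)
            + float(r["answer"] == gold_answer)
            for r in rollouts
        ]
        mean = sum(rewards) / len(rewards)
        std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5
        return [(x - mean) / (std + 1e-6) for x in rewards]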

If this is right

  • Performance stays high on both in-domain and out-of-domain long-document tasks.
  • Models trained only on short pages generalize to ultra-long documents.
  • The same model works naturally as a component inside visual retrieval-augmented generation pipelines.
  • Evidence-Guided Resolution Allocation keeps memory costs manageable when training on many pages at once (a budgeting sketch follows this list).
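
On the last point, the sketch below shows one way a fixed visual-token budget could be split in favor of evidence pages. The per-page costs and the halve-the-largest-page degradation rule are illustrative assumptions, not the paper's stated Evidence-Guided Resolution Allocation procedure.

    # A minimal sketch of evidence-guided resolution allocation under a
    # fixed visual-token budget. The per-page costs and the degradation
    # rule (halve the most expensive page, preferring non-evidence pages)
    # are illustrative assumptions, not the paper's actual strategy.

    FULL_TOKENS = 1024   # hypothetical cost of a page at full resolution
    THUMB_TOKENS = 64    # hypothetical cost of a downscaled page
    FLOOR_TOKENS = 16    # hypothetical minimum cost per page

    def allocate_resolution(num_pages, evidence_pages, budget):
        """Return a per-page token allocation that respects the budget."""
        evidence = set(evidence_pages)
        alloc = {p: FULL_TOKENS if p in evidence else THUMB_TOKENS
                 for p in range(num_pages)}
        while sum(alloc.values()) > budget:
            # Degrade non-evidence pages first; touch evidence pages only
            # when nothing else can be reduced further.
            candidates = [p for p in alloc
                          if p not in evidence and alloc[p] > FLOOR_TOKENS]
            if not candidates:
                candidates = [p for p in alloc if alloc[p] > FLOOR_TOKENS]
            if not candidates:
                break  # every page is at the floor; budget unreachable
            victim = max(candidates, key=alloc.get)
            alloc[victim] = max(alloc[victim] // 2, FLOOR_TOKENS)
        return alloc

    # e.g. a 40-page document with evidence on pages 3 and 17 and a
    # budget of 4096 visual tokens
    plan = allocate_resolution(40, [3, 17], 4096)
    print(sum(plan.values()), plan[3], plan[17])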

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may let future systems avoid ever-larger context windows by retrieving and localizing evidence on demand instead.
  • Similar evidence-localization training could apply to other long sequential inputs such as video or multi-turn conversations.
  • Making the selected evidence explicit may improve user trust and allow easier debugging of incorrect answers.
  • The approach could be combined with stronger base models to further raise the ceiling on long-document performance.

Load-bearing premise

That high-quality structured reasoning traces can be created by knowledge distillation, and that the subsequent policy optimization will improve both localization and accuracy without introducing new failure modes or requiring extensive tuning.

What would settle it

A collection of ultra-long documents in which the correct answer depends on evidence pages that the trained model systematically fails to select, producing no improvement over a baseline that receives the same pages without localization training.
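
Stated as a procedure, that test could look roughly like the sketch below. The model and baseline objects, their answer() method, the dataset tuple layout, and answer_metric are hypothetical scaffolding around the criterion, not an interface from the paper.

    # A hedged sketch of the falsification test described above. The
    # model/baseline objects, their answer() method, the dataset tuple
    # layout, and answer_metric are hypothetical scaffolding, not an API
    # from the paper.

    def falsification_test(model, baseline, dataset, answer_metric):
        """On documents where the trained model's selected pages miss the
        gold evidence entirely, compare it against a baseline that gets
        the same input without localization training."""
        missed = 0
        model_score = 0.0
        baseline_score = 0.0
        for document, query, gold_pages, gold_answer in dataset:
            out = model.answer(document, query)
            if set(out["evidence_pages"]) & set(gold_pages):
                continue  # localization succeeded; not a failure case
            missed += 1
            model_score += answer_metric(out["answer"], gold_answer)
            base = baseline.answer(document, query)
            baseline_score += answer_metric(base["answer"], gold_answer)
        if missed == 0:
            return None  # no systematic localization failures observed
        return {"cases": missed,
                "model_acc": model_score / missed,
                "baseline_acc": baseline_score / missed}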

Figures

Figures reproduced from arXiv: 2604.12812 by Hao Yan, Jihao Wu, Minghui Liao, Wei Chen, Xiang Bai, Xingchen Liu, Yuliang Liu, Yuyi Zhang.

Figure 1. Overview of Principal Experimental Results in Long Document Understanding (Left) and the ALR Reasoning Paradigm …
Figure 2. Overview of our training framework. (Left) Stage I SFT, where the model is supervised fine-tuned on high-quality ALR CoT …
Figure 4. Performance comparison on MMLongBench-doc …
Figure 5. Data Distillation Pipeline.
Figure 6. ALR CoT Data Distillation Prompt Template.
Figure 7. Vanilla CoT Data Distillation Prompt Template.
Figure 8. LLM-based Judge Prompt Template in Secondary Verification.
Figure 9. Failure cases in ALR CoT Data Distillation.
Figure 10. Qualitative Comparison of Reasoning Paradigms.
Figure 11. Performance Comparison across Different Training Stages.
read the original abstract

Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured Analysis, Localization and Reasoning workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DocSeeker, a framework for long document understanding in multimodal large language models (MLLMs). It proposes a structured Analysis, Localization, and Reasoning workflow trained in two stages: supervised fine-tuning on knowledge-distilled high-quality data, followed by Evidence-aware Group Relative Policy Optimization (GRPO) to jointly optimize evidence localization and answer accuracy. An Evidence-Guided Resolution Allocation strategy is introduced to address memory issues with multi-page documents. The authors claim that DocSeeker achieves superior performance on in-domain and out-of-domain tasks, robustly generalizes from short-page training to ultra-long documents, and is synergistic with visual Retrieval-Augmented Generation systems.

Significance. If the empirical results hold, this work has the potential to significantly advance the field by providing a structured approach to mitigate performance degradation in long-document MLLM tasks due to low signal-to-noise ratio and weak supervision. The two-stage training with GRPO for joint optimization of localization and reasoning is a novel contribution, and the synergy with RAG systems could have broad applications in document AI. The use of knowledge distillation for data generation is efficient and practical.

major comments (1)
  1. [§4.2] Evidence-aware Group Relative Policy Optimization: The central claim of superior performance and robust short-to-ultra-long generalization depends on the two-stage pipeline, specifically that SFT on distilled data followed by Evidence-aware GRPO will simultaneously improve evidence localization and final-answer accuracy. The manuscript supplies no ablations, reward-function details, or failure-case analysis showing that the group-relative formulation avoids reward hacking, over-penalization of valid long-context reasoning, or distribution shift between the distilled teacher and the student policy.
minor comments (2)
  1. [Abstract] The abstract asserts superior performance and generalization but supplies no quantitative metrics, baselines, ablation results, or experimental details; these should be summarized with specific numbers and comparisons to allow immediate assessment of the claims.
  2. [Throughout] Ensure all acronyms (MLLM, SNR, GRPO, RAG) are defined on first use and used consistently in the method and experiments sections.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the potential impact of DocSeeker and for their detailed feedback. We address the major comment point-by-point below.

read point-by-point responses
  1. Referee: [§4.2] Evidence-aware Group Relative Policy Optimization: The central claim of superior performance and robust short-to-ultra-long generalization depends on the two-stage pipeline, specifically that SFT on distilled data followed by Evidence-aware GRPO will simultaneously improve evidence localization and final-answer accuracy. The manuscript supplies no ablations, reward-function details, or failure-case analysis showing that the group-relative formulation avoids reward hacking, over-penalization of valid long-context reasoning, or distribution shift between the distilled teacher and the student policy.

    Authors: We agree that the manuscript would benefit from more detailed analysis of the GRPO component to fully substantiate the claims. In the revised version, we will expand §4.2 with: (1) ablations comparing the full two-stage pipeline against SFT-only and GRPO-only variants, reporting both localization and accuracy metrics; (2) the complete specification of the reward function and the group-relative advantage estimation; and (3) a dedicated subsection on failure cases and robustness checks, including discussion of potential reward hacking, over-penalization, and distribution shift. These additions will provide empirical evidence that the Evidence-aware GRPO improves upon the SFT baseline without introducing the issues raised. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental outcomes, not derivations

full rationale

The paper describes an empirical two-stage training pipeline (SFT on distilled data followed by Evidence-aware GRPO) and reports performance gains on in-domain/out-of-domain tasks plus short-to-long generalization. No mathematical derivations, equations, fitted parameters, or predictions are presented that could reduce to their own inputs by construction. Claims are framed as outcomes of the proposed workflow rather than first-principles results or self-referential definitions. No self-citation load-bearing steps or ansatz smuggling appear in the abstract or described methodology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the unverified effectiveness of knowledge distillation for creating high-quality supervision signals and on the assumption that the proposed RL objective can simultaneously improve localization and accuracy. No free parameters, axioms, or invented entities are explicitly quantified in the abstract.

axioms (2)
  • domain assumption: Knowledge distillation from an existing model can produce high-quality step-by-step reasoning data for long documents.
    Invoked to justify the first-stage supervised fine-tuning data generation.
  • domain assumption: Joint optimization of evidence localization and answer accuracy via group relative policy optimization is feasible and beneficial.
    Central to the second training stage.

pith-pipeline@v0.9.0 · 5529 in / 1375 out tokens · 53331 ms · 2026-05-12T04:14:01.548751+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 15 internal anchors

  1. [1]

    Hierarchical multimodal transformers for multipage DocVQA

    Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multipage DocVQA. Pattern Recognition, 144:109834, 2023.

  2. [2]

    Document understanding dataset and evaluation (DUDE)

    Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, et al. Document understanding dataset and evaluation (DUDE). In IEEE/CVF International Conference on Computer Vision, pages 19471–19483, 2023.

  3. [3]

    Large language models in document intelligence: A comprehensive survey, recent advances, challenges and future trends

    Wenjun Ke, Yifan Zheng, Yining Li, Hengyuan Xu, Dong Nie, Peng Wang, and Yao He. Large language models in document intelligence: A comprehensive survey, recent advances, challenges and future trends. ACM Transactions on Information Systems, 2025.

  4. [4]

    MinerU: An open-source solution for precise document content extraction

    Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv:2409.18839, 2024.

  5. [5]

    Dolphin: Document image parsing via heterogeneous anchor prompting

    Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. arXiv:2505.14059, 2025.

  6. [6]

    MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm

    Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm. arXiv:2506.05218, 2025.

  7. [7]

    A brief analysis of ChatGPT: historical evolution, current applications and future prospects

    YL Liu, HL Li, X Bai, et al. A brief analysis of ChatGPT: historical evolution, current applications and future prospects. Journal of Image and Graphics, 28(04):893–902, 2023.

  8. [8]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv:2502.13923, 2025.

  9. [9]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv:2504.10479, 2025.

  10. [10]

    mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding

    Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding. In Annual Meeting of the Association for Computational Linguistics, pages 5817–5834.

  11. [11]

    OCR-free document understanding transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.

  12. [12]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.

  13. [13]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding.

  14. [14]

    TextMonkey: An OCR-free large multimodal model for understanding document

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. TextMonkey: An OCR-free large multimodal model for understanding document. arXiv:2403.04473, 2024.

  15. [15]

    Pix2struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.

  16. [16]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv:2408.01800, 2024.

  17. [17]

    VisRAG: Vision-based retrieval-augmented generation on multi-modality documents

    Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv:2410.10594, 2024.

  18. [18]

    ColPali: Efficient document retrieval with vision language models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. arXiv:2407.01449, 2024.

  19. [19]

    mPLUG-DocOwl: Modularized multimodal large language model for document understanding

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mPLUG-DocOwl: Modularized multimodal large language model for document understanding. arXiv:2307.02499, 2023.

  20. [20]

    PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

    Xudong Xie, Hao Yan, Liang Yin, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, and Xiang Bai. PDF-WuKong: A large multimodal model for efficient long PDF reading with end-to-end sparse sampling. arXiv:2410.05970, 2024.

  21. [21]

    SlideVQA: A dataset for document visual question answering on multiple images

    Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. SlideVQA: A dataset for document visual question answering on multiple images. In AAAI Conference on Artificial Intelligence, volume 37, pages 13636–13645, 2023.

  22. [22]

    MultimodalArXiv: A dataset for improving scientific comprehension of large vision-language models

    Linjie Li, Yuxuan Wang, Rui Xu, Peiyi Wang, Xinyun Feng, Lingpeng Kong, and Qun Liu. MultimodalArXiv: A dataset for improving scientific comprehension of large vision-language models. In Annual Meeting of the Association for Computational Linguistics, pages 14369–14387, 2024.

  23. [23]

    MVQA: A dataset for multimodal information retrieval in PDF-based visual question answering

    Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, and Soyeon Caren Han. MVQA: A dataset for multimodal information retrieval in PDF-based visual question answering. arXiv:2404.12720, 2024.

  24. [24]

    MMLongBench-Doc: Benchmarking long-context document understanding with visualizations

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. MMLongBench-Doc: Benchmarking long-context document understanding with visualizations. In Advances in Neural Information Processing Systems, volume 37, pages 95963–96010, 2024.

  25. [25]

    LongDocURL: A comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating

    Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, et al. LongDocURL: A comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating. In Annual Meeting of the Association for Computational Linguistics, pages 1135–1159, 2025.

  26. [26]

    MMDocIR: Benchmarking multimodal retrieval for long documents

    Kuicai Dong, Yujing Chang, Derrick Goh Xin Deik, Dexun Li, Ruiming Tang, and Yong Liu. MMDocIR: Benchmarking multimodal retrieval for long documents. In Conference on Empirical Methods in Natural Language Processing, pages 30971–31005, 2025.

  27. [27]

    M-LongDoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework

    Yew Ken Chia, Liying Cheng, Hou Pong Chan, Maojia Song, Chaoqun Liu, Mahani Aljunied, Soujanya Poria, and Lidong Bing. M-LongDoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework. In Conference on Empirical Methods in Natural Language Processing, pages 9233–9250, 2025.

  28. [28]

    DocBench: A benchmark for evaluating LLM-based document reading systems

    Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, and Dong Yu. DocBench: A benchmark for evaluating LLM-based document reading systems. In International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 359–373, 2025.

  29. [29]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020.

  30. [30]

    A simple yet effective layout token in large language models for document understanding

    Zhaoqing Zhu, Chuwei Luo, Zirui Shao, Feiyu Gao, Hangdi Xing, Qi Zheng, and Ji Zhang. A simple yet effective layout token in large language models for document understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14472–14482, 2025.

  31. [31]

    Unifying multimodal retrieval via document screenshot embedding

    Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. In Conference on Empirical Methods in Natural Language Processing, pages 6492–6505, 2024.

  32. [32]

    SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding

    Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A Rossi, Changyou Chen, and Tong Sun. SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding. arXiv:2411.01106, 2024.

  33. [33]

    VDocRAG: Retrieval-augmented generation over visually-rich documents

    Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. VDocRAG: Retrieval-augmented generation over visually-rich documents. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24827–24837, 2025.

  34. [34]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.

  35. [35]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.

  36. [36]

    Human reasoning and cognitive science

    Keith Stenning and Michiel Van Lambalgen. Human reasoning and cognitive science. MIT Press, 2008.

  37. [37]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv:2412.16720, 2024.

  38. [38]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025.

  39. [39]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025.

  40. [40]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv:2410.21276, 2024.

  41. [41]

    A normalized Levenshtein distance metric

    Li Yujian and Liu Bo. A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1091–1095, 2007.

  42. [42]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv:2503.07365, 2025.

  43. [43]

    Cream: Coarse-to-fine retrieval and multi-modal efficient tuning for document VQA

    Jinxu Zhang, Yongqi Yu, and Yu Zhang. Cream: Coarse-to-fine retrieval and multi-modal efficient tuning for document VQA. In ACM International Conference on Multimedia, pages 925–934, 2024.

  44. [44]

    M3DocRAG: Multi-modal retrieval is what you need for multi-page multi-document understanding

    Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. M3DocRAG: Multi-modal retrieval is what you need for multi-page multi-document understanding. arXiv:2411.04952, 2024.

  45. [45]

    Docopilot: Improving multimodal models for document-level understanding

    Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, et al. Docopilot: Improving multimodal models for document-level understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4026–4037, 2025.

  46. [46]

    DocVLM: Make your VLM an efficient reader

    Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, and Ron Litman. DocVLM: Make your VLM an efficient reader. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29005–29015, 2025.

  47. [47]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023.

  48. [48]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.

  49. [49]

    C-Pack: Packed Resources For General Chinese Embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-Pack: Packed resources for general Chinese embeddings. arXiv:2309.07597, 2023.

  50. [50]

    DocLLM: A layout-aware generative language model for multimodal document understanding

    Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. DocLLM: A layout-aware generative language model for multimodal document understanding. arXiv:2401.00908, 2023.
