pith. machine review for the scientific record.

arxiv: 2604.12812 · v5 · submitted 2026-04-14 · 💻 cs.AI

Recognition: 2 theorem links


DocSeeker: Structured Visual Reasoning with Evidence Grounding for Long Document Understanding

Hao Yan, Jihao Wu, Minghui Liao, Wei Chen, Xiang Bai, Xingchen Liu, Yuliang Liu, Yuyi Zhang

Pith reviewed 2026-05-12 04:14 UTC · model grok-4.3

classification 💻 cs.AI
keywords long document understanding · multimodal large language models · evidence grounding · structured visual reasoning · knowledge distillation · policy optimization · retrieval-augmented generation

The pith

DocSeeker trains multimodal models to localize evidence pages before reasoning on long documents.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Multimodal large language models lose accuracy on long documents because relevant evidence gets buried among many irrelevant pages and because training data provides only final answers without intermediate signals. The paper claims that forcing the model through an explicit Analysis-Localization-Reasoning sequence, taught first by supervised fine-tuning on distilled data and then by policy optimization that scores both correct evidence selection and correct answers, solves both problems. If the approach works, models trained on short documents can handle much longer ones, and the same models integrate directly with retrieval systems that supply candidate pages. The authors also add a resolution-allocation step to keep memory use manageable during training on multi-page inputs. Experiments reportedly show gains on both familiar and new document tasks.

Core claim

DocSeeker requires the model to first localize the specific evidence pages needed for a query and then reason from those pages. This behavior is trained in two stages: an initial supervised stage on knowledge-distilled examples, followed by Evidence-aware Group Relative Policy Optimization, which rewards both localization quality and answer accuracy. Together, the authors claim, these stages overcome the low signal-to-noise ratio and weak supervision that normally degrade performance as document length grows.
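
To make this output contract concrete, here is a minimal parsing sketch. The <think>/<answer> tags and the JSON payload carrying evidence_pages and answer fields are assumptions about the response format, echoed from the failure-case excerpts in the paper's appendix figures rather than taken from a confirmed specification.

    import json
    import re

    # A minimal sketch of parsing one Analysis-Localization-Reasoning (ALR)
    # response. The <think>/<answer> tags and the JSON payload with
    # "evidence_pages" and "answer" fields are assumed conventions, not a
    # confirmed specification of DocSeeker's output format.

    def parse_alr_response(text):
        """Split a response into its reasoning trace and structured answer."""
        think = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
        answer = re.search(r"<answer>(.*?)</answer>", text, re.DOTALL)
        if answer is None:
            return None  # malformed response: no parsable answer block
        payload = json.loads(answer.group(1))
        return {
            "analysis": think.group(1).strip() if think else "",
            "evidence_pages": payload.get("evidence_pages", []),
            "answer": payload.get("answer", ""),
        }

    demo = ('<think>Locate the section, then count entries per page.</think> '
            '<answer>{"evidence_pages": [3, 4, 5], "answer": "11"}</answer>')
    print(parse_alr_response(demo))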

What carries the argument

Evidence-aware Group Relative Policy Optimization (EGRPO), which jointly scores evidence localization and final-answer correctness inside the required Analysis-Localization-Reasoning workflow.
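
A hedged sketch of how such joint scoring could enter a group-relative update follows. The reward terms (page-set F1 for localization, exact match for the answer) and their equal weighting are illustrative assumptions; the paper's actual EGRPO reward is not specified in the material above.

    # A hedged sketch of an evidence-aware group-relative training signal.
    # The reward terms (page-set F1 for localization, exact match for the
    # answer) and their equal weighting are illustrative assumptions;
    # DocSeeker's actual EGRPO reward function is not detailed here.

    def page_f1(pred_pages, gold_pages):
        """F1 overlap between predicted and gold evidence page sets."""
        pred, gold = set(pred_pages), set(gold_pages)
        if not pred or not gold:
            return 0.0
        hit = len(pred & gold)
        precision, recall = hit / len(pred), hit / len(gold)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)

    def egrpo_advantages(rollouts, gold_pages, gold_answer):
        """Score each sampled rollout on localization quality and answer
        correctness, then normalize rewards within the group (GRPO-style)."""
        rewards = [
            page_f1(r["evidence_pages"], gold_pages)
            + float(r["answer"] == gold_answer)
            for r in rollouts
        ]
        mean = sum(rewards) / len(rewards)
        std = (sum((x - mean) ** 2 for x in rewards) / len(rewards)) ** 0.5
        return [(x - mean) / (std + 1e-6) for x in rewards]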

If this is right

  • Performance stays high on both in-domain and out-of-domain long-document tasks.
  • Models trained only on short pages generalize to ultra-long documents.
  • The same model works naturally as a component inside visual retrieval-augmented generation pipelines.
  • Evidence-Guided Resolution Allocation keeps memory costs manageable when training on many pages at once (a budgeting sketch follows this list).
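
On the last point, the sketch below shows one way a fixed visual-token budget could be split in favor of evidence pages. The per-page costs and the halve-the-largest-page degradation rule are illustrative assumptions, not the paper's stated Evidence-Guided Resolution Allocation procedure.

    # A minimal sketch of evidence-guided resolution allocation under a
    # fixed visual-token budget. The per-page costs and the degradation
    # rule (halve the most expensive page, preferring non-evidence pages)
    # are illustrative assumptions, not the paper's actual strategy.

    FULL_TOKENS = 1024   # hypothetical cost of a page at full resolution
    THUMB_TOKENS = 64    # hypothetical cost of a downscaled page
    FLOOR_TOKENS = 16    # hypothetical minimum cost per page

    def allocate_resolution(num_pages, evidence_pages, budget):
        """Return a per-page token allocation that respects the budget."""
        evidence = set(evidence_pages)
        alloc = {p: FULL_TOKENS if p in evidence else THUMB_TOKENS
                 for p in range(num_pages)}
        while sum(alloc.values()) > budget:
            # Degrade non-evidence pages first; touch evidence pages only
            # when nothing else can be reduced further.
            candidates = [p for p in alloc
                          if p not in evidence and alloc[p] > FLOOR_TOKENS]
            if not candidates:
                candidates = [p for p in alloc if alloc[p] > FLOOR_TOKENS]
            if not candidates:
                break  # every page is at the floor; budget unreachable
            victim = max(candidates, key=alloc.get)
            alloc[victim] = max(alloc[victim] // 2, FLOOR_TOKENS)
        return alloc

    # e.g. a 40-page document with evidence on pages 3 and 17 and a
    # budget of 4096 visual tokens
    plan = allocate_resolution(40, [3, 17], 4096)
    print(sum(plan.values()), plan[3], plan[17])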

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The method may let future systems avoid ever-larger context windows by retrieving and localizing evidence on demand instead.
  • Similar evidence-localization training could apply to other long sequential inputs such as video or multi-turn conversations.
  • Making the selected evidence explicit may improve user trust and allow easier debugging of incorrect answers.
  • The approach could be combined with stronger base models to further raise the ceiling on long-document performance.

Load-bearing premise

That high-quality structured reasoning traces can be created by knowledge distillation, and that the subsequent policy optimization will improve both localization and accuracy without introducing new failure modes or requiring extensive tuning.

What would settle it

A collection of ultra-long documents in which the correct answer depends on evidence pages that the trained model systematically fails to select, producing no improvement over a baseline that receives the same pages without localization training.
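
Stated as a procedure, that test could look roughly like the sketch below. The model and baseline objects, their answer() method, the dataset tuple layout, and answer_metric are hypothetical scaffolding around the criterion, not an interface from the paper.

    # A hedged sketch of the falsification test described above. The
    # model/baseline objects, their answer() method, the dataset tuple
    # layout, and answer_metric are hypothetical scaffolding, not an API
    # from the paper.

    def falsification_test(model, baseline, dataset, answer_metric):
        """On documents where the trained model's selected pages miss the
        gold evidence entirely, compare it against a baseline that gets
        the same input without localization training."""
        missed = 0
        model_score = 0.0
        baseline_score = 0.0
        for document, query, gold_pages, gold_answer in dataset:
            out = model.answer(document, query)
            if set(out["evidence_pages"]) & set(gold_pages):
                continue  # localization succeeded; not a failure case
            missed += 1
            model_score += answer_metric(out["answer"], gold_answer)
            base = baseline.answer(document, query)
            baseline_score += answer_metric(base["answer"], gold_answer)
        if missed == 0:
            return None  # no systematic localization failures observed
        return {"cases": missed,
                "model_acc": model_score / missed,
                "baseline_acc": baseline_score / missed}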

Figures

Figures reproduced from arXiv: 2604.12812 by Hao Yan, Jihao Wu, Minghui Liao, Wei Chen, Xiang Bai, Xingchen Liu, Yuliang Liu, Yuyi Zhang.

Figure 1. Overview of Principal Experimental Results in Long Document Understanding (Left) and the ALR Reasoning Paradigm …
Figure 2. Overview of our training framework. (Left) Stage I SFT, where the model is supervised fine-tuned on high-quality ALR CoT …
Figure 4. Performance comparison on MMLongBench-doc …
Figure 5. Data Distillation Pipeline.
Figure 6. ALR CoT Data Distillation Prompt Template.
Figure 7. Vanilla CoT Data Distillation Prompt Template.
Figure 8. LLM-based Judge Prompt Template in Secondary Verification.
Figure 9. Failure cases in ALR CoT Data Distillation.
Figure 10. Qualitative Comparison of Reasoning Paradigms.
Figure 11. Performance Comparison across Different Training Stages.
read the original abstract

Existing Multimodal Large Language Models (MLLMs) suffer from significant performance degradation on the long document understanding task as document length increases. This stems from two fundamental challenges: 1) a low Signal-to-Noise Ratio (SNR), with crucial evidence buried in irrelevant pages; and 2) supervision scarcity, as datasets offering only final short answers provide a weak learning signal. In this paper, we address these challenges by proposing a paradigm that requires the model to execute a structured Analysis, Localization and Reasoning workflow. To instill this capability, we design a two-stage training framework: we first perform Supervised Fine-Tuning on high-quality data generated via an efficient knowledge distillation strategy. Subsequently, we employ an Evidence-aware Group Relative Policy Optimization which jointly optimizes for both evidence localization and answer accuracy. Additionally, we introduce an Evidence-Guided Resolution Allocation strategy to mitigate memory constraints of training on multi-page documents. Extensive experiments demonstrate that DocSeeker achieves superior performance on both in-domain and out-of-domain tasks. We show it robustly generalizes from short-page training to ultra-long documents and is naturally synergistic with visual Retrieval-Augmented Generation systems, serving as a solid foundation for their implementation.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity check, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces DocSeeker, a framework for long document understanding in multimodal large language models (MLLMs). It proposes a structured Analysis, Localization, and Reasoning workflow trained in two stages: supervised fine-tuning on knowledge-distilled high-quality data, followed by Evidence-aware Group Relative Policy Optimization (GRPO) to jointly optimize evidence localization and answer accuracy. An Evidence-Guided Resolution Allocation strategy is introduced to address memory issues with multi-page documents. The authors claim that DocSeeker achieves superior performance on in-domain and out-of-domain tasks, robustly generalizes from short-page training to ultra-long documents, and is synergistic with visual Retrieval-Augmented Generation systems.

Significance. If the empirical results hold, this work has the potential to significantly advance the field by providing a structured approach to mitigate performance degradation in long-document MLLM tasks due to low signal-to-noise ratio and weak supervision. The two-stage training with GRPO for joint optimization of localization and reasoning is a novel contribution, and the synergy with RAG systems could have broad applications in document AI. The use of knowledge distillation for data generation is efficient and practical.

major comments (1)
  1. [§4.2] Evidence-aware Group Relative Policy Optimization: The central claim of superior performance and robust short-to-ultra-long generalization depends on the two-stage pipeline, specifically that SFT on distilled data followed by Evidence-aware GRPO will simultaneously improve evidence localization and final-answer accuracy. The manuscript supplies no ablations, reward-function details, or failure-case analysis showing that the group-relative formulation avoids reward hacking, over-penalization of valid long-context reasoning, or distribution shift between the distilled teacher and the student policy.
minor comments (2)
  1. [Abstract] The abstract asserts superior performance and generalization but supplies no quantitative metrics, baselines, ablation results, or experimental details; these should be summarized with specific numbers and comparisons to allow immediate assessment of the claims.
  2. [Throughout] Ensure all acronyms (MLLM, SNR, GRPO, RAG) are defined on first use and used consistently in the method and experiments sections.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their positive assessment of the potential impact of DocSeeker and for their detailed feedback. We address the major comment point-by-point below.

read point-by-point responses
  1. Referee: [§4.2] Evidence-aware Group Relative Policy Optimization: The central claim of superior performance and robust short-to-ultra-long generalization depends on the two-stage pipeline, specifically that SFT on distilled data followed by Evidence-aware GRPO will simultaneously improve evidence localization and final-answer accuracy. The manuscript supplies no ablations, reward-function details, or failure-case analysis showing that the group-relative formulation avoids reward hacking, over-penalization of valid long-context reasoning, or distribution shift between the distilled teacher and the student policy.

    Authors: We agree that the manuscript would benefit from more detailed analysis of the GRPO component to fully substantiate the claims. In the revised version, we will expand §4.2 with: (1) ablations comparing the full two-stage pipeline against SFT-only and GRPO-only variants, reporting both localization and accuracy metrics; (2) the complete specification of the reward function and the group-relative advantage estimation; and (3) a dedicated subsection on failure cases and robustness checks, including discussion of potential reward hacking, over-penalization, and distribution shift. These additions will provide empirical evidence that the Evidence-aware GRPO improves upon the SFT baseline without introducing the issues raised. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on experimental outcomes, not derivations

full rationale

The paper describes an empirical two-stage training pipeline (SFT on distilled data followed by Evidence-aware GRPO) and reports performance gains on in-domain/out-of-domain tasks plus short-to-long generalization. No mathematical derivations, equations, fitted parameters, or predictions are presented that could reduce to their own inputs by construction. Claims are framed as outcomes of the proposed workflow rather than first-principles results or self-referential definitions. No self-citation load-bearing steps or ansatz smuggling appear in the abstract or described methodology.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claims rest on the unverified effectiveness of knowledge distillation for creating high-quality supervision signals and on the assumption that the proposed RL objective can simultaneously improve localization and accuracy. No free parameters, axioms, or invented entities are explicitly quantified in the abstract.

axioms (2)
  • domain assumption: Knowledge distillation from an existing model can produce high-quality step-by-step reasoning data for long documents.
    Invoked to justify the first-stage supervised fine-tuning data generation.
  • domain assumption: Joint optimization of evidence localization and answer accuracy via group relative policy optimization is feasible and beneficial.
    Central to the second training stage.

pith-pipeline@v0.9.0 · 5529 in / 1375 out tokens · 53331 ms · 2026-05-12T04:14:01.548751+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 15 internal anchors

  1. [1]

    Hierarchical multimodal transformers for multipage DocVQA

    Rubèn Tito, Dimosthenis Karatzas, and Ernest Valveny. Hierarchical multimodal transformers for multipage DocVQA. Pattern Recognition, 144:109834, 2023.

  2. [2]

    Document understanding dataset and evaluation (DUDE)

    Jordy Van Landeghem, Rubèn Tito, Łukasz Borchmann, Michał Pietruszka, Pawel Joziak, Rafal Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Anckaert, Ernest Valveny, et al. Document understanding dataset and evaluation (DUDE). In IEEE/CVF International Conference on Computer Vision, pages 19471–19483, 2023.

  3. [3]

    Large language models in document intelligence: A comprehensive survey, recent advances, challenges and future trends

    Wenjun Ke, Yifan Zheng, Yining Li, Hengyuan Xu, Dong Nie, Peng Wang, and Yao He. Large language models in document intelligence: A comprehensive survey, recent advances, challenges and future trends. ACM Transactions on Information Systems, 2025.

  4. [4]

    MinerU: An open-source solution for precise document content extraction

    Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv:2409.18839, 2024.

  5. [5]

    Dolphin: Document image parsing via heterogeneous anchor prompting

    Hao Feng, Shu Wei, Xiang Fei, Wei Shi, Yingdong Han, Lei Liao, Jinghui Lu, Binghong Wu, Qi Liu, Chunhui Lin, et al. Dolphin: Document image parsing via heterogeneous anchor prompting. arXiv:2505.14059, 2025.

  6. [6]

    MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm

    Zhang Li, Yuliang Liu, Qiang Liu, Zhiyin Ma, Ziyang Zhang, Shuo Zhang, Zidun Guo, Jiarui Zhang, Xinyu Wang, and Xiang Bai. MonkeyOCR: Document parsing with a structure-recognition-relation triplet paradigm. arXiv:2506.05218, 2025.

  7. [7]

    A brief analysis of ChatGPT: historical evolution, current applications and future prospects

    YL Liu, HL Li, X Bai, et al. A brief analysis of ChatGPT: historical evolution, current applications and future prospects. Journal of Image and Graphics, 28(04):893–902, 2023.

  8. [8]

    Qwen2.5-VL Technical Report

    Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-VL technical report. arXiv:2502.13923, 2025.

  9. [9]

    InternVL3: Exploring Advanced Training and Test-Time Recipes for Open-Source Multimodal Models

    Jinguo Zhu, Weiyun Wang, Zhe Chen, Zhaoyang Liu, Shenglong Ye, Lixin Gu, Hao Tian, Yuchen Duan, Weijie Su, Jie Shao, et al. InternVL3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv:2504.10479, 2025.

  10. [10]

    mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding

    Anwen Hu, Haiyang Xu, Liang Zhang, Jiabo Ye, Ming Yan, Ji Zhang, Qin Jin, Fei Huang, and Jingren Zhou. mPLUG-DocOwl2: High-resolution compressing for OCR-free multi-page document understanding. In Annual Meeting of the Association for Computational Linguistics, pages 5817–5834.

  11. [11]

    OCR-free document understanding transformer

    Geewook Kim, Teakgyu Hong, Moonbin Yim, JeongYeon Nam, Jinyoung Park, Jinyeong Yim, Wonseok Hwang, Sangdoo Yun, Dongyoon Han, and Seunghyun Park. OCR-free document understanding transformer. In European Conference on Computer Vision, pages 498–517. Springer, 2022.

  12. [12]

    GLM-4.5V and GLM-4.1V-Thinking: Towards Versatile Multimodal Reasoning with Scalable Reinforcement Learning

    GLM-V Team, Wenyi Hong, Wenmeng Yu, Xiaotao Gu, et al. GLM-4.5V and GLM-4.1V-Thinking: Towards versatile multimodal reasoning with scalable reinforcement learning.

  13. [13]

    DeepSeek-VL2: Mixture-of-Experts Vision-Language Models for Advanced Multimodal Understanding

    Zhiyu Wu, Xiaokang Chen, Zizheng Pan, Xingchao Liu, Wen Liu, Damai Dai, Huazuo Gao, Yiyang Ma, Chengyue Wu, Bingxuan Wang, Zhenda Xie, Yu Wu, Kai Hu, Jiawei Wang, Yaofeng Sun, Yukun Li, Yishi Piao, Kang Guan, Aixin Liu, Xin Xie, Yuxiang You, Kai Dong, Xingkai Yu, Haowei Zhang, Liang Zhao, Yisong Wang, and Chong Ruan. DeepSeek-VL2: Mixture-of-experts vision-language models for advanced multimodal understanding.

  14. [14]

    TextMonkey: An OCR-free large multimodal model for understanding document

    Yuliang Liu, Biao Yang, Qiang Liu, Zhang Li, Zhiyin Ma, Shuo Zhang, and Xiang Bai. TextMonkey: An OCR-free large multimodal model for understanding document. arXiv:2403.04473, 2024.

  15. [15]

    Pix2struct: Screenshot parsing as pretraining for visual language understanding

    Kenton Lee, Mandar Joshi, Iulia Raluca Turc, Hexiang Hu, Fangyu Liu, Julian Martin Eisenschlos, Urvashi Khandelwal, Peter Shaw, Ming-Wei Chang, and Kristina Toutanova. Pix2struct: Screenshot parsing as pretraining for visual language understanding. In International Conference on Machine Learning, pages 18893–18912. PMLR, 2023.

  16. [16]

    MiniCPM-V: A GPT-4V Level MLLM on Your Phone

    Yuan Yao, Tianyu Yu, Ao Zhang, Chongyi Wang, Junbo Cui, Hongji Zhu, Tianchi Cai, Haoyu Li, Weilin Zhao, Zhihui He, et al. MiniCPM-V: A GPT-4V level MLLM on your phone. arXiv:2408.01800, 2024.

  17. [17]

    VisRAG: Vision-based retrieval-augmented generation on multi-modality documents

    Shi Yu, Chaoyue Tang, Bokai Xu, Junbo Cui, Junhao Ran, Yukun Yan, Zhenghao Liu, Shuo Wang, Xu Han, Zhiyuan Liu, et al. VisRAG: Vision-based retrieval-augmented generation on multi-modality documents. arXiv:2410.10594, 2024.

  18. [18]

    ColPali: Efficient document retrieval with vision language models

    Manuel Faysse, Hugues Sibille, Tony Wu, Bilel Omrani, Gautier Viaud, Céline Hudelot, and Pierre Colombo. ColPali: Efficient document retrieval with vision language models. arXiv:2407.01449, 2024.

  19. [19]

    mPLUG-DocOwl: Modularized multimodal large language model for document understanding

    Jiabo Ye, Anwen Hu, Haiyang Xu, Qinghao Ye, Ming Yan, Yuhao Dan, Chenlin Zhao, Guohai Xu, Chenliang Li, Junfeng Tian, et al. mPLUG-DocOwl: Modularized multimodal large language model for document understanding. arXiv:2307.02499, 2023.

  20. [20]

    PDF-WuKong: A Large Multimodal Model for Efficient Long PDF Reading with End-to-End Sparse Sampling

    Xudong Xie, Hao Yan, Liang Yin, Yang Liu, Jing Ding, Minghui Liao, Yuliang Liu, Wei Chen, and Xiang Bai. PDF-WuKong: A large multimodal model for efficient long PDF reading with end-to-end sparse sampling. arXiv:2410.05970, 2024.

  21. [21]

    SlideVQA: A dataset for document visual question answering on multiple images

    Ryota Tanaka, Kyosuke Nishida, Kosuke Nishida, Taku Hasegawa, Itsumi Saito, and Kuniko Saito. SlideVQA: A dataset for document visual question answering on multiple images. In AAAI Conference on Artificial Intelligence, volume 37, pages 13636–13645, 2023.

  22. [22]

    MultimodalArXiv: A dataset for improving scientific comprehension of large vision-language models

    Linjie Li, Yuxuan Wang, Rui Xu, Peiyi Wang, Xinyun Feng, Lingpeng Kong, and Qun Liu. MultimodalArXiv: A dataset for improving scientific comprehension of large vision-language models. In Annual Meeting of the Association for Computational Linguistics, pages 14369–14387, 2024.

  23. [23]

    MVQA: A dataset for multimodal information retrieval in PDF-based visual question answering

    Yihao Ding, Kaixuan Ren, Jiabin Huang, Siwen Luo, and Soyeon Caren Han. MVQA: A dataset for multimodal information retrieval in PDF-based visual question answering. arXiv:2404.12720, 2024.

  24. [24]

    MMLongBench-Doc: Benchmarking long-context document understanding with visualizations

    Yubo Ma, Yuhang Zang, Liangyu Chen, Meiqi Chen, Yizhu Jiao, Xinze Li, Xinyuan Lu, Ziyu Liu, Yan Ma, Xiaoyi Dong, et al. MMLongBench-Doc: Benchmarking long-context document understanding with visualizations. In Advances in Neural Information Processing Systems, volume 37, pages 95963–96010, 2024.

  25. [25]

    LongDocURL: A comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating

    Chao Deng, Jiale Yuan, Pi Bu, Peijie Wang, Zhong-Zhi Li, Jian Xu, Xiao-Hui Li, Yuan Gao, Jun Song, Bo Zheng, et al. LongDocURL: A comprehensive multimodal long document benchmark integrating understanding, reasoning, and locating. In Annual Meeting of the Association for Computational Linguistics, pages 1135–1159, 2025.

  26. [26]

    MMDocIR: Benchmarking multimodal retrieval for long documents

    Kuicai Dong, Yujing Chang, Derrick Goh Xin Deik, Dexun Li, Ruiming Tang, and Yong Liu. MMDocIR: Benchmarking multimodal retrieval for long documents. In Conference on Empirical Methods in Natural Language Processing, pages 30971–31005, 2025.

  27. [27]

    M-LongDoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework

    Yew Ken Chia, Liying Cheng, Hou Pong Chan, Maojia Song, Chaoqun Liu, Mahani Aljunied, Soujanya Poria, and Lidong Bing. M-LongDoc: A benchmark for multimodal super-long document understanding and a retrieval-aware tuning framework. In Conference on Empirical Methods in Natural Language Processing, pages 9233–9250, 2025.

  28. [28]

    DocBench: A benchmark for evaluating LLM-based document reading systems

    Anni Zou, Wenhao Yu, Hongming Zhang, Kaixin Ma, Deng Cai, Zhuosheng Zhang, Hai Zhao, and Dong Yu. DocBench: A benchmark for evaluating LLM-based document reading systems. In International Workshop on Knowledge-Augmented Methods for Natural Language Processing, pages 359–373, 2025.

  29. [29]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020.

  30. [30]

    A simple yet effective layout token in large language models for document understanding

    Zhaoqing Zhu, Chuwei Luo, Zirui Shao, Feiyu Gao, Hangdi Xing, Qi Zheng, and Ji Zhang. A simple yet effective layout token in large language models for document understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 14472–14482, 2025.

  31. [31]

    Unifying multimodal retrieval via document screenshot embedding

    Xueguang Ma, Sheng-Chieh Lin, Minghan Li, Wenhu Chen, and Jimmy Lin. Unifying multimodal retrieval via document screenshot embedding. In Conference on Empirical Methods in Natural Language Processing, pages 6492–6505, 2024.

  32. [32]

    SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding

    Jian Chen, Ruiyi Zhang, Yufan Zhou, Tong Yu, Franck Dernoncourt, Jiuxiang Gu, Ryan A Rossi, Changyou Chen, and Tong Sun. SV-RAG: LoRA-contextualizing adaptation of MLLMs for long document understanding. arXiv:2411.01106, 2024.

  33. [33]

    VDocRAG: Retrieval-augmented generation over visually-rich documents

    Ryota Tanaka, Taichi Iki, Taku Hasegawa, Kyosuke Nishida, Kuniko Saito, and Jun Suzuki. VDocRAG: Retrieval-augmented generation over visually-rich documents. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 24827–24837, 2025.

  34. [34]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, volume 30, 2017.

  35. [35]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv:2010.11929, 2020.

  36. [36]

    Human reasoning and cognitive science

    Keith Stenning and Michiel Van Lambalgen. Human reasoning and cognitive science. MIT Press, 2008.

  37. [37]

    OpenAI o1 System Card

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. OpenAI o1 system card. arXiv:2412.16720, 2024.

  38. [38]

    DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Ruoyu Zhang, Runxin Xu, Qihao Zhu, Shirong Ma, Peiyi Wang, Xiao Bi, et al. DeepSeek-R1: Incentivizing reasoning capability in LLMs via reinforcement learning. arXiv:2501.12948, 2025.

  39. [39]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, Ice Pasupat, Noveen Sachdeva, Inderjit Dhillon, Marcel Blistein, Ori Ram, Dan Zhang, Evan Rosen, et al. Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv:2507.06261, 2025.

  40. [40]

    GPT-4o System Card

    Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perelman, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Welihinda, Alan Hayes, Alec Radford, et al. GPT-4o system card. arXiv:2410.21276, 2024.

  41. [41]

    A normalized Levenshtein distance metric

    Li Yujian and Liu Bo. A normalized Levenshtein distance metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, 29(6):1091–1095, 2007.

  42. [42]

    MM-Eureka: Exploring the Frontiers of Multimodal Reasoning with Rule-based Reinforcement Learning

    Fanqing Meng, Lingxiao Du, Zongkai Liu, Zhixiang Zhou, Quanfeng Lu, Daocheng Fu, Tiancheng Han, Botian Shi, Wenhai Wang, Junjun He, et al. MM-Eureka: Exploring the frontiers of multimodal reasoning with rule-based reinforcement learning. arXiv:2503.07365, 2025.

  43. [43]

    Cream: Coarse-to-fine retrieval and multi-modal efficient tuning for document VQA

    Jinxu Zhang, Yongqi Yu, and Yu Zhang. Cream: Coarse-to-fine retrieval and multi-modal efficient tuning for document VQA. In ACM International Conference on Multimedia, pages 925–934, 2024.

  44. [44]

    M3DocRAG: Multi-modal retrieval is what you need for multi-page multi-document understanding

    Jaemin Cho, Debanjan Mahata, Ozan Irsoy, Yujie He, and Mohit Bansal. M3DocRAG: Multi-modal retrieval is what you need for multi-page multi-document understanding. arXiv:2411.04952, 2024.

  45. [45]

    Docopilot: Improving multimodal models for document-level understanding

    Yuchen Duan, Zhe Chen, Yusong Hu, Weiyun Wang, Shenglong Ye, Botian Shi, Lewei Lu, Qibin Hou, Tong Lu, Hongsheng Li, et al. Docopilot: Improving multimodal models for document-level understanding. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 4026–4037, 2025.

  46. [46]

    DocVLM: Make your VLM an efficient reader

    Mor Shpigel Nacson, Aviad Aberdam, Roy Ganz, Elad Ben Avraham, Alona Golts, Yair Kittenplon, Shai Mazor, and Ron Litman. DocVLM: Make your VLM an efficient reader. In IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 29005–29015, 2025.

  47. [47]

    Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond

    Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. Qwen-VL: A versatile vision-language model for understanding, localization, text reading, and beyond. arXiv:2308.12966, 2023.

  48. [48]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Gemini Team, Petko Georgiev, Ving Ian Lei, Ryan Burnell, Libin Bai, Anmol Gulati, Garrett Tanzer, Damien Vincent, Zhufeng Pan, Shibo Wang, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv:2403.05530, 2024.

  49. [49]

    C-Pack: Packed Resources For General Chinese Embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-Pack: Packed resources for general Chinese embeddings. arXiv:2309.07597, 2023.

  50. [50]

    DocLLM: A layout-aware generative language model for multimodal document understanding

    Dongsheng Wang, Natraj Raman, Mathieu Sibue, Zhiqiang Ma, Petr Babkin, Simerjot Kaur, Yulong Pei, Armineh Nourbakhsh, and Xiaomo Liu. DocLLM: A layout-aware generative language model for multimodal document understanding. arXiv:2401.00908, 2023.
