Doc-V*: Coarse-to-Fine Interactive Visual Reasoning for Multi-Page Document VQA
Pith reviewed 2026-05-10 12:48 UTC · model grok-4.3
The pith
Doc-V* casts multi-page document VQA as an agentic sequence of thumbnail overview, targeted page fetching, and evidence aggregation in structured working memory.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Doc-V* is an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation: it begins with a thumbnail overview, navigates via semantic retrieval and targeted page fetching, and aggregates evidence in structured working memory for grounded reasoning. The system is trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization.
What carries the argument
The agentic coarse-to-fine navigation process paired with structured working memory, which applies selective attention over fetched pages rather than ingesting the full document or relying on a fixed retrieval set.
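To make that loop concrete, here is a minimal sketch in Python. The interfaces (`document.render_thumbnails`, `document.retrieve`, `agent.decide`, and so on) are hypothetical stand-ins, not the paper's API; only the action vocabulary (page retrieval, page fetching, answering) and the per-page summaries written to working memory are taken from the paper's description and case studies.

```python
# Minimal sketch of Doc-V*'s coarse-to-fine loop. All interfaces here are
# assumed for illustration; the paper specifies the behavior, not this API.

def coarse_to_fine_answer(document, question, agent, max_fetches=8):
    thumbnails = document.render_thumbnails()   # coarse overview of every page
    working_memory = []                         # structured store: (page_index, summary)

    for _ in range(max_fetches):
        # The agent conditions on the question, the thumbnails, and the
        # evidence gathered so far, then emits one action.
        action = agent.decide(question, thumbnails, working_memory)

        if action.kind == "page_retrieval":
            # Semantic retrieval over thumbnails proposes candidate pages,
            # returned to the agent as feedback for its next step.
            candidates = document.retrieve(action.query)
            agent.observe(candidates)
        elif action.kind == "fetch_page":
            # Fetch one page at full resolution and distill its evidence.
            page_image = document.fetch(action.page_index)
            summary = agent.summarize(question, page_image)
            working_memory.append((action.page_index, summary))
        elif action.kind == "answer":
            return action.text                  # grounded in aggregated evidence

    # Budget exhausted: answer from whatever evidence was collected.
    return agent.answer(question, working_memory)
```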
If this is right
- The approach scales to arbitrarily long documents by fetching pages on demand instead of increasing context length.
- Out-of-domain generalization improves substantially, with gains of up to 47.9% over a RAG baseline across five benchmarks.
- Evidence aggregation succeeds through selective attention rather than by ingesting more pages.
- Imitation from expert trajectories produces navigation strategies that balance answer accuracy against the number of page fetches.
Where Pith is reading between the lines
- Similar agentic navigation with working memory could apply to other long-context visual tasks such as multi-image or video question answering where evidence is sparse.
- The framework implies that learned selective fetching may reduce the need for ever-larger context windows in multimodal models.
- Developing automatic ways to generate navigation trajectories would lower the barrier to applying the method beyond domains with human experts.
Load-bearing premise
High-quality expert trajectories must be available to train the imitation learning so the agent learns reliable navigation and evidence selection without introducing retrieval errors.
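To make the premise concrete, here is a minimal sketch of what the imitation stage could look like, assuming a standard token-level behavior-cloning setup in PyTorch. The one detail taken from the paper is the loss masking: environment feedback (returned page images, working-memory contents) serves only as conditioning context, and the loss is applied only to tokens the agent itself generated.

```python
import torch
import torch.nn.functional as F

def imitation_loss(logits, target_ids, agent_token_mask):
    """Behavior cloning on an expert trajectory (illustrative setup).

    logits:           (batch, seq, vocab) model outputs over the trajectory
    target_ids:       (batch, seq) expert trajectory tokens
    agent_token_mask: (batch, seq) 1 for agent-generated tokens (reasoning,
                      retrieval queries, fetch actions, final answer);
                      0 for environment feedback, which only conditions.
    """
    per_token = F.cross_entropy(
        logits.reshape(-1, logits.size(-1)),
        target_ids.reshape(-1),
        reduction="none",
    ).reshape(target_ids.shape)
    mask = agent_token_mask.float()
    # Average only over agent tokens; environment tokens contribute no gradient.
    return (per_token * mask).sum() / mask.sum().clamp(min=1)
```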
What would settle it
Performance drops sharply on a test set of multi-page documents where relevant evidence lies in pages that semantic retrieval from thumbnails consistently misses, even though the answer is present in the full document.
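One way to build such a stress set, sketched under the assumption that each benchmark example annotates its gold evidence pages; `retriever.rank_pages` and the example fields are hypothetical names, not an existing benchmark interface.

```python
def hard_retrieval_split(benchmark, retriever, rank_cutoff=5):
    """Keep only questions whose gold evidence pages are missed by
    thumbnail-level semantic retrieval (all interfaces assumed)."""
    hard = []
    for ex in benchmark:
        thumbs = ex.document.render_thumbnails()
        ranking = retriever.rank_pages(ex.question, thumbs)  # best-first page indices
        # Every gold page must fall outside the retriever's top-k, so a
        # full-document reader could still answer the question but
        # retrieval-guided navigation is likely to miss the evidence.
        if all(ranking.index(p) >= rank_cutoff for p in ex.evidence_pages):
            hard.append(ex)
    return hard
```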
Original abstract
Multi-page Document Visual Question Answering requires reasoning over semantics, layouts, and visual elements in long, visually dense documents. Existing OCR-free methods face a trade-off between capacity and precision: end-to-end models scale poorly with document length, while visual retrieval-based pipelines are brittle and passive. We propose Doc-V*, an OCR-free agentic framework that casts multi-page DocVQA as sequential evidence aggregation. Doc-V* begins with a thumbnail overview, then actively navigates via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. Trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization, Doc-V* balances answer accuracy with evidence-seeking efficiency. Across five benchmarks, Doc-V* outperforms open-source baselines and approaches proprietary models, improving out-of-domain performance by up to 47.9% over a RAG baseline. Further results show that the gains come from effective evidence aggregation with selective attention, not from increased input pages.
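For orientation, the standard GRPO objective from the literature is sketched below; this is background, not the paper's exact formulation, and how the reward r_i combines answer accuracy with evidence-seeking cost is not specified in the excerpt. For a group of G trajectories o_1, ..., o_G sampled for the same question q:

```latex
% Standard GRPO (sequence-level form); background, not taken from the paper.
\hat{A}_i = \frac{r_i - \operatorname{mean}(r_1,\dots,r_G)}{\operatorname{std}(r_1,\dots,r_G)},
\qquad
\rho_i = \frac{\pi_\theta(o_i \mid q)}{\pi_{\theta_{\mathrm{old}}}(o_i \mid q)},
\qquad
\mathcal{J}(\theta) =
\mathbb{E}\!\left[\frac{1}{G}\sum_{i=1}^{G}
\min\!\Big(\rho_i \hat{A}_i,\;
\operatorname{clip}(\rho_i,\,1-\epsilon,\,1+\epsilon)\,\hat{A}_i\Big)\right]
- \beta\, D_{\mathrm{KL}}\!\left(\pi_\theta \,\|\, \pi_{\mathrm{ref}}\right).
```

Rewards are normalized only within the group, which removes the need for a learned value model; a natural instantiation here would reward correct answers and penalize each page fetch, though the excerpt does not state the exact reward composition.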
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Doc-V*, an OCR-free agentic framework for multi-page Document Visual Question Answering that casts the task as sequential evidence aggregation. It starts with a thumbnail overview, performs active navigation via semantic retrieval and targeted page fetching, and aggregates evidence in a structured working memory for grounded reasoning. The system is trained by imitation learning from expert trajectories and further optimized with Group Relative Policy Optimization (GRPO) to balance accuracy and evidence-seeking efficiency. Across five benchmarks, it reports outperforming open-source baselines, approaching proprietary models, and achieving up to 47.9% improvement in out-of-domain performance over a RAG baseline, with gains attributed to selective attention rather than simply processing more pages.
Significance. If the empirical claims hold under rigorous verification, the work advances agentic approaches to long-document visual reasoning by showing how interactive navigation and structured memory can mitigate the capacity-precision trade-off in OCR-free DocVQA. The combination of imitation learning from trajectories with GRPO for efficiency-accuracy trade-offs, plus the emphasis on out-of-domain generalization, provides a concrete path forward for handling visually dense multi-page documents. Reproducible code or detailed trajectory datasets would further strengthen the contribution.
Major comments (3)
- [§4.2] Expert Trajectories and Imitation Learning: The description of expert trajectory collection, validation, and coverage across document layouts is insufficiently detailed. This is load-bearing for the central performance claims, including the 47.9% out-of-domain gain, because unreliable or biased trajectories would directly undermine the learned navigation policy and evidence aggregation in the working memory.
- [Table 3, §5.1] Ablations and Component Isolation: The ablations do not sufficiently isolate the contribution of agentic navigation and selective attention from base-model effects or retrieval quality. Without these controls, it is unclear whether the reported improvements stem from the proposed framework or from unstated differences in training data or optimization, weakening the claim that gains arise from 'effective evidence aggregation with selective attention, not increased input pages.'
- [§5.3] Error Analysis and Failure Modes: No quantitative error analysis is provided for navigation failures, retrieval misses, or propagation of errors from incomplete trajectories. This is critical to evaluate the stress-test concern that agentic advantages may collapse when semantic retrieval misses key pages in visually dense documents.
Minor comments (3)
- [Abstract] The abstract states results 'across five benchmarks' without naming them; this should be specified early for clarity.
- [§3] Notation for the structured working memory (e.g., how evidence is stored and retrieved) would benefit from an explicit equation or pseudocode in §3; a minimal illustrative sketch of what this could look like appears after this list.
- [Figure 2] Figure 2 (qualitative examples) could include more failure cases to illustrate the limits of selective attention.
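For illustration, here is the kind of minimal pseudocode the second comment asks for, with all names assumed rather than taken from the paper; its case studies suggest entries pair a page index with a question-conditioned summary.

```python
from dataclasses import dataclass, field

@dataclass
class WorkingMemory:
    """Illustrative structure only; field and method names are assumptions."""
    entries: list = field(default_factory=list)   # [(page_index, summary_text)]

    def write(self, page_index: int, summary: str) -> None:
        # One entry per fetched page: a question-conditioned evidence summary.
        self.entries.append((page_index, summary))

    def read(self) -> str:
        # Serialized back into the agent's context at every decision step.
        return "\n".join(f"[Page {i}] {s}" for i, s in self.entries)
```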
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights important areas for strengthening the presentation of our method and results. We address each major comment below and have revised the manuscript to incorporate additional details, controls, and analyses as described.
Point-by-point responses
- Referee: [§4.2] Expert Trajectories and Imitation Learning: The description of expert trajectory collection, validation, and coverage across document layouts is insufficiently detailed. This is load-bearing for the central performance claims, including the 47.9% out-of-domain gain, because unreliable or biased trajectories would directly undermine the learned navigation policy and evidence aggregation in the working memory.
  Authors: We agree that §4.2 provided insufficient detail on trajectory collection. In the revised manuscript we have substantially expanded this section to describe the expert selection process, annotation guidelines, validation procedure (including inter-annotator agreement statistics), and quantitative coverage across document layouts, page counts, and visual densities. A new appendix supplies representative trajectory examples and a breakdown table. These additions directly address the concern that the 47.9% out-of-domain gain could rest on unreliable trajectories. (Revision: yes)
- Referee: [Table 3, §5.1] Ablations and Component Isolation: The ablations do not sufficiently isolate the contribution of agentic navigation and selective attention from base-model effects or retrieval quality. Without these controls, it is unclear whether the reported improvements stem from the proposed framework or from unstated differences in training data or optimization, weakening the claim that gains arise from 'effective evidence aggregation with selective attention, not increased input pages.'
  Authors: We acknowledge that the original ablations in Table 3 and §5.1 did not fully isolate the agentic navigation component from base-model or retrieval effects. In the revision we have added new controlled experiments that hold the underlying vision-language model and retrieval corpus fixed while varying only the navigation policy and working-memory structure. These results are now reported alongside the existing ablations and support the claim that performance gains derive from selective evidence aggregation rather than simply processing more pages or differences in training data. (Revision: yes)
- Referee: [§5.3] Error Analysis and Failure Modes: No quantitative error analysis is provided for navigation failures, retrieval misses, or propagation of errors from incomplete trajectories. This is critical to evaluate the stress-test concern that agentic advantages may collapse when semantic retrieval misses key pages in visually dense documents.
  Authors: We agree that a quantitative error analysis is necessary. We have conducted and added to the revised §5.3 a breakdown of failure modes across all five benchmarks, reporting the frequency of navigation failures, semantic-retrieval misses, and downstream accuracy drops when key pages are omitted. We also compare error propagation rates between Doc-V* and the RAG baseline on visually dense subsets, directly addressing the robustness concern. (Revision: yes)
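As an illustration of the kind of breakdown the authors describe (not their actual analysis), a sketch that tallies failure categories from logged runs, assuming each log records the retrieved candidates, fetched pages, and gold evidence pages:

```python
from collections import Counter

def failure_mode_breakdown(runs):
    """Categorize incorrect runs by where the pipeline first failed
    (log schema assumed for illustration)."""
    tally = Counter()
    for run in runs:
        if run.correct:
            continue
        gold = set(run.evidence_pages)
        if not gold & set(run.retrieved_pages):
            tally["retrieval_miss"] += 1        # retrieval never surfaced the evidence
        elif not gold & set(run.fetched_pages):
            tally["navigation_failure"] += 1    # surfaced but never fetched
        else:
            tally["reasoning_error"] += 1       # evidence fetched, answer still wrong
    return tally
```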
Circularity Check
No circularity: purely empirical system description with no derivations
Full rationale
The paper describes an agentic framework for multi-page DocVQA, trained via imitation learning from expert trajectories and optimized with Group Relative Policy Optimization, then reports empirical benchmark results. No equations, mathematical derivations, or first-principles claims appear in the provided text. Performance numbers (e.g., 47.9% out-of-domain gain) are presented as measured outcomes of training and evaluation, not as quantities forced by construction from fitted inputs or self-citations. The central claims rest on external benchmarks and ablations rather than internal reductions, satisfying the self-contained empirical criterion.
Reference graph
Works this paper leans on
- [2] ColPali: Efficient Document Retrieval with Vision Language Models. arXiv:2407.01449, 2024.
- [3] mPLUG-DocOwl2: High-Resolution Compressing for OCR-Free Multi-Page Document Understanding. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 5817–5834.
- [4] Hierarchical Multimodal Transformers for Multipage DocVQA. Pattern Recognition, 144:109834, 2023.