arxiv: 2606.07252 · v2 · pith:F4GP5TFC · submitted 2026-06-05 · 💻 cs.IR

Constrained Dominant Sets for Multimodal Document Question Answering

Pith reviewed 2026-06-27 20:43 UTC · model grok-4.3

classification 💻 cs.IR
keywords constrained dominant sets · multimodal document question answering · retrieval-augmented generation · affinity graph · replicator dynamics · evidence selection · VisDoMBench · MMLongBench-Doc

The pith

Constrained dominant sets on a query-augmented affinity graph select complementary evidence for multimodal document question answering.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that similarity-based retrievers in multimodal RAG waste capacity on near-duplicates in long documents and miss complementary evidence. It replaces ranking with selection of a Constrained Dominant Set on a query-augmented affinity graph. The query becomes a hard structural constraint that every chosen item must satisfy through the cluster anchor. Relevance-redundancy balance is set automatically by a spectral bound rather than manual tuning, and replicator dynamics drive the process to global equilibrium without greedy steps or training. These properties produce higher reader accuracy on VisDoMBench and MMLongBench-Doc.
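
The near-duplicate problem described above can be illustrated with a minimal sketch. The 2-D vectors below are made-up toy embeddings (an assumption for illustration, not data from the paper): two notes restate the same finding, one is complementary, one is unrelated. Plain top-k cosine ranking spends both slots on the restated finding.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


# Hypothetical toy embeddings (illustrative only).
query = (1.0, 0.2)
notes = [
    (1.0, 0.10),  # note 0: a finding stated in the text
    (1.0, 0.12),  # note 1: the same finding restated in a caption
    (0.7, 0.70),  # note 2: complementary evidence
    (0.0, 1.00),  # note 3: unrelated content
]

# Rank notes by similarity to the query and keep the top two.
scores = [cosine(query, n) for n in notes]
top2 = sorted(range(len(notes)), key=lambda i: scores[i], reverse=True)[:2]

# How similar are the two selected notes to each other?
overlap = cosine(notes[top2[0]], notes[top2[1]])
print("selected:", sorted(top2), "mutual similarity:", round(overlap, 4))
```

Both selected notes are the near-duplicates, so the complementary note never reaches the reader under a budget of two.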

Core claim

The central claim is that Constrained Dominant Set selection on a query-augmented affinity graph supplies three concrete advantages over similarity ranking: the query is encoded as a hard structural constraint, so every selected element connects directly to it; the relevance-redundancy tradeoff is fixed by a spectral bound instead of hand-tuned parameters; and the selection reaches global equilibrium through replicator dynamics, avoiding heuristic distortions. The method is graph-based and training-free. With a fixed Qwen3-VL-32B reader, it reaches a 66.99 average on VisDoMBench and improves on the no-retrieval baseline by 37.1 points on VisDoMBench and 4.8 points on MMLongBench-Doc.

What carries the argument

Constrained Dominant Set (CDS) selection on a query-augmented affinity graph, which treats the query as a structural anchor and uses spectral bounds plus replicator dynamics to enforce relevance without redundancy or manual tuning.
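
The abstract does not spell out the optimization problem. For orientation, the standard constrained dominant set program from the CDS literature is written out below; the paper's exact matrix and constants may differ, so treat the symbols ($A$ for the affinity matrix, $S$ for the constraint set, here assumed to be the single query vertex, and $\hat{I}_S$ for the penalty diagonal) as assumptions.

```latex
% Standard CDS program (assumed form; the paper may adapt it)
\max_{x \in \Delta} \; f_S^{\alpha}(x) = x^{\top}\bigl(A - \alpha \hat{I}_S\bigr)\,x,
\qquad
\Delta = \Bigl\{ x \in \mathbb{R}^{n} : x_i \ge 0,\ \sum_{i} x_i = 1 \Bigr\}

% Penalty applies only to vertices outside the constraint set S
(\hat{I}_S)_{ii} =
\begin{cases}
  1 & \text{if } i \notin S \\
  0 & \text{if } i \in S
\end{cases}

% Spectral condition that forces the constraint into every solution
\alpha > \lambda_{\max}\bigl(A_{V \setminus S}\bigr)
\;\Longrightarrow\;
\sigma(x^{*}) \cap S \neq \emptyset
\quad \text{for every local maximizer } x^{*}
```

Here $\sigma(x^{*})$ is the support of $x^{*}$ and $A_{V \setminus S}$ is the affinity matrix restricted to the unconstrained vertices. In this formulation the spectral condition on $\alpha$ is what turns the query into a hard constraint.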

If this is right

  • Every selected evidence item must connect directly to the query through the cluster anchor.
  • Relevance-redundancy balance is fixed automatically by the spectral bound, removing manual trade-off parameters.
  • The selection reaches global equilibrium via replicator dynamics rather than greedy heuristics.
  • The method yields a 66.99 average on VisDoMBench and improves on the no-retrieval baseline by 37.1 points on VisDoMBench and 4.8 points on MMLongBench-Doc.
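
A minimal pure-Python sketch of how the first three points could play out on a toy graph. Everything here is an assumption for illustration: the affinity values are invented, the formulation is the standard one from the CDS literature rather than the paper's confirmed implementation, and the largest row sum is used as a cheap upper bound on the largest eigenvalue instead of computing the spectrum.

```python
def row_sum_bound(A, constraint):
    """Upper bound on lambda_max of A restricted to unconstrained vertices.

    For a symmetric nonnegative matrix the largest eigenvalue never exceeds
    the largest row sum, so any alpha above this value satisfies the bound.
    """
    free = [i for i in range(len(A)) if i not in constraint]
    return max(sum(A[i][j] for j in free) for i in free)


def constrained_payoff(A, constraint, alpha):
    """Nonnegative payoff matrix: A minus alpha on unconstrained diagonals,
    plus alpha on every entry (a constant shift on the simplex)."""
    n = len(A)
    B = [[A[i][j] + alpha for j in range(n)] for i in range(n)]
    for i in range(n):
        if i not in constraint:
            B[i][i] -= alpha
    return B


def replicator_dynamics(B, iters=5000, tol=1e-12):
    """Discrete-time replicator dynamics started from the simplex barycenter."""
    n = len(B)
    x = [1.0 / n] * n
    for _ in range(iters):
        Bx = [sum(B[i][j] * x[j] for j in range(n)) for i in range(n)]
        xBx = sum(x[i] * Bx[i] for i in range(n))
        new_x = [x[i] * Bx[i] / xBx for i in range(n)]
        converged = max(abs(a - b) for a, b in zip(new_x, x)) < tol
        x = new_x
        if converged:
            break
    return x


def objective(A, constraint, alpha, x):
    """Unshifted CDS objective: x^T A x - alpha * sum of x_i^2 over free vertices."""
    n = len(A)
    quad = sum(A[i][j] * x[i] * x[j] for i in range(n) for j in range(n))
    penalty = alpha * sum(x[i] ** 2 for i in range(n) if i not in constraint)
    return quad - penalty


# Hypothetical toy graph. Vertex 0 is the query; 1 and 2 are near-duplicate
# notes; 3 is complementary; 4 is weakly relevant. Query-note entries are
# relevance scores, note-note entries are dissimilarities.
A = [
    [0.00, 0.90, 0.90, 0.80, 0.10],
    [0.90, 0.00, 0.05, 0.90, 0.90],
    [0.90, 0.05, 0.00, 0.90, 0.90],
    [0.80, 0.90, 0.90, 0.00, 0.90],
    [0.10, 0.90, 0.90, 0.90, 0.00],
]
QUERY = {0}

alpha = row_sum_bound(A, QUERY) + 1e-3
x = replicator_dynamics(constrained_payoff(A, QUERY, alpha))
for i, w in enumerate(x):
    print(f"vertex {i}: weight {w:.4f}")
```

The weight vector stays on the simplex at every step, and the query vertex keeps positive weight. Notes could then be selected by thresholding their weights; the threshold is a design choice this sketch leaves open.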

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same graph construction could be tested on single-modality document QA to isolate whether the gains come from handling figures and captions.
  • Because the approach needs no training, it offers a drop-in replacement for existing retrievers in production RAG pipelines.
  • The spectral-bound mechanism might extend to other diversity-aware retrieval problems where parameter-free balance is required.

Load-bearing premise

The query-augmented affinity graph can be built so that Constrained Dominant Set selection automatically enforces a hard query constraint and sets the relevance-redundancy balance through its spectral bound.
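
One way such a graph could be assembled is sketched below, following the Figure 1 caption: note-note entries carry dissimilarity, the query is an extra vertex, and the query carries a self-loop α. The use of cosine similarity, the query-note relevance term, the clamping at zero, and the name `build_affinity` are all assumptions made for illustration; the paper's actual construction (including structural and VLM-verified links) is not reproduced here.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))


def build_affinity(query_vec, note_vecs, alpha):
    """Hypothetical query-augmented affinity matrix.

    Index 0 is the query vertex, indices 1..n are the notes.
      M[0][0]            = alpha (query self-loop)
      M[0][i] = M[i][0]  = relevance of note i to the query (assumed: cosine)
      M[i][j] = M[j][i]  = dissimilarity between notes i and j (assumed: 1 - cosine)
    """
    vecs = [query_vec] + list(note_vecs)
    n = len(vecs)
    M = [[0.0] * n for _ in range(n)]
    M[0][0] = alpha
    for i in range(1, n):
        relevance = max(0.0, cosine(query_vec, vecs[i]))
        M[0][i] = M[i][0] = relevance
        for j in range(i + 1, n):
            diversity = max(0.0, 1.0 - cosine(vecs[i], vecs[j]))
            M[i][j] = M[j][i] = diversity
    return M


# Toy 2-D embeddings (illustrative only).
query = (1.0, 0.2)
notes = [(1.0, 0.10), (1.0, 0.12), (0.7, 0.70), (0.0, 1.00)]

M = build_affinity(query, notes, alpha=0.5)  # alpha value is a placeholder
for row in M:
    print(" ".join(f"{v:.3f}" for v in row))
```

With this construction the two near-duplicate notes share an almost-zero edge, so selecting both adds little to the quadratic objective, while each of them remains strongly tied to the query.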

What would settle it

Running the CDS retriever on VisDoMBench and obtaining scores no higher than the no-retrieval baseline or standard similarity retrievers would falsify the performance advantage.

Figures

Figures reproduced from arXiv: 2606.07252 by Ambuj Mehrish, Sebastiano Vascon.

Figure 1: Overview of the proposed pipeline. (1) The document is parsed into atomic nodes (text, figures, tables), each embedded as $z_i \in \mathbb{R}^d$ and connected into a multimodal graph $G_D$ via structural and VLM-verified semantic links. (2) The query $q$ is added as a constraint vertex with a spectral self-loop $\alpha$, forming the query-augmented affinity matrix $M$ that combines note–note dissimilarity (diversity) with query–note …
Figure 2: Cosine vs. CDS evidence selection on the per-query note map (2-D Principal Component Analysis, PCA); …
Original abstract

Long multimodal document question answering is limited by which evidence reaches the reader, rather than by the quantity retrieved. In lengthy documents, findings often recur across figures, captions, and introductory sentences, causing similarity-based retrievers in modern multimodal retrieval-augmented generation (RAG) systems to allocate resources to near-duplicates while overlooking complementary evidence. This work introduces a retriever that selects evidence as a Constrained Dominant Set (CDS) on a query-augmented affinity graph, offering three advantages that similarity ranking does not. First, the query is encoded as a hard structural constraint, ensuring that every selected element is directly connected to the question through the cluster anchor. Second, the relevance-redundancy balance is determined automatically by a spectral bound, eliminating the need for the manually tuned trade-offs required by diversity-aware selectors. Third, the selection process achieves a global equilibrium via replicator dynamics, thereby avoiding the distortions introduced by greedy heuristics. The method is inherently graph-based and does not require training. Using a Qwen3-VL-32B reader, CDS establishes a new state of the art on VisDoMBench ($66.99$ average) and improves over the no-retrieval baseline by $37.1$ points on VisDoMBench and $4.8$ on MMLongBench-Doc.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The manuscript introduces Constrained Dominant Sets (CDS) on a query-augmented affinity graph as a training-free retriever for evidence selection in long multimodal document QA. It claims three advantages over similarity ranking: (1) the query acts as a hard structural constraint via the cluster anchor, (2) relevance-redundancy balance is set automatically by a spectral bound, and (3) selection reaches global equilibrium via replicator dynamics. Using a Qwen3-VL-32B reader, the method reports a new SOTA of 66.99 average on VisDoMBench and gains of 37.1 points on VisDoMBench and 4.8 points on MMLongBench-Doc over the no-retrieval baseline.

Significance. If the empirical gains and the three claimed properties hold under full implementation details, the work would offer a principled, parameter-free alternative to tuned diversity-aware retrievers in multimodal RAG. The training-free nature and grounding in standard CDS properties (hard constraint, spectral bound, replicator dynamics) constitute a clear strength.

major comments (2)
  1. [Abstract] Abstract: the central empirical claim (SOTA at 66.99 and the 37.1 / 4.8 point gains) is stated without any derivation, graph-construction details, ablation studies, or error analysis, so it is impossible to assess whether the data support the three advantages or the SOTA assertion.
  2. [Abstract] Abstract (weakest assumption): the claim that the query-augmented affinity graph plus CDS automatically enforces a hard query constraint, sets relevance-redundancy via spectral bound, and reaches equilibrium via replicator dynamics without training or manual tuning is asserted but not derived or illustrated with even a small example; this is load-bearing for all three listed advantages.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the abstract and the need for clearer support of the central claims. We address each point below and will revise the manuscript to improve clarity and accessibility while preserving the core contributions.

  1. Referee: [Abstract] Abstract: the central empirical claim (SOTA at 66.99 and the 37.1 / 4.8 point gains) is stated without any derivation, graph-construction details, ablation studies, or error analysis, so it is impossible to assess whether the data support the three advantages or the SOTA assertion.

    Authors: The abstract serves as a concise summary; the full manuscript provides the requested details in Section 3.1 (graph construction), Section 3.2 (CDS formulation and properties), Section 4.3 (ablations), and Section 4.4 (error analysis). The reported SOTA of 66.99 and gains of 37.1 / 4.8 points are obtained with the Qwen3-VL-32B reader on VisDoMBench and MMLongBench-Doc. To address the concern, we will revise the abstract to briefly reference the key methodological components and direct readers to the relevant sections. revision: yes

  2. Referee: [Abstract] Abstract (weakest assumption): the claim that the query-augmented affinity graph plus CDS automatically enforces a hard query constraint, sets relevance-redundancy via spectral bound, and reaches equilibrium via replicator dynamics without training or manual tuning is asserted but not derived or illustrated with even a small example; this is load-bearing for all three listed advantages.

    Authors: These three properties are direct consequences of the CDS definition applied to the query-augmented graph, as formalized in Section 3.2: the query node serves as the cluster anchor (hard constraint), the spectral radius of the affinity matrix supplies the relevance-redundancy bound, and replicator dynamics converge to the equilibrium of the quadratic program. The manuscript cites the foundational CDS results for these guarantees. We agree that an explicit illustration would strengthen the exposition and will add a small synthetic example in the revised Section 3 demonstrating the process on a toy graph. revision: yes

Circularity Check

0 steps flagged

No significant circularity identified

The paper's core construction builds a query-augmented affinity graph and applies Constrained Dominant Set selection via replicator dynamics. The three listed advantages (hard query constraint via cluster anchor, spectral bound for relevance-redundancy, and global equilibrium) are presented as direct consequences of standard CDS properties rather than derived predictions or fitted parameters. No equations or steps are shown that reduce the output selection to the input by construction, and the reported gains (37.1 / 4.8 points, 66.99 SOTA) are empirical benchmark results. The method is explicitly training-free and graph-based with no load-bearing self-citations or ansatzes invoked in the abstract. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only input supplies no equations, parameters, or explicit assumptions beyond the high-level description of the graph and dynamics.

pith-pipeline@v0.9.1-grok · 5763 in / 1121 out tokens · 20652 ms · 2026-06-27T20:43:26.296996+00:00 · methodology
