Document-as-Image Representations Fall Short for Scientific Retrieval
Pith reviewed 2026-05-10 03:13 UTC · model grok-4.3
The pith
Text-based representations outperform document-as-image approaches for scientific document retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Document-as-image representations are consistently suboptimal for scientific retrieval, and increasingly so as document length grows; text-based representations are most effective, even on figure queries, because they can exploit captions and surrounding context; and interleaved text-plus-image representations outperform document-as-image methods without requiring specialized training.
What carries the argument
The ArXivDoc benchmark, built from LaTeX sources to give direct access to structured elements such as sections, tables, figures, and equations for controlled query construction based on specific evidence types.
If this is right
- Retrieval systems for scientific literature should favor text or interleaved representations over pure page-image embeddings.
- Performance differences between methods grow with document length, indicating that image approaches have trouble with information spread across many pages.
- Figure queries can be answered effectively by text models that read captions and context rather than by processing the visual content of the figure itself.
- Interleaved text-plus-image models can surpass document-as-image approaches using existing training methods.
- Benchmarks that treat documents only as page images may overstate the usefulness of image-based embedding models.
Where Pith is reading between the lines
- Future multimodal document models could improve by keeping text structure intact instead of converting everything to rendered images.
- The same comparison of representations might produce similar patterns in other text-heavy domains such as legal contracts or technical manuals.
- System builders might reconsider training corpora that rely on rendered page images when the target domain contains structured text and tables.
- Testing the same models on queries that require cross-referencing multiple distant sections could further highlight where text advantages appear.
Load-bearing premise
The ArXivDoc benchmark and its controlled queries based on specific evidence types fairly represent real-world scientific retrieval needs and the distribution of evidence in documents.
What would settle it
A new retrieval test set drawn from actual user search logs on scientific papers in which image-based embeddings match or exceed text-based performance on longer documents or figure queries.
Original abstract
Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ArXivDoc, a new benchmark for scientific document retrieval constructed directly from LaTeX sources of arXiv papers. It performs a systematic empirical comparison of text-only, document-as-image, and interleaved text+image representations using both single-vector and multi-vector retrieval models. The central claims are that document-as-image representations are consistently suboptimal (especially as document length grows), text-based representations are most effective even for figure-based queries by exploiting captions and context, and interleaved text+image approaches outperform pure document-as-image methods without requiring specialized training.
Significance. If the results hold under more detailed scrutiny, the work is significant for challenging the recent trend of training document embedding models on rendered page images. It supplies a structured, evidence-type-controlled benchmark that existing image-centric evaluations (e.g., ViDoRe) lack, and it supplies concrete directional evidence favoring text-centric and interleaved strategies for text-rich scientific literature. The provision of LaTeX-derived data and controlled query construction is a clear methodological strength that enables future reproducible comparisons.
major comments (3)
- [§3] ArXivDoc benchmark construction: The procedure for generating queries from specific LaTeX evidence types (sections, tables, figures, equations) is described at a high level but lacks the exact mapping rules, prompt templates, or filtering criteria used. This detail is load-bearing for the claim that text models succeed on figure queries via context; without it, the possibility remains that figure queries systematically include caption or surrounding text, creating a benchmark-specific advantage for text representations over pure image models.
- [§4] Experimental setup and results: The manuscript does not report the precise retrieval metrics (e.g., nDCG@10, Recall@K, MRR), the number of queries per evidence category, the document collection size, or any statistical significance tests for the observed performance gaps. These omissions make it difficult to assess the robustness and magnitude of the reported superiority of text and interleaved representations.
- [§4.3] Length analysis: The finding that document-as-image performance degrades with increasing document length requires an explicit definition of length (pages, tokens, or element count) and a breakdown by query type; the current presentation leaves it unclear whether the trend is driven by a few long documents or holds consistently.
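To make the metric request concrete, here is a minimal sketch of nDCG@10 under binary relevance. This is not the paper's evaluation code, and the relevance labels in the example are hypothetical; it only illustrates what the referee is asking the authors to report.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(rel / math.log2(rank + 2)  # rank 0 -> discount log2(2) = 1
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG normalized by the ideal (sorted-descending) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance of the top-10 retrieved documents for one query
# (1 = gold evidence element, 0 = not); hypothetical labels.
ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(ranked, 10), 3))  # → 0.92
```

Per-query scores like this, averaged per evidence category and accompanied by significance tests, would let readers judge the magnitude of the reported gaps.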
minor comments (2)
- A summary table listing all evaluated models, their representation types (text/image/interleaved), and whether they are single- or multi-vector would improve readability of the experimental design.
- [Abstract] The abstract states that interleaved representations 'outperform document-as-image approaches without requiring specialized training'; the manuscript should clarify which specific interleaved models were used and whether any fine-tuning occurred.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help strengthen the clarity and reproducibility of our work on the ArXivDoc benchmark. We address each major comment below and will revise the manuscript accordingly to incorporate the requested details.
Point-by-point responses
Referee: [§3] ArXivDoc benchmark construction: The procedure for generating queries from specific LaTeX evidence types (sections, tables, figures, equations) is described at a high level but lacks the exact mapping rules, prompt templates, or filtering criteria used. This detail is load-bearing for the claim that text models succeed on figure queries via context; without it, the possibility remains that figure queries systematically include caption or surrounding text, creating a benchmark-specific advantage for text representations over pure image models.
Authors: We agree that additional specificity on query generation is needed to support the claims about text models leveraging context for figure queries. In the revised manuscript, we will expand §3 with the exact mapping rules from LaTeX elements to queries, the full prompt templates employed, and the filtering criteria (e.g., exclusion of queries that directly copy surrounding text or captions). This will demonstrate that figure queries are constructed to isolate visual evidence while allowing text models to use only the provided context, without systematic leakage.
Revision: yes
Referee: [§4] Experimental setup and results: The manuscript does not report the precise retrieval metrics (e.g., nDCG@10, Recall@K, MRR), the number of queries per evidence category, the document collection size, or any statistical significance tests for the observed performance gaps. These omissions make it difficult to assess the robustness and magnitude of the reported superiority of text and interleaved representations.
Authors: We will update §4 to explicitly state the retrieval metrics computed (nDCG@10, Recall@10, MRR), the exact number of queries per evidence category (sections, tables, figures, equations), the total document collection size, and the results of statistical significance tests (paired t-tests with p-values) on the performance differences between representation types. These additions will allow readers to better evaluate the magnitude and reliability of the observed trends.
Revision: yes
Referee: [§4.3] Length analysis: The finding that document-as-image performance degrades with increasing document length requires an explicit definition of length (pages, tokens, or element count) and a breakdown by query type; the current presentation leaves it unclear whether the trend is driven by a few long documents or holds consistently.
Authors: We define document length as the number of pages in the rendered PDF (consistent with the image-based models' input). In the revised §4.3, we will include a per-query-type breakdown (e.g., performance curves for figure queries vs. section queries) and additional analysis showing the trend across length bins, with checks to confirm it is not driven by outliers. This will clarify the consistency of the degradation for document-as-image representations.
Revision: yes
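The length-bin analysis the authors promise can be sketched as follows. The bin edges and the `(pages, score)` pairs below are hypothetical illustrations, not values from the paper; the point is that averaging within page-count bins exposes whether degradation is a consistent trend or an artifact of a few long documents.

```python
from collections import defaultdict

def mean_score_by_length_bin(results, bin_edges=(8, 16, 32, 64)):
    """Average per-query retrieval scores within page-count bins.
    `results` is a list of (pages_in_gold_document, score) pairs;
    documents longer than the last edge fall into an overflow bin."""
    bins = defaultdict(list)
    for pages, score in results:
        # Upper edge of the first bin the document fits in (inf = overflow).
        edge = next((hi for hi in bin_edges if pages <= hi), float("inf"))
        bins[edge].append(score)
    return {edge: round(sum(s) / len(s), 3) for edge, s in sorted(bins.items())}

# Hypothetical (pages, nDCG@10) pairs for one representation type.
results = [(5, 0.80), (7, 0.78), (20, 0.60), (30, 0.55), (70, 0.40)]
print(mean_score_by_length_bin(results))  # → {8: 0.79, 32: 0.575, inf: 0.4}
```

Reporting per-bin query counts alongside these means would directly address the referee's outlier concern.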
Circularity Check
No circularity: purely empirical benchmark comparison
full rationale
The paper introduces ArXivDoc benchmark from LaTeX sources and reports retrieval performance comparisons across text-only, image-based, and multimodal representations. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. Claims rest on direct experimental results rather than any self-definitional reduction, self-citation load-bearing argument, or ansatz smuggled via prior work. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard information retrieval metrics and evaluation protocols are appropriate for comparing document representations.