Document-as-Image Representations Fall Short for Scientific Retrieval
Pith reviewed 2026-05-10 03:13 UTC · model grok-4.3
The pith
Text-based representations outperform document-as-image approaches for scientific document retrieval.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Document-as-image representations are consistently suboptimal for scientific retrieval, and increasingly so as document length grows; text-based representations are most effective, even on figure queries, because they can exploit captions and surrounding context; and interleaved text-plus-image representations outperform document-as-image methods without requiring specialized training.
What carries the argument
The ArXivDoc benchmark, built from LaTeX sources to give direct access to structured elements such as sections, tables, figures, and equations for controlled query construction based on specific evidence types.
If this is right
- Retrieval systems for scientific literature should favor text or interleaved representations over pure page-image embeddings.
- Performance differences between methods grow with document length, indicating that image approaches have trouble with information spread across many pages.
- Figure queries can be answered effectively by text models that read captions and context rather than by processing the visual content of the figure itself.
- Interleaved text-plus-image models can surpass document-as-image approaches using existing training methods.
- Benchmarks that treat documents only as page images may overstate the usefulness of image-based embedding models.
Where Pith is reading between the lines
- Future multimodal document models could improve by keeping text structure intact instead of converting everything to rendered images.
- The same comparison of representations might produce similar patterns in other text-heavy domains such as legal contracts or technical manuals.
- System builders might reconsider training corpora that rely on rendered page images when the target domain contains structured text and tables.
- Testing the same models on queries that require cross-referencing multiple distant sections could further highlight where text advantages appear.
Load-bearing premise
The ArXivDoc benchmark and its controlled queries based on specific evidence types fairly represent real-world scientific retrieval needs and the distribution of evidence in documents.
What would settle it
A new retrieval test set drawn from actual user search logs on scientific papers in which image-based embeddings match or exceed text-based performance on longer documents or figure queries.
Original abstract
Many recent document embedding models are trained on document-as-image representations, embedding rendered pages as images rather than the underlying source. Meanwhile, existing benchmarks for scientific document retrieval, such as ArXivQA and ViDoRe, treat documents as images of pages, implicitly favoring such representations. In this work, we argue that this paradigm is not well-suited for text-rich multimodal scientific documents, where critical evidence is distributed across structured sources, including text, tables, and figures. To study this setting, we introduce ArXivDoc, a new benchmark constructed from the underlying LaTeX sources of scientific papers. Unlike PDF or image-based representations, LaTeX provides direct access to structured elements (e.g., sections, tables, figures, equations), enabling controlled query construction grounded in specific evidence types. We systematically compare text-only, image-based, and multimodal representations across both single-vector and multi-vector retrieval models. Our results show that: (1) document-as-image representations are consistently suboptimal, especially as document length increases; (2) text-based representations are most effective, even for figure-based queries, by leveraging captions and surrounding context; and (3) interleaved text+image representations outperform document-as-image approaches without requiring specialized training.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces ArXivDoc, a new benchmark for scientific document retrieval constructed directly from LaTeX sources of arXiv papers. It performs a systematic empirical comparison of text-only, document-as-image, and interleaved text+image representations using both single-vector and multi-vector retrieval models. The central claims are that document-as-image representations are consistently suboptimal (especially as document length grows), text-based representations are most effective even for figure-based queries by exploiting captions and context, and interleaved text+image approaches outperform pure document-as-image methods without requiring specialized training.
Significance. If the results hold under more detailed scrutiny, the work is significant for challenging the recent trend of training document embedding models on rendered page images. It supplies a structured, evidence-type-controlled benchmark that existing image-centric evaluations (e.g., ViDoRe) lack, and it supplies concrete directional evidence favoring text-centric and interleaved strategies for text-rich scientific literature. The provision of LaTeX-derived data and controlled query construction is a clear methodological strength that enables future reproducible comparisons.
major comments (3)
- [§3] ArXivDoc benchmark construction: The procedure for generating queries from specific LaTeX evidence types (sections, tables, figures, equations) is described at a high level but lacks the exact mapping rules, prompt templates, or filtering criteria used. This detail is load-bearing for the claim that text models succeed on figure queries via context; without it, the possibility remains that figure queries systematically include caption or surrounding text, creating a benchmark-specific advantage for text representations over pure image models.
- [§4] Experimental setup and results: The manuscript does not report the precise retrieval metrics (e.g., nDCG@10, Recall@K, MRR), the number of queries per evidence category, the document collection size, or any statistical significance tests for the observed performance gaps. These omissions make it difficult to assess the robustness and magnitude of the reported superiority of text and interleaved representations.
- [§4.3] Length analysis: The finding that document-as-image performance degrades with increasing document length requires an explicit definition of length (pages, tokens, or element count) and a breakdown by query type; the current presentation leaves it unclear whether the trend is driven by a few long documents or holds consistently.
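To make the metric request concrete, here is a minimal sketch of nDCG@10 under binary relevance. This is not the paper's evaluation code, and the relevance labels in the example are hypothetical; it only illustrates what the referee is asking the authors to report.

```python
import math

def dcg_at_k(relevances, k):
    """Discounted cumulative gain over the top-k ranked relevances."""
    return sum(rel / math.log2(rank + 2)  # rank 0 -> discount log2(2) = 1
               for rank, rel in enumerate(relevances[:k]))

def ndcg_at_k(relevances, k):
    """nDCG@k: DCG normalized by the ideal (sorted-descending) ranking."""
    ideal = dcg_at_k(sorted(relevances, reverse=True), k)
    return dcg_at_k(relevances, k) / ideal if ideal > 0 else 0.0

# Binary relevance of the top-10 retrieved documents for one query
# (1 = gold evidence element, 0 = not); hypothetical labels.
ranked = [1, 0, 1, 0, 0, 0, 0, 0, 0, 0]
print(round(ndcg_at_k(ranked, 10), 3))  # → 0.92
```

Per-query scores like this, averaged per evidence category and accompanied by significance tests, would let readers judge the magnitude of the reported gaps.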
minor comments (2)
- A summary table listing all evaluated models, their representation types (text/image/interleaved), and whether they are single- or multi-vector would improve readability of the experimental design.
- [Abstract] The abstract states that interleaved representations 'outperform document-as-image approaches without requiring specialized training'; the manuscript should clarify which specific interleaved models were used and whether any fine-tuning occurred.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed comments, which help strengthen the clarity and reproducibility of our work on the ArXivDoc benchmark. We address each major comment below and will revise the manuscript accordingly to incorporate the requested details.
Point-by-point responses
Referee: [§3] ArXivDoc benchmark construction: The procedure for generating queries from specific LaTeX evidence types (sections, tables, figures, equations) is described at a high level but lacks the exact mapping rules, prompt templates, or filtering criteria used. This detail is load-bearing for the claim that text models succeed on figure queries via context; without it, the possibility remains that figure queries systematically include caption or surrounding text, creating a benchmark-specific advantage for text representations over pure image models.
Authors: We agree that additional specificity on query generation is needed to support the claims about text models leveraging context for figure queries. In the revised manuscript, we will expand §3 with the exact mapping rules from LaTeX elements to queries, the full prompt templates employed, and the filtering criteria (e.g., exclusion of queries that directly copy surrounding text or captions). This will demonstrate that figure queries are constructed to isolate visual evidence while allowing text models to use only the provided context, without systematic leakage.
Revision: yes
Referee: [§4] Experimental setup and results: The manuscript does not report the precise retrieval metrics (e.g., nDCG@10, Recall@K, MRR), the number of queries per evidence category, the document collection size, or any statistical significance tests for the observed performance gaps. These omissions make it difficult to assess the robustness and magnitude of the reported superiority of text and interleaved representations.
Authors: We will update §4 to explicitly state the retrieval metrics computed (nDCG@10, Recall@10, MRR), the exact number of queries per evidence category (sections, tables, figures, equations), the total document collection size, and the results of statistical significance tests (paired t-tests with p-values) on the performance differences between representation types. These additions will allow readers to better evaluate the magnitude and reliability of the observed trends.
Revision: yes
Referee: [§4.3] Length analysis: The finding that document-as-image performance degrades with increasing document length requires an explicit definition of length (pages, tokens, or element count) and a breakdown by query type; the current presentation leaves it unclear whether the trend is driven by a few long documents or holds consistently.
Authors: We define document length as the number of pages in the rendered PDF (consistent with the image-based models' input). In the revised §4.3, we will include a per-query-type breakdown (e.g., performance curves for figure queries vs. section queries) and additional analysis showing the trend across length bins, with checks to confirm it is not driven by outliers. This will clarify the consistency of the degradation for document-as-image representations.
Revision: yes
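The length-bin analysis the authors promise can be sketched as follows. The bin edges and the `(pages, score)` pairs below are hypothetical illustrations, not values from the paper; the point is that averaging within page-count bins exposes whether degradation is a consistent trend or an artifact of a few long documents.

```python
from collections import defaultdict

def mean_score_by_length_bin(results, bin_edges=(8, 16, 32, 64)):
    """Average per-query retrieval scores within page-count bins.
    `results` is a list of (pages_in_gold_document, score) pairs;
    documents longer than the last edge fall into an overflow bin."""
    bins = defaultdict(list)
    for pages, score in results:
        # Upper edge of the first bin the document fits in (inf = overflow).
        edge = next((hi for hi in bin_edges if pages <= hi), float("inf"))
        bins[edge].append(score)
    return {edge: round(sum(s) / len(s), 3) for edge, s in sorted(bins.items())}

# Hypothetical (pages, nDCG@10) pairs for one representation type.
results = [(5, 0.80), (7, 0.78), (20, 0.60), (30, 0.55), (70, 0.40)]
print(mean_score_by_length_bin(results))  # → {8: 0.79, 32: 0.575, inf: 0.4}
```

Reporting per-bin query counts alongside these means would directly address the referee's outlier concern.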
Circularity Check
No circularity: purely empirical benchmark comparison
full rationale
The paper introduces ArXivDoc benchmark from LaTeX sources and reports retrieval performance comparisons across text-only, image-based, and multimodal representations. No equations, derivations, fitted parameters, or predictions appear in the abstract or described content. Claims rest on direct experimental results rather than any self-definitional reduction, self-citation load-bearing argument, or ansatz smuggled via prior work. The analysis is self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Standard information retrieval metrics and evaluation protocols are appropriate for comparing document representations.