FATHOMS-RAG: A Framework for the Assessment of Thinking and Observation in Multimodal Systems that use Retrieval Augmented Generation


arxiv: 2510.08945 · v3 · pith:R2KX3EQMnew · submitted 2025-10-10 · 💻 cs.AI


Samuel Hildebrand (1), Curtis Taylor (2), Sean Oesch (2), James M Ghawaly Jr (1), Amir Sadovnik (2), Ryan Shivers (2), Brandon Schreiber (2), Kevin Kurian (3)

(1) Louisiana State University; (2) Oak Ridge National Lab; (3) University of Florida


Pith reviewed 2026-05-25 08:09 UTC · model grok-4.3

classification 💻 cs.AI

keywords retrieval augmented generation · RAG evaluation · multimodal RAG · hallucination detection · benchmark dataset · closed-source vs open-source · cross-document reasoning


The pith

Closed-source RAG pipelines outperform open-source ones on multimodal and cross-document questions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FATHOMS-RAG, a benchmark that tests entire retrieval-augmented generation pipelines on their capacity to ingest, retrieve from, and reason over text, tables, images, and information spread across multiple documents. It supplies a set of 93 human-written questions, a phrase-level recall score to measure factual correctness, and a nearest-neighbor embedding check to flag hallucinations. When two pipelines built with open-source retrieval mechanisms are compared against pipelines built on four closed-source foundation models, the closed-source pipelines score higher on both metrics, and the advantage grows on questions that require combining modalities or crossing document boundaries. A separate third-party human review finds strong agreement with the automated scores.

Core claim

A benchmark consisting of 93 human-created questions, phrase-level recall, and nearest-neighbor embedding hallucination detection shows that closed-source RAG pipelines significantly outperform open-source pipelines in correctness and hallucination avoidance, with the performance gap widening on questions that depend on multimodal or cross-document information.

What carries the argument

The 93-question dataset spanning text, tables, images, and cross-document modalities, together with phrase-level recall for correctness and nearest-neighbor embedding classification for hallucination detection.
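
To make the correctness metric concrete, here is a minimal sketch of one way a phrase-level recall score could be computed. It is an illustration under stated assumptions, not the authors' implementation: the function names `normalize` and `phrase_recall`, the normalization rules, and the idea that each question carries a list of reference key phrases are all hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and replace every run of non-alphanumeric characters with one space."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()


def phrase_recall(answer: str, reference_phrases: list[str]) -> float:
    """Fraction of reference phrases found in the answer as whole-token matches.

    Assumption: each question has a hand-written list of key phrases that a
    correct answer should contain. The paper's exact matching rules may differ.
    """
    if not reference_phrases:
        raise ValueError("need at least one reference phrase")
    # Pad with spaces so that "199" does not match inside "1998".
    padded = f" {normalize(answer)} "
    hits = sum(1 for phrase in reference_phrases if f" {normalize(phrase)} " in padded)
    return hits / len(reference_phrases)


# Hypothetical example: two of the three expected phrases appear in the answer.
score = phrase_recall(
    "The reactor reached 3.5 MW in 1998.",
    ["3.5 MW", "1998", "Oak Ridge"],
)
```

A recall-style score like this rewards answers that contain the expected facts but does not penalize extra unsupported text, which is presumably why the benchmark pairs it with a separate hallucination check.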

If this is right

Closed-source pipelines maintain higher correctness and lower hallucination rates across all question types.
The advantage of closed-source systems increases when questions require combining information from images, tables, or multiple documents.
The phrase-level recall and embedding-based hallucination metrics align closely with third-party human judgments.
Open-source retrieval mechanisms remain a limiting factor when pipelines must reason over non-text modalities or distributed sources.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Open-source RAG development may need targeted improvements in multimodal retrieval and cross-document linking to close the observed gap.
Small human-curated test sets like this one could be used to guide iterative pipeline tuning before scaling to larger automated evaluations.
The benchmark's separation of ingestion, retrieval, and reasoning stages offers a template for diagnosing where a given pipeline fails.

Load-bearing premise

The 93 questions and the two automated metrics together give a representative, unbiased picture of any RAG pipeline's ability to handle multimodal and cross-document information.

What would settle it

A larger or differently sampled question set that shows open-source pipelines matching or exceeding closed-source performance on the same metrics, or human raters showing low agreement with the automated scores.

Figures

Figures reproduced from arXiv: 2510.08945 by Samuel Hildebrand (1), Curtis Taylor (2), Sean Oesch (2), James M Ghawaly Jr (1), Amir Sadovnik (2), Ryan Shivers (2), Brandon Schreiber (2), Kevin Kurian (3); (1) Louisiana State University, (2) Oak Ridge National Lab, (3) University of Florida.

**Figure 2.** Calculated Hallucination Rate for LlamaIndex Text Only RAG

**Figure 5.** Answer Retrieval Accuracy for RAG pipelines of Closed-Source APIs.

**Figure 4.** Calculated Hallucination Rate for Docling and EasyOCR RAG

**Figure 6.** Calculated Hallucination Rate for RAG pipelines of Closed-Source

Original abstract

Retrieval-augmented generation (RAG) has emerged as a promising paradigm for improving factual accuracy in large language models (LLMs). We introduce a benchmark designed to evaluate RAG pipelines as a whole, evaluating a pipeline's ability to ingest, retrieve, and reason about several modalities of information, differentiating it from existing benchmarks that focus on particular aspects such as retrieval. We present (1) a small, human-created dataset of 93 questions designed to evaluate a pipeline's ability to ingest textual data, tables, images, and data spread across these modalities in one or more documents; (2) a phrase-level recall metric for correctness; (3) a nearest-neighbor embedding classifier to identify potential pipeline hallucinations; (4) a comparative evaluation of 2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models; and (5) a third-party human evaluation of the alignment of our correctness and hallucination metrics. We find that closed-source pipelines significantly outperform open-source pipelines in both correctness and hallucination metrics, with wider performance gaps in questions relying on multimodal and cross-document information. Human evaluation of our metrics showed average agreement of 4.62 for correctness and 4.53 for hallucination detection on a 1-5 Likert scale (5 indicating "strongly agree").

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The benchmark idea for full multimodal RAG pipelines is useful but the 93-question dataset has no reported stratification or variance checks, so the claim of wider closed-source advantages on multimodal items rests on shaky ground.


The paper puts forward a benchmark meant to test entire RAG pipelines on multimodal inputs rather than just retrieval. It supplies a 93-question set covering text, tables, images, and cross-document cases, plus a phrase-level recall metric and a nearest-neighbor embedding classifier for hallucinations. The authors compare two open-source retrieval setups against four closed-source models and run a small human study that gives the metrics average agreement scores of 4.62 and 4.53 on a 1-5 scale. That human check and the focus on the full pipeline are the parts that actually move the needle beyond existing retrieval-only benchmarks. The metrics themselves are defined independently and the human ratings provide some external anchor.

The central finding is that closed-source pipelines score higher on both correctness and hallucination control, with the gap growing on the multimodal and cross-document questions. The problem is that nothing in the description shows how the 93 questions break down by modality or by single versus multi-document, and there is no variance analysis or power check. With a sample this small and two new metrics validated on only 22 items, any imbalance in question difficulty or category counts can produce the reported wider gaps without reflecting real pipeline differences. The abstract also skips details on exact model versions, question construction process, and statistical tests.

This work is aimed at people who build or evaluate multimodal RAG systems and want concrete test cases beyond standard retrieval metrics. A reader could borrow the metric definitions or the question style, but the comparative results are too thinly supported to cite as evidence. It deserves peer review because the target problem is real and the human validation step is a step in the right direction, but any referee would need to see the missing dataset breakdown and stronger statistical grounding before the performance claims can be taken at face value.

Referee Report

3 major / 2 minor

Summary. The paper introduces FATHOMS-RAG, a benchmark framework for holistic evaluation of RAG pipelines on multimodal (text/tables/images) and cross-document reasoning. It contributes (1) a 93-question human-created dataset spanning these modalities and single/cross-document cases, (2) a phrase-level recall metric for correctness, (3) a nearest-neighbor embedding classifier for hallucination detection, (4) comparative results on 2 open-source retrieval pipelines and 4 closed-source foundation models, and (5) human validation of the metrics (avg. Likert agreement 4.62/4.53 on a 22-question subset). The central claim is that closed-source pipelines significantly outperform open-source ones on both metrics, with wider gaps on multimodal and cross-document questions.

Significance. If the performance differences and their modality dependence hold under proper validation, the work supplies a needed end-to-end benchmark that goes beyond isolated retrieval metrics and supplies human-validated automatic proxies for correctness and hallucination. The explicit construction of a multimodal/cross-document test set and the third-party human agreement study are concrete strengths that could support reproducible pipeline comparisons.

major comments (3)

[Dataset construction] Dataset construction (abstract and § on data): the 93 questions are described only as “human-created” to cover modalities and single vs. cross-document cases, with no table or section reporting the actual counts per category, no stratification details, no inter-annotator agreement on question design, and no power or bootstrap analysis. This directly undermines the claim of “significantly wider performance gaps” on multimodal/cross-document items, as any imbalance in difficulty or distribution can produce the observed pattern with N=93.
[Comparative evaluation] Evaluation results (comparative evaluation section): no statistical tests, error bars, confidence intervals, or variance analysis across modalities are reported for the correctness and hallucination differences between the 2 open-source and 4 closed-source pipelines. The central claim that closed-source pipelines “significantly outperform” and exhibit “wider gaps” therefore lacks the quantitative support required to distinguish genuine capability differences from sampling artifacts.
[Human evaluation] Metric validation (human evaluation section): the phrase-level recall and nearest-neighbor embedding hallucination metrics are novel and are validated only against human ratings on a 22-question subset. Given that these metrics are load-bearing for all reported performance numbers, the small validation set size leaves open whether the high average Likert scores (4.62/4.53) generalize to the full 93-question set or to the multimodal subset.
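
The bootstrap analysis requested in the first two major comments can be sketched as follows. This is a minimal illustration, not something the paper reports: the function name `bootstrap_gap_ci`, the percentile method, and the assumption that both pipelines are scored on the same questions (so scores can be paired) are choices made here for the example.

```python
import random


def bootstrap_gap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap interval for mean(scores_a) - mean(scores_b).

    Assumption: scores_a[i] and scores_b[i] are per-question scores of two
    pipelines on the same question i, so questions are resampled as pairs.
    Returns (observed_gap, lower_bound, upper_bound).
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty score lists of equal length")
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_boot):
        # Resample n questions with replacement and record the mean gap.
        resample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(resample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_boot)]
    upper = means[min(n_boot - 1, int((1 - alpha / 2) * n_boot))]
    return sum(diffs) / n, lower, upper
```

Running this separately on each modality subset would show the referee's concern directly: with only a handful of questions in a category, the interval around the gap is likely to be wide, and a "wider gap" on multimodal items is only convincing if the intervals for the subsets are clearly separated.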

minor comments (2)

[Abstract] The abstract states “2 pipelines built with open-source retrieval mechanisms and 4 closed-source foundation models” but does not name the exact model versions or retrieval implementations; this information should be added for reproducibility.
[Metric definitions] Notation for the nearest-neighbor embedding classifier (e.g., how the decision threshold is chosen and whether it is fixed or tuned) is not fully specified in the metric definition section.
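
The second minor comment is easier to see with a concrete sketch of a nearest-neighbor embedding classifier. Everything specific here is an assumption made for illustration: the paper's embedding model, labeled example set, distance measure, and choice of `k` are not given in this summary, and the names `cosine` and `knn_label` are hypothetical.

```python
import math
from collections import Counter


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def knn_label(query, labeled, k=1):
    """Majority label among the k labeled vectors most similar to the query.

    Assumptions: `labeled` is a list of (embedding, label) pairs computed
    beforehand by some embedding model, and labels are strings such as
    "grounded" or "hallucination".
    """
    if not labeled:
        raise ValueError("need at least one labeled example")
    ranked = sorted(labeled, key=lambda item: cosine(query, item[0]), reverse=True)
    votes = Counter(label for _, label in ranked[:k])
    return votes.most_common(1)[0][0]


# Toy 2-D "embeddings"; real ones would have hundreds of dimensions.
examples = [
    ([1.0, 0.0], "grounded"),
    ([0.9, 0.1], "grounded"),
    ([0.0, 1.0], "hallucination"),
]
label_k1 = knn_label([0.1, 0.9], examples, k=1)
label_k3 = knn_label([0.1, 0.9], examples, k=3)
```

In this toy case the same query flips from one label to the other when `k` changes from 1 to 3, which illustrates why the referee asks whether such settings are fixed in advance or tuned.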

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below, indicating planned revisions where feasible while being transparent about limitations in the original work.


Referee: [Dataset construction] Dataset construction (abstract and § on data): the 93 questions are described only as “human-created” to cover modalities and single vs. cross-document cases, with no table or section reporting the actual counts per category, no stratification details, no inter-annotator agreement on question design, and no power or bootstrap analysis. This directly undermines the claim of “significantly wider performance gaps” on multimodal/cross-document items, as any imbalance in difficulty or distribution can produce the observed pattern with N=93.

Authors: We agree that a detailed breakdown of dataset composition is needed for transparency. In the revised manuscript we will add a table reporting question counts by modality (text/tables/images) and by single- versus cross-document setting. The questions were designed by one primary author with team review; no formal inter-annotator agreement was collected. We will also add an explicit limitations paragraph noting the small fixed N=93 and the absence of power or bootstrap analysis. We maintain that the patterns are directionally informative but accept that stronger claims about modality-specific gaps require these details. revision: partial
Referee: [Comparative evaluation] Evaluation results (comparative evaluation section): no statistical tests, error bars, confidence intervals, or variance analysis across modalities are reported for the correctness and hallucination differences between the 2 open-source and 4 closed-source pipelines. The central claim that closed-source pipelines “significantly outperform” and exhibit “wider gaps” therefore lacks the quantitative support required to distinguish genuine capability differences from sampling artifacts.

Authors: We accept that the lack of statistical support weakens the central claims. In revision we will add confidence intervals for all reported metrics, perform paired statistical tests (e.g., Wilcoxon signed-rank or McNemar) on pipeline differences, and include modality-stratified variance where sample sizes permit. We will also discuss the limited statistical power given the modest total N and smaller subcategory sizes. revision: yes
Referee: [Human evaluation] Metric validation (human evaluation section): the phrase-level recall and nearest-neighbor embedding hallucination metrics are novel and are validated only against human ratings on a 22-question subset. Given that these metrics are load-bearing for all reported performance numbers, the small validation set size leaves open whether the high average Likert scores (4.62/4.53) generalize to the full 93-question set or to the multimodal subset.

Authors: The 22-question subset was selected to cover all modalities and question types, but we acknowledge its small size limits generalizability. In revision we will document the selection criteria, report any available per-category agreement, and add a limitations statement that the high Likert scores provide only preliminary evidence and do not guarantee performance on the full set or multimodal questions. Expanding validation to all 93 questions would require substantial additional human effort that was not feasible in the original study. revision: partial
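
One of the paired tests named in the second response, McNemar's test, can be sketched with the standard library alone. This is an illustration rather than the authors' planned analysis: it assumes each answer has been reduced to a binary correct/incorrect outcome (for example by thresholding the recall score, which is itself an assumption), and the function name `mcnemar_exact` is hypothetical.

```python
from math import comb


def mcnemar_exact(correct_a, correct_b):
    """Two-sided exact McNemar p-value for paired binary outcomes.

    correct_a[i] and correct_b[i] say whether pipelines A and B answered
    question i correctly. Only discordant pairs (one right, one wrong)
    carry information about a difference between the pipelines.
    """
    if len(correct_a) != len(correct_b):
        raise ValueError("outcome lists must have equal length")
    a_only = sum(1 for x, y in zip(correct_a, correct_b) if x and not y)
    b_only = sum(1 for x, y in zip(correct_a, correct_b) if y and not x)
    n = a_only + b_only
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of a difference
    k = min(a_only, b_only)
    # Under the null hypothesis, discordant pairs split as Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes: A is right and B wrong on all five discordant questions.
p_value = mcnemar_exact([True] * 5 + [True] * 3, [False] * 5 + [True] * 3)
```

Even this one-sided split of five discordant questions gives a p-value of 0.0625, above the conventional 0.05 threshold, which illustrates the limited statistical power the authors acknowledge for small subcategories.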

Circularity Check

0 steps flagged

No significant circularity; metrics and claims are independently defined


The paper introduces a fixed human-created dataset of 93 questions and two metrics (phrase-level recall for correctness; nearest-neighbor embedding classifier for hallucinations) that are defined prior to and independently of any pipeline evaluation results. These are then applied to compare open- and closed-source pipelines, with a separate human validation study (average Likert agreement 4.62/4.53) providing external alignment check. No equations, fitted parameters renamed as predictions, self-definitional constructs, or load-bearing self-citations appear in the derivation; the performance-gap claims remain empirical observations on fixed inputs rather than reductions to the inputs themselves.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The framework depends on the assumption that the small human-curated question set captures the target capabilities; no free parameters or invented entities are described.

axioms (1)

Domain assumption: The 93 questions are representative of real-world multimodal and cross-document RAG challenges.
Benchmark validity and the reported performance gaps rest on this selection being comprehensive and unbiased.

pith-pipeline@v0.9.0 · 5832 in / 1215 out tokens · 55795 ms · 2026-05-25T08:09:40.094435+00:00 · methodology


Reference graph

Works this paper leans on

17 extracted references · 17 canonical work pages · 2 internal anchors

[1] Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen, “Mmlu-pro: A more robust and challenging multi-task language understanding benchmark,” in Advances in Neural Information Processing Systems, A. Globerson, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. T... (2024)

[2] X. Yue, Y. Ni, T. Zheng, K. Zhang, R. Liu, G. Zhang, S. Stevens, D. Jiang, W. Ren, Y. Sun et al., “Mmmu: A massive multi-discipline multimodal understanding and reasoning benchmark for expert agi,” in 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). IEEE, 2024, pp. 9556–9567

[3] L. Fan, W. Hua, X. Li, K. Zhu, M. Jin, L. Li, H. Ling, J. Chi, J. Wang, X. Ma, and Y. Zhang, “Nphardeval4v: A dynamic reasoning benchmark of multimodal large language models,” 2024. [Online]. Available: https://arxiv.org/abs/2403.01777

[4] J. Kil, Z. Mai, J. Lee, A. Chowdhury, Z. Wang, K. Cheng, L. Wang, Y. Liu, and W.-L. H. Chao, “Mllm-compbench: A comparative reasoning benchmark for multimodal llms,” Advances in Neural Information Processing Systems, vol. 37, pp. 28 798–28 827, 2024

[5] M.-H. Guo, J. Xu, Y. Zhang, J. Song, H. Peng, Y.-X. Deng, X. Dong, K. Nakayama, Z. Geng, C. Wang et al., “Rbench: Graduate-level multi-disciplinary benchmarks for llm & mllm complex reasoning evaluation,” in Forty-second International Conference on Machine Learning, 2025

[6] Y. Hao, J. Gu, H. W. Wang, L. Li, Z. Yang, L. Wang, and Y. Cheng, “Can mllms reason in multimodality? emma: An enhanced multimodal reasoning benchmark,” in Forty-second International Conference on Machine Learning, 2025

[7] X. Yang, K. Sun, H. Xin, Y. Sun, N. Bhalla, X. Chen, S. Choudhary, R. D. Gui, Z. W. Jiang, Z. Jiang, L. Kong, B. Moran, J. Wang, Y. E. Xu, A. Yan, C. Yang, E. Yuan, H. Zha, N. Tang, L. Chen, N. Scheffer, Y. Liu, N. Shah, R. Wanga, A. Kumar, W.-t. Yih, and X. L. Dong, “Crag - comprehensive rag benchmark,” in Advances in Neural Information Processing Syst... (2024)

[8] S. Es, J. James, L. Espinosa-Anke, and S. Schockaert, “Ragas: Automated evaluation of retrieval augmented generation,” arXiv preprint arXiv:2309.15217, 2023

[9] S. Farquhar, J. Kossen, L. Kuhn, and Y. Gal, “Detecting hallucinations in large language models using semantic entropy,” Nature, vol. 630, no. 8017, pp. 625–630, 2024

[10] W. Su, C. Wang, Q. Ai, Y. Hu, Z. Wu, Y. Zhou, and Y. Liu, “Unsupervised real-time hallucination detection based on the internal states of large language models,” Findings of the Association for Computational Linguistics, 2024

[11] D. Wang, N. Raman, M. Sibue, Z. Ma, P. Babkin, S. Kaur, Y. Pei, A. Nourbakhsh, and X. Liu, “Docllm: A layout-aware generative language model for multimodal document understanding,” 2023. [Online]. Available: https://arxiv.org/abs/2401.00908

[12] U. Utkarsh, P. Cai, A. Edelman, R. Gomez-Bombarelli, and C. V. Rackauckas, “Physics-constrained flow matching: Sampling generative models with hard constraints,” 2025. [Online]. Available: https://arxiv.org/abs/2506.04171

[13] S. Oesch, P. Austria, A. Chaulagain, B. Weber, C. Watson, M. Dixson, and A. Sadovnik, “The path to autonomous cyber defense,” 2024. [Online]. Available: https://arxiv.org/abs/2404.10788

[14] C. Yang, C. Shi, Y. Liu, B. Shui, J. Wang, M. Jing, L. Xu, X. Zhu, S. Li, Y. Zhang, G. Liu, X. Nie, D. Cai, and Y. Yang, “Chartmimic: Evaluating lmm’s cross-modal reasoning capability via chart-to-code generation,” 2025. [Online]. Available: https://arxiv.org/abs/2406.09961
TABLE II: FULL LIST OF CORRECTNESS AND HALLUCINATION SCORES ACROSS ALL EVALUATIONS Pi...

[15] C. J. Wang, D. Lee, C. Menghini, J. Mols, J. Doughty, A. Khoja, J. Lynch, S. Hendryx, S. Yue, and D. Hendrycks, “Enigmaeval: A benchmark of long multimodal reasoning challenges,” 2025. [Online]. Available: https://arxiv.org/abs/2502.08859

[16] A. Zhmoginov, M. Sandler, and M. Vladymyrov, “Hypertransformer: Model generation for supervised and semi-supervised few-shot learning,” 2022. [Online]. Available: https://arxiv.org/abs/2201.04182

[17] Z. Zhang, A. Zhang, M. Li, H. Zhao, G. Karypis, and A. Smola, “Multimodal chain-of-thought reasoning in language models,” 2024. [Online]. Available: https://arxiv.org/abs/2302.00923
APPENDIX A: FULL TABLE OF EVALUATION RESULTS. All gathered correctness and hallucination scores are included in Table A. All of the code and raw result JSON files containing pipel...
