pith. machine review for the scientific record.

arxiv: 2604.21304 · v2 · submitted 2026-04-23 · 💻 cs.IR

Recognition: unknown

PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 21:06 UTC · model grok-4.3

classification 💻 cs.IR
keywords benchmark · multimodal LLMs · scientific reasoning · agentic reasoning · paper understanding · critical assessment · experimental interpretation · multimodal grounding

The pith

PaperMind is a benchmark that evaluates multimodal LLMs on integrated scientific reasoning over full research papers using four complementary tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PaperMind to address the limitation that existing benchmarks assess scientific paper understanding abilities in isolation. It constructs a benchmark from real papers in seven domains with four task families that test multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. This setup allows analysis of model behavior across tasks to diagnose integrated reasoning capabilities that isolated tests miss. A sympathetic reader would care because it provides a more realistic way to measure how well AI systems can handle the complex, interacting skills needed for actual scientific work.
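
To make the construction concrete, here is a minimal sketch of how one benchmark instance might be represented, assuming a flat record per (paper, task family) pair. Every field name below is an illustrative guess, not the schema of the released dataset.

```python
# A hypothetical sketch of a PaperMind-style instance record (field names
# are assumptions for illustration; the released dataset is authoritative).
from dataclasses import dataclass, field

TASK_FAMILIES = (
    "multimodal_grounding",
    "experimental_interpretation",
    "cross_source_evidence_reasoning",
    "critical_assessment",
)

DOMAINS = (
    "agriculture", "biology", "chemistry", "computer_science",
    "medicine", "physics", "economics",
)

@dataclass
class PaperMindInstance:
    paper_id: str                 # e.g. an arXiv identifier
    domain: str                   # one of DOMAINS
    family: str                   # one of TASK_FAMILIES
    question: str                 # the task prompt shown to the model
    context: dict = field(default_factory=dict)  # intro text, figures, tables
    reference: str = ""           # ground-truth answer or caption

    def __post_init__(self) -> None:
        # Guard against malformed records when loading the dataset.
        assert self.domain in DOMAINS, self.domain
        assert self.family in TASK_FAMILIES, self.family
```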

Core claim

PaperMind is constructed from real scientific papers across seven domains and comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PaperMind enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique.

What carries the argument

The PaperMind benchmark with its four task families that together test how models ground text and visuals, interpret experiments, synthesize across sources, and critique claims on real papers.

If this is right

  • Models exhibit persistent challenges in integrated scientific reasoning and critique that single-task tests overlook.
  • Performance gaps appear consistently across both open-source and closed-source multimodal LLMs.
  • The benchmark supports diagnostic analysis by comparing results across the linked task families rather than in isolation.
  • It becomes possible to evaluate how well models handle real papers drawn from agriculture, biology, chemistry, computer science, medicine, physics, and economics.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Future model training could target the connections between the four task types to improve agentic performance in scientific settings.
  • The multi-task structure might apply to other document-heavy domains where reasoning steps must interact, such as legal or technical reports.
  • If the tasks prove linked in practice, evaluation standards could move away from isolated question answering toward full-workflow simulations.

Load-bearing premise

That the four task families validly capture the interacting cognitive steps people use when reading and evaluating scientific papers.

What would settle it

A follow-up study in which models excel on each task separately yet fail to combine them successfully in an agent-style workflow over the same papers.
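
One hedged way to operationalize that follow-up study: score each family in isolation, then rerun the same papers through a chained, agent-style pass in which each family also sees the model's outputs from the earlier families. Everything in this sketch (run_task, the scoring convention, the context-passing) is an assumption for illustration, not the paper's released evaluation code.

```python
# A hypothetical harness for the settling experiment above. run_task, the
# [0, 1] scoring convention, and the context-passing are all illustrative
# assumptions, not the paper's released evaluation code.
from statistics import mean

FAMILIES = [
    "multimodal_grounding",
    "experimental_interpretation",
    "cross_source_evidence_reasoning",
    "critical_assessment",
]

def run_task(model, paper, family, context=None):
    """Return (model_output, score in [0, 1]) for one family on one paper."""
    raise NotImplementedError  # stub: plug in the actual task runner here

def evaluate(model, papers):
    # Isolated protocol: each family sees only the paper itself.
    isolated = {
        f: mean(run_task(model, p, f)[1] for p in papers) for f in FAMILIES
    }
    # Chained, agent-style protocol: each family also receives the model's
    # outputs from the earlier families on the same paper.
    chained = []
    for paper in papers:
        context, scores = {}, []
        for family in FAMILIES:
            output, score = run_task(model, paper, family, context=context)
            context[family] = output
            scores.append(score)
        chained.append(mean(scores))
    # Strong isolated scores alongside weak chained scores on the same papers
    # would be exactly the failure pattern described above.
    return isolated, mean(chained)
```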

Figures

Figures reproduced from arXiv: 2604.21304 by Hanghang Tong, Jiaru Zou, Jingrui He, Lingjie Chen, Simmi Rana, Tianxin Wei, Wendy H. Yang, Xuying Ning, Yanjun Zhao, Yuanchen Bei.

Figure 1. Overview of the design and scope of the PaperMind benchmark.
Figure 2. Question distribution across seven scientific domains. The benchmark includes four categories of tasks.
Figure 3. Number of papers (top) and average PDF length in pages (bottom) across different domains.

Multimodal Ground: in this task, the LLM is provided with a figure extracted from a scientific paper, along with the paper's introduction as contextual background, and must generate a concise and accurate caption describing the figure's content; the original figure caption serves as the ground truth.

Figure 4. Impact of maximum number of tool calls on Cross-Source Evidence Reasoning. Dashed lines indicate …
Figure 7. A case study illustrating the reasoning process.
Figure 6. Eight-way error taxonomy proportions on Critical Assessment.
Figure 8. Overview of the design and scope of the four tasks.
Figure 9. Prompt setting with introduction input for Multimodal Ground.
Figure 10. Prompt setting with introduction input for Experimental Interpretation.
Figure 11. Prompt setting for Cross-Source Evidence Reasoning and Critical Assessment. Tool-invocation prompts …
Figure 12. Prompt used for LLM-as-a-Judge with GPT-4o, where the model evaluates each prediction on a 5-point scale.
Figure 13. Eight-way error taxonomy proportions on Cross-Source Evidence Reasoning using Qwen3-VL-4B-Instruct (higher is worse), annotated by Gemini-2.5-Pro.
original abstract

Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PaperMind, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PaperMind is constructed from real scientific papers across seven domains, including agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PaperMind enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at https://github.com/Yanjun-Zhao/PaperMind.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 3 minor

Summary. The paper introduces PaperMind, a benchmark built from real scientific papers across seven domains (agriculture, biology, chemistry, computer science, medicine, physics, economics). It defines four task families—multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment—intended to evaluate integrated, agentic scientific reasoning and critique in multimodal LLMs. The central claim is that analyzing model behavior across these families enables diagnostic evaluation of interacting cognitive abilities that isolated-task benchmarks cannot capture; experiments on open- and closed-source models reveal consistent performance gaps.

Significance. If the integration claim holds, PaperMind would provide a useful diagnostic tool for identifying weaknesses in multimodal LLMs' ability to synthesize evidence and critique claims across modalities and sources. The use of real papers from diverse domains is a strength, and the public release of the benchmark and dataset supports reproducibility. However, the significance is tempered by the absence of evidence that the tasks enforce synthesis rather than parallel evaluation of independent skills.

major comments (3)
  1. [Benchmark construction] Benchmark construction (likely §3 or §4): The manuscript states that the four task families 'collectively operationalize distinct cognitive facets' and enable evaluation of 'integrated scientific reasoning behaviors,' but provides no details on task validation, inter-annotator agreement, or any mechanism requiring models to carry state or synthesize outputs across families on the same paper. Without such linkage, the benchmark reduces to a multi-task suite, weakening the diagnostic advantage over existing isolated evaluations.
  2. [Experiments] Experiments and analysis (likely §5): Performance gaps are reported across tasks, but there is no cross-task correlation analysis, error propagation study, or ablation showing that success on one family predicts or depends on another. This leaves the 'integrated' claim unsupported by the empirical results.
  3. [Task families] Task design: The weakest assumption—that the four families operationalize interacting rather than parallel facets—is not tested. If task instances are constructed and scored independently (as implied by the lack of joint evaluation protocols), the benchmark does not enforce the agentic, multi-step reasoning asserted in the abstract.
minor comments (3)
  1. [Introduction] The abstract and introduction use 'agentic' and 'agent-oriented' without a precise definition or operationalization in the task families; clarify whether this refers to tool use, multi-turn interaction, or simply multi-task evaluation.
  2. [Dataset] Dataset statistics (size per domain, number of papers per task family, annotation process) are referenced but not quantified in the provided text; add a table with these counts and any quality-control metrics.
  3. [Availability] The GitHub link is given but the manuscript does not specify the exact data splits, evaluation scripts, or licensing; ensure these are documented for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of how the benchmark's integrated and agentic claims are supported. We address each major comment below with clarifications from the manuscript and indicate planned revisions to strengthen the presentation.

point-by-point responses
  1. Referee: [Benchmark construction] Benchmark construction (likely §3 or §4): The manuscript states that the four task families 'collectively operationalize distinct cognitive facets' and enable evaluation of 'integrated scientific reasoning behaviors,' but provides no details on task validation, inter-annotator agreement, or any mechanism requiring models to carry state or synthesize outputs across families on the same paper. Without such linkage, the benchmark reduces to a multi-task suite, weakening the diagnostic advantage over existing isolated evaluations.

    Authors: We agree that explicit details on validation and linkage would better substantiate the integrated claim. The four task families are constructed from the same underlying papers across the seven domains precisely to enable diagnostic comparisons of model behavior on complementary facets of the same documents. However, the current evaluation protocol scores each family independently to isolate specific abilities, without a required cross-family state carry-over or joint scoring mechanism. We did not report inter-annotator agreement because the benchmark was curated by domain experts rather than through multi-annotator crowdsourcing. In the revision we will add a dedicated subsection on benchmark construction that includes validation procedures and introduce an optional chained evaluation protocol in which model outputs from one family (e.g., multimodal grounding) are provided as context for a subsequent family (e.g., critical assessment) on the identical paper. revision: yes

  2. Referee: [Experiments] Experiments and analysis (likely §5): Performance gaps are reported across tasks, but there is no cross-task correlation analysis, error propagation study, or ablation showing that success on one family predicts or depends on another. This leaves the 'integrated' claim unsupported by the empirical results.

    Authors: The referee correctly notes the absence of cross-task analyses in the reported experiments. While the manuscript documents consistent performance gaps across the four families for both open- and closed-source models, it does not include correlation coefficients, error-propagation studies, or ablations linking success on one family to another. Such analyses would indeed provide stronger empirical grounding for the claim that the families capture interacting rather than merely parallel abilities. We will incorporate these in the revised experiments section, including Pearson correlations between family scores across all evaluated models (see the sketch after this list) and a qualitative breakdown of models that succeed on grounding yet fail on cross-source or critique tasks. revision: yes

  3. Referee: [Task families] Task design: The weakest assumption—that the four families operationalize interacting rather than parallel facets—is not tested. If task instances are constructed and scored independently (as implied by the lack of joint evaluation protocols), the benchmark does not enforce the agentic, multi-step reasoning asserted in the abstract.

    Authors: We acknowledge that the manuscript does not include an explicit test distinguishing interacting from parallel facets, nor does it present joint evaluation protocols that force models to perform multi-step, stateful reasoning across families within a single agentic trajectory. The abstract's reference to 'agent-oriented scientific reasoning' and 'integrated scientific reasoning behaviors' is supported conceptually by the complementary design of the families on shared papers, yet the empirical evaluation remains per-family. We will revise the task design section to articulate the intended interactions more precisely and add a discussion of agentic workflows, together with a small-scale pilot of chained tasks to illustrate how the benchmark can be extended to enforce multi-step synthesis. revision: partial
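
A minimal sketch of the cross-task correlation analysis proposed in the second response, assuming per-family scores have already been collected for each evaluated model. The score values below are placeholders, not numbers from the paper.

```python
# Placeholder family-level scores per evaluated model; real values would come
# from the benchmark's per-family evaluation runs.
from itertools import combinations
from statistics import correlation  # Python 3.10+; Pearson by default

scores = {
    "multimodal_grounding":            [0.71, 0.64, 0.52, 0.47],
    "experimental_interpretation":     [0.63, 0.58, 0.44, 0.41],
    "cross_source_evidence_reasoning": [0.55, 0.49, 0.33, 0.30],
    "critical_assessment":             [0.48, 0.45, 0.29, 0.27],
}

# Pairwise Pearson correlations between family scores across models. Note
# that high correlations alone would not prove interaction (a shared
# general-ability factor predicts them too), which is why the chained
# protocol in the first response is the stronger test.
for (fam_a, xs), (fam_b, ys) in combinations(scores.items(), 2):
    print(f"{fam_a} vs {fam_b}: r = {correlation(xs, ys):.2f}")
```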

Circularity Check

0 steps flagged

No circularity: benchmark definition is self-contained with no derivations or self-referential reductions

full rationale

The paper introduces PaperMind as a new benchmark constructed from real scientific papers across seven domains, comprising four task families (multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment) that are defined to operationalize cognitive facets. There are no equations, fitted parameters, predictions, or derivation chains in the provided text. The central claim that cross-task analysis enables diagnostic evaluation of integrated reasoning is a direct statement about the benchmark's design and evaluation protocol rather than a result derived from prior inputs or self-citations. All supporting elements reference external papers and model evaluations without load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that the chosen tasks reflect real integrated scientific reasoning; no free parameters or invented physical entities are introduced.

axioms (1)
  • domain assumption: The four task families collectively operationalize distinct cognitive facets of scientific paper reasoning that interact in practice.
    Invoked in the abstract as the justification for why the benchmark enables diagnostic evaluation of integrated behaviors.
invented entities (1)
  • PaperMind benchmark with four task families (no independent evidence)
    purpose: To evaluate integrated and agent-oriented scientific reasoning in multimodal LLMs
    Newly constructed in this work from real papers across seven domains.

pith-pipeline@v0.9.0 · 5545 in / 1304 out tokens · 40807 ms · 2026-05-09T21:06:07.339035+00:00 · methodology

