PaperMind: Benchmarking Agentic Reasoning and Critique over Scientific Papers in Multimodal LLMs
Pith reviewed 2026-05-09 21:06 UTC · model grok-4.3
The pith
PaperMind is a benchmark that evaluates multimodal LLMs on integrated scientific reasoning over full research papers using four complementary tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PaperMind is constructed from real scientific papers across seven domains and comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PaperMind enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique.
What carries the argument
The PaperMind benchmark with its four task families that together test how models ground text and visuals, interpret experiments, synthesize across sources, and critique claims on real papers.
If this is right
- Models exhibit persistent challenges in integrated scientific reasoning and critique that single-task tests overlook.
- Performance gaps appear consistently across both open-source and closed-source multimodal LLMs.
- The benchmark supports diagnostic analysis by comparing results across the linked task families rather than in isolation.
- Evaluation becomes possible for how well models handle real papers drawn from agriculture, biology, chemistry, computer science, medicine, physics, and economics.
Where Pith is reading between the lines
- Future model training could target the connections between the four task types to improve agentic performance in scientific settings.
- The multi-task structure might apply to other document-heavy domains where reasoning steps must interact, such as legal or technical reports.
- If the tasks prove linked in practice, evaluation standards could move away from isolated question answering toward full-workflow simulations.
Load-bearing premise
That the four task families validly capture the interacting cognitive steps people use when reading and evaluating scientific papers.
What would settle it
A follow-up study in which models excel on each task separately yet fail to combine them successfully in an agent-style workflow over the same papers.
Original abstract
Understanding scientific papers requires more than answering isolated questions or summarizing content. It involves an integrated reasoning process that grounds textual and visual information, interprets experimental evidence, synthesizes information across sources, and critically evaluates scientific claims. However, existing benchmarks typically assess these abilities in isolation, making it difficult to evaluate scientific paper understanding as a unified set of interacting cognitive abilities. In this work, we introduce PaperMind, a benchmark designed to evaluate integrated and agent-oriented scientific reasoning over research papers. PaperMind is constructed from real scientific papers across seven domains, including agriculture, biology, chemistry, computer science, medicine, physics, and economics. It comprises four complementary task families that collectively operationalize distinct cognitive facets of scientific paper reasoning, including multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment. By analyzing model behavior across multiple tasks, PaperMind enables a diagnostic evaluation of integrated scientific reasoning behaviors that are difficult to assess through isolated task evaluations. Extensive experiments on both open-source and closed-source multimodal LLMs reveal consistent performance gaps across tasks, highlighting persistent challenges in integrated scientific reasoning and critique. Our benchmark and dataset are available at https://github.com/Yanjun-Zhao/PaperMind.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PaperMind, a benchmark built from real scientific papers across seven domains (agriculture, biology, chemistry, computer science, medicine, physics, economics). It defines four task families—multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment—intended to evaluate integrated, agentic scientific reasoning and critique in multimodal LLMs. The central claim is that analyzing model behavior across these families enables diagnostic evaluation of interacting cognitive abilities that isolated-task benchmarks cannot capture; experiments on open- and closed-source models reveal consistent performance gaps.
Significance. If the integration claim holds, PaperMind would provide a useful diagnostic tool for identifying weaknesses in multimodal LLMs' ability to synthesize evidence and critique claims across modalities and sources. The use of real papers from diverse domains is a strength, and the public release of the benchmark and dataset supports reproducibility. However, the significance is tempered by the absence of evidence that the tasks enforce synthesis rather than parallel evaluation of independent skills.
Major comments (3)
- [Benchmark construction] (likely §3 or §4): The manuscript states that the four task families 'collectively operationalize distinct cognitive facets' and enable evaluation of 'integrated scientific reasoning behaviors,' but provides no details on task validation, inter-annotator agreement, or any mechanism requiring models to carry state or synthesize outputs across families on the same paper. Without such linkage, the benchmark reduces to a multi-task suite, weakening the diagnostic advantage over existing isolated evaluations.
- [Experiments] Experiments and analysis (likely §5): Performance gaps are reported across tasks, but there is no cross-task correlation analysis, error propagation study, or ablation showing that success on one family predicts or depends on another. This leaves the 'integrated' claim unsupported by the empirical results.
- [Task families] Task design: The weakest assumption—that the four families operationalize interacting rather than parallel facets—is not tested. If task instances are constructed and scored independently (as implied by the lack of joint evaluation protocols), the benchmark does not enforce the agentic, multi-step reasoning asserted in the abstract.
Minor comments (3)
- [Introduction] The abstract and introduction use 'agentic' and 'agent-oriented' without a precise definition or operationalization in the task families; clarify whether this refers to tool use, multi-turn interaction, or simply multi-task evaluation.
- [Dataset] Dataset statistics (size per domain, number of papers per task family, annotation process) are referenced but not quantified in the provided text; add a table with these counts and any quality-control metrics.
- [Availability] The GitHub link is given but the manuscript does not specify the exact data splits, evaluation scripts, or licensing; ensure these are documented for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the detailed and constructive feedback. The comments highlight important aspects of how the benchmark's integrated and agentic claims are supported. We address each major comment below with clarifications from the manuscript and indicate planned revisions to strengthen the presentation.
Point-by-point responses
Referee: [Benchmark construction] (likely §3 or §4): The manuscript states that the four task families 'collectively operationalize distinct cognitive facets' and enable evaluation of 'integrated scientific reasoning behaviors,' but provides no details on task validation, inter-annotator agreement, or any mechanism requiring models to carry state or synthesize outputs across families on the same paper. Without such linkage, the benchmark reduces to a multi-task suite, weakening the diagnostic advantage over existing isolated evaluations.
Authors: We agree that explicit details on validation and linkage would better substantiate the integrated claim. The four task families are constructed from the same underlying papers across the seven domains precisely to enable diagnostic comparisons of model behavior on complementary facets of the same documents. However, the current evaluation protocol scores each family independently to isolate specific abilities, without a required cross-family state carry-over or joint scoring mechanism. We did not report inter-annotator agreement because the benchmark was curated by domain experts rather than through multi-annotator crowdsourcing. In the revision we will add a dedicated subsection on benchmark construction that includes validation procedures and introduce an optional chained evaluation protocol in which model outputs from one family (e.g., multimodal grounding) are provided as context for a subsequent family (e.g., critical assessment) on the identical paper. revision: yes
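The chained protocol the authors propose can be sketched in a few lines. This is a minimal illustration only, under stated assumptions: `ask_model` is a hypothetical stand-in for any multimodal LLM call, and the prompt wording is invented; only the four family names come from the paper.

```python
# Sketch of a chained evaluation: the answer from one task family is
# appended to the context for the next family on the same paper, so later
# tasks can depend on earlier outputs.

def ask_model(prompt: str) -> str:
    # Hypothetical placeholder; a real harness would call an LLM API here.
    return f"<model answer for prompt of length {len(prompt)}>"

def chained_eval(paper_text: str, families: list[str]) -> dict[str, str]:
    """Run task families in order, carrying earlier answers as context."""
    context = paper_text
    answers: dict[str, str] = {}
    for family in families:
        prompt = f"{context}\n\nTask ({family}): respond for this paper."
        answers[family] = ask_model(prompt)
        # Carry the answer forward so subsequent families see it.
        context += f"\n\n[{family} output]\n{answers[family]}"
    return answers

results = chained_eval(
    "Full text of one benchmark paper...",
    ["multimodal grounding", "experimental interpretation",
     "cross-source evidence reasoning", "critical assessment"],
)
```

Scoring each family's answer while conditioning it on the previous families' outputs is what would distinguish this from the independent per-family protocol the referee describes.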
Referee: [Experiments] Experiments and analysis (likely §5): Performance gaps are reported across tasks, but there is no cross-task correlation analysis, error propagation study, or ablation showing that success on one family predicts or depends on another. This leaves the 'integrated' claim unsupported by the empirical results.
Authors: The referee correctly notes the absence of cross-task analyses in the reported experiments. While the manuscript documents consistent performance gaps across the four families for both open- and closed-source models, it does not include correlation coefficients, error-propagation studies, or ablations linking success on one family to another. Such analyses would indeed provide stronger empirical grounding for the claim that the families capture interacting rather than merely parallel abilities. We will incorporate these in the revised experiments section, including Pearson correlations between family scores across all evaluated models and a qualitative breakdown of models that succeed on grounding yet fail on cross-source or critique tasks. revision: yes
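The promised cross-family correlation analysis is straightforward to set up. The sketch below uses invented illustrative scores, not numbers from the paper: each list position is one hypothetical evaluated model, scored on two task families.

```python
from statistics import mean

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length score lists."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

# Hypothetical per-model accuracies on two families (illustrative only).
grounding = [0.71, 0.58, 0.64, 0.49, 0.80]
critique  = [0.52, 0.40, 0.47, 0.35, 0.61]

r = pearson(grounding, critique)  # high r: families rank models similarly
```

A high correlation alone would show only that the families rank models similarly; supporting the "interacting abilities" claim would additionally need the error-propagation or chained analyses the authors mention.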
Referee: [Task families] Task design: The weakest assumption—that the four families operationalize interacting rather than parallel facets—is not tested. If task instances are constructed and scored independently (as implied by the lack of joint evaluation protocols), the benchmark does not enforce the agentic, multi-step reasoning asserted in the abstract.
Authors: We acknowledge that the manuscript does not include an explicit test distinguishing interacting from parallel facets, nor does it present joint evaluation protocols that force models to perform multi-step, stateful reasoning across families within a single agentic trajectory. The abstract's reference to 'agent-oriented scientific reasoning' and 'integrated scientific reasoning behaviors' is supported conceptually by the complementary design of the families on shared papers, yet the empirical evaluation remains per-family. We will revise the task design section to articulate the intended interactions more precisely and add a discussion of agentic workflows, together with a small-scale pilot of chained tasks to illustrate how the benchmark can be extended to enforce multi-step synthesis. revision: partial
Circularity Check
No circularity: benchmark definition is self-contained with no derivations or self-referential reductions
Full rationale
The paper introduces PaperMind as a new benchmark constructed from real scientific papers across seven domains, comprising four task families (multimodal grounding, experimental interpretation, cross-source evidence reasoning, and critical assessment) that are defined to operationalize cognitive facets. There are no equations, fitted parameters, predictions, or derivation chains in the provided text. The central claim that cross-task analysis enables diagnostic evaluation of integrated reasoning is a direct statement about the benchmark's design and evaluation protocol rather than a result derived from prior inputs or self-citations. All supporting elements reference external papers and model evaluations without load-bearing self-referential steps.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: the four task families collectively operationalize distinct cognitive facets of scientific paper reasoning that interact in practice.
Invented entities (1)
- PaperMind benchmark with four task families (no independent evidence)