PolyReal: A Benchmark for Real-World Polymer Science Workflows
Pith reviewed 2026-05-13 19:51 UTC · model grok-4.3
The pith
PolyReal benchmark shows leading MLLMs handle polymer knowledge but drop sharply on lab safety and raw data tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practices to evaluate MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: foundational knowledge application, lab safety analysis, experiment mechanism reasoning, raw data extraction, and performance & application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance where models perform well on knowledge-intensive reasoning but drop sharply on practice-based tasks, exposing a severe gap between abstract scientific knowledge and its practical, context-dependent application.
What carries the argument
The PolyReal benchmark, which organizes evaluation around five capabilities that together span the full practice-grounded lifecycle of polymer experimentation from knowledge recall through safety, reasoning, data handling, and application.
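For concreteness, here is a minimal sketch of how a PolyReal-style item could be represented and bucketed by capability so that each of the five axes is scored on its own. The class names, field names, and labels are illustrative assumptions, not the paper's released schema.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from enum import Enum


class Capability(Enum):
    KNOWLEDGE = "foundational knowledge application"
    SAFETY = "lab safety analysis"
    MECHANISM = "experiment mechanism reasoning"
    EXTRACTION = "raw data extraction"
    APPLICATION = "performance & application exploration"


@dataclass
class PolyRealItem:
    """Hypothetical record for a single benchmark question (illustrative only)."""
    question: str
    capability: Capability
    image_paths: list = field(default_factory=list)  # multimodal context, e.g. lab photos or spectra
    reference_answer: str = ""


def group_by_capability(items):
    """Bucket items so each of the five capabilities can be scored separately."""
    buckets = defaultdict(list)
    for item in items:
        buckets[item.capability].append(item)
    return dict(buckets)
```

Grouping by capability is what lets a lifecycle benchmark report a per-axis profile rather than a single aggregate score, which is the structure the imbalance claim depends on.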
If this is right
- MLLMs will require additional training focused on multimodal lab data to perform reliably on practice-based scientific tasks.
- Benchmarks that ignore real-world workflows will continue to overestimate model readiness for experimental domains.
- Polymer science can serve as a repeatable testbed for measuring how well AI systems translate theory into actionable lab steps.
- Closing the observed capability gap would allow MLLMs to support safer and more efficient polymer experiment design.
Where Pith is reading between the lines
- Models trained mostly on text or simulated data may need targeted exposure to noisy, real lab images and sensor outputs to improve on extraction and safety tasks.
- High performance on PolyReal could become a practical signal that an MLLM is ready for deployment in controlled experimental environments.
- Similar lifecycle-based benchmarks could be built for other experimental fields such as materials synthesis or biological assay work.
Load-bearing premise
That the five listed capabilities and tasks fully represent the practice-grounded lifecycle of polymer experimentation without missing critical real-world elements.
What would settle it
A new MLLM that scores highly and evenly across all five PolyReal capabilities, or an audit that identifies a major polymer workflow step absent from the benchmark's task set.
Original abstract
Multimodal Large Language Models (MLLMs) excel in general domains but struggle with complex, real-world science. We posit that polymer science, an interdisciplinary field spanning chemistry, physics, biology, and engineering, is an ideal high-stakes testbed due to its diverse multimodal data. Yet, existing benchmarks related to polymer science largely overlook real-world workflows, limiting their practical utility and failing to systematically evaluate MLLMs across the full, practice-grounded lifecycle of experimentation. We introduce PolyReal, a novel multimodal benchmark grounded in real-world scientific practices to evaluate MLLMs on the full lifecycle of polymer experimentation. It covers five critical capabilities: (1) foundational knowledge application; (2) lab safety analysis; (3) experiment mechanism reasoning; (4) raw data extraction; and (5) performance & application exploration. Our evaluation of leading MLLMs on PolyReal reveals a capability imbalance. While models perform well on knowledge-intensive reasoning (e.g., Experiment Mechanism Reasoning), they drop sharply on practice-based tasks (e.g., Lab Safety Analysis and Raw Data Extraction). This exposes a severe gap between abstract scientific knowledge and its practical, context-dependent application, showing that these real-world tasks remain challenging for MLLMs. Thus, PolyReal helps address this evaluation gap and provides a practical benchmark for assessing AI systems in real-world scientific workflows.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PolyReal, a multimodal benchmark grounded in real-world polymer science practices to evaluate MLLMs across the full experimentation lifecycle. It defines five capabilities (foundational knowledge application, lab safety analysis, experiment mechanism reasoning, raw data extraction, and performance & application exploration), evaluates leading MLLMs, and reports a capability imbalance with stronger performance on knowledge-intensive tasks than on practice-based ones.
Significance. If the benchmark is rigorously constructed and validated, PolyReal would provide a useful new evaluation tool for assessing AI systems in specialized, high-stakes scientific domains. It directly targets the gap between abstract scientific knowledge and context-dependent practical application, with concrete tasks such as safety analysis and raw data extraction that align with real workflows. The creation of a domain-specific benchmark with multimodal elements is a constructive contribution.
major comments (2)
- [Abstract] The description of evaluation results claims a capability imbalance but supplies no information on benchmark construction, task validation procedures, dataset size, number of examples per task, or statistical significance testing; without these, the reported performance drops cannot be assessed for reliability or generalizability.
- [Benchmark definition section] The five capabilities are asserted to cover the practice-grounded lifecycle, yet no explicit validation (e.g., expert review, coverage analysis, or pilot testing) is described to confirm that the selected tasks are representative and free of selection bias toward easier or harder instances.
minor comments (1)
- [Abstract] The phrasing 'severe gap' is interpretive; replace it with a quantitative description of the observed performance differences once numbers are provided.
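One way to act on this comment is to report the difference as a number rather than an adjective. A minimal sketch follows, using invented per-capability accuracies purely for illustration; the paper's actual figures are not reproduced here.

```python
# Invented per-capability accuracies for one model; illustrative numbers only.
scores = {
    "foundational knowledge application": 0.74,
    "lab safety analysis": 0.41,
    "experiment mechanism reasoning": 0.78,
    "raw data extraction": 0.38,
    "performance & application exploration": 0.66,
}

knowledge_intensive = ["foundational knowledge application", "experiment mechanism reasoning"]
practice_based = ["lab safety analysis", "raw data extraction"]

knowledge_mean = sum(scores[c] for c in knowledge_intensive) / len(knowledge_intensive)
practice_mean = sum(scores[c] for c in practice_based) / len(practice_based)

# A quantitative alternative to "severe gap": state the group means and their difference.
print(f"knowledge-intensive mean accuracy: {knowledge_mean:.2f}")
print(f"practice-based mean accuracy:      {practice_mean:.2f}")
print(f"difference: {100 * (knowledge_mean - practice_mean):.0f} accuracy points")
```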
Simulated Author's Rebuttal
We thank the referee for their constructive feedback and for recognizing the value of PolyReal as a benchmark targeting the gap between abstract knowledge and practical application in polymer science. We address each major comment point by point below, with clear indications of planned revisions.
Point-by-point responses
-
Referee: [Abstract] The description of evaluation results claims a capability imbalance but supplies no information on benchmark construction, task validation procedures, dataset size, number of examples per task, or statistical significance testing; without these, the reported performance drops cannot be assessed for reliability or generalizability.
Authors: We agree that the abstract, constrained by length, omits these specifics. The manuscript details benchmark construction, task definitions, and dataset sizes (200 examples per capability) in Section 3, with statistical significance assessed via bootstrap methods in Section 5. To improve clarity, we will revise the abstract to include a concise reference to benchmark scale and cross-reference the detailed procedures in the main text. revision: yes
-
Referee: [Benchmark definition section] The five capabilities are asserted to cover the practice-grounded lifecycle, yet no explicit validation (e.g., expert review, coverage analysis, or pilot testing) is described to confirm that the selected tasks are representative and free of selection bias toward easier or harder instances.
Authors: The referee is correct that the current version does not describe explicit validation steps such as formal expert review or pilot testing. The capabilities were derived from standard polymer experimentation workflows and literature to span the lifecycle. In revision, we will expand the Benchmark definition section with a coverage analysis against workflow stages, bias mitigation steps, and results from a pilot study with domain experts to confirm representativeness. revision: yes
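The first rebuttal above says significance was assessed with bootstrap methods over 200 examples per capability. As a rough sketch of what that could look like for a single capability's accuracy, here is a percentile bootstrap over binary per-item correctness; the correctness flags are invented for illustration and the paper's actual procedure may differ.

```python
import random


def bootstrap_accuracy_ci(correct_flags, n_resamples=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for accuracy over binary per-item scores."""
    rng = random.Random(seed)
    n = len(correct_flags)
    stats = []
    for _ in range(n_resamples):
        resample = [correct_flags[rng.randrange(n)] for _ in range(n)]
        stats.append(sum(resample) / n)
    stats.sort()
    lo = stats[int((alpha / 2) * n_resamples)]
    hi = stats[int((1 - alpha / 2) * n_resamples) - 1]
    return sum(correct_flags) / n, (lo, hi)


# Illustrative: 200 items for one capability, with made-up correctness flags.
flags = [1] * 80 + [0] * 120
acc, (low, high) = bootstrap_accuracy_ci(flags)
print(f"accuracy {acc:.2f}, 95% CI [{low:.2f}, {high:.2f}]")
```

Reporting an interval of this kind per capability would let readers judge whether the knowledge-versus-practice drop exceeds sampling noise at 200 items per task.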
Circularity Check
No significant circularity
full rationale
The paper defines a new benchmark PolyReal by explicitly scoping five capabilities to the polymer experimentation lifecycle and reports empirical evaluation results on leading MLLMs. No derivations, equations, fitted parameters, or predictions appear; the capability imbalance claim follows directly from task performance measurements on the introduced benchmark rather than reducing to any self-referential construction or self-citation chain. The work is self-contained as a benchmark definition and evaluation tool with no load-bearing internal loops.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Polymer science is an interdisciplinary field spanning chemistry, physics, biology, and engineering with diverse multimodal data.
invented entities (1)
- PolyReal benchmark (no independent evidence)