LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?
Pith reviewed 2026-06-29 17:29 UTC · model grok-4.3
The pith
Large multimodal models lose substantial ground on high school exams when both the final answers and the reasoning process are scored.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that advanced LMMs degrade substantially under exam-realistic constraints. On a dynamic set of 2K+ verified questions drawn from the latest examination papers, GPT-5's score drops from 79 to 53 out of 100 when the Mock Exam scheme evaluates process rigor and efficiency jointly.
What carries the argument
The Mock Exam evaluation scheme that requires models to finish complete end-to-end exams autonomously while maintaining both accuracy and efficient reasoning paths, supported by an automated pipeline that continuously ingests fresh exam papers.
If this is right
- Models remain sensitive to complex visual layouts that appear in authentic exam papers.
- Static benchmarks overestimate readiness for real educational use.
- True educational readiness requires both correct answers and efficient reasoning processes.
- The benchmark can expand over time to track whether future models close the gap.
- Vulnerabilities in current models limit their deployment as autonomous tutors.
Where Pith is reading between the lines
- Training regimens that reward process efficiency separately from accuracy might reduce the observed degradation.
- Similar dynamic ingestion pipelines could be applied to other domains that suffer from data contamination.
- The gap between isolated-question performance and full-exam performance may appear in non-educational reasoning tasks as well.
- Periodic re-testing on newly ingested papers would provide ongoing evidence of progress or stagnation.
Load-bearing premise
The automated pipeline successfully ingests, parses, and verifies the latest examination papers without introducing selection bias or verification errors that would affect the reported performance gaps.
What would settle it
An advanced model that achieves nearly the same score under the full Mock Exam scheme with joint rigor and efficiency scoring as it does on isolated static questions would falsify the degradation claim.
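As a rough illustration of the test described above, the following sketch computes the absolute and relative drop between a static score and a Mock Exam score, and checks whether the gap is small enough to count as "nearly the same". The function names and the `tolerance_points` threshold are assumptions made for this example; the paper does not define a threshold.

```python
def score_drop(static_score: float, mock_score: float) -> tuple[float, float]:
    """Return (absolute drop in points, drop as a fraction of the static score)."""
    if static_score <= 0:
        raise ValueError("static_score must be positive")
    absolute = static_score - mock_score
    return absolute, absolute / static_score


def degradation_refuted(static_score: float, mock_score: float,
                        tolerance_points: float = 3.0) -> bool:
    """Hypothetical check: True when the Mock Exam score stays within
    tolerance_points of the static score. The 3-point default is an
    assumed value, not one taken from the paper."""
    absolute, _ = score_drop(static_score, mock_score)
    return absolute <= tolerance_points


# GPT-5 figures quoted in the abstract: 79 (static) versus 53 (Mock Exam).
absolute, relative = score_drop(79, 53)
print(f"absolute drop: {absolute} points, relative drop: {relative:.1%}")
print("degradation claim refuted:", degradation_refuted(79, 53))
```

With the reported GPT-5 figures the absolute drop is the 26 points cited in the referee report, which is about a third of the static score.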
Original abstract
Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and are often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) featuring an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) proposing a novel 'Mock Exam' evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LiveK12Bench, a dynamic, multi-disciplinary benchmark of 2K+ verified questions drawn from recent real high-school examinations in Mathematics, Physics, Chemistry, and Biology. It features an automated ingestion pipeline intended to prevent data leakage and a Mock Exam protocol that jointly scores answer accuracy, process rigor, and efficiency. Experiments on 12 LMMs report substantial degradation under these constraints, with GPT-5 falling from 79 to 53 out of 100; the work also highlights sensitivity to complex visual layouts and releases code and data publicly.
Significance. If the benchmark construction and scoring are reliable, the results would usefully document a gap between static benchmark performance and end-to-end exam competence, providing a reproducible, growing testbed that could steer LMM development toward educationally realistic capabilities. Public release of code and dataset strengthens the contribution by enabling direct replication and extension.
major comments (3)
- [Abstract and pipeline description] The automated pipeline that ingests, parses, and verifies examination papers is described only at a high level in the abstract and introduction; no error-rate statistics, human-audit protocol, or quantitative assessment of OCR/layout extraction failures are supplied. Because the headline GPT-5 degradation (79→53) and the claim of faithful transcription of complex diagrams rest on this pipeline, the absence of such verification metrics is load-bearing for the central empirical claim.
- [Mock Exam evaluation scheme] The Mock Exam scheme jointly penalizes process rigor and efficiency, yet the manuscript provides no explicit definition or weighting formula for these two components, nor any ablation showing how each contributes to the reported score drop. Without this, it is unclear whether the 26-point decline is driven by the intended factors or by unstated scoring choices.
- [Benchmark construction] The paper states that questions are 'verified' and that the benchmark 'grows over time,' but supplies no protocol for ongoing verification, no inter-annotator agreement figures, and no handling of selection bias when papers contain ambiguous or multi-part items. These omissions directly affect the reproducibility and longitudinal validity of the reported performance gaps.
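To make the request for error-rate statistics concrete, here is a minimal sketch of how a human audit of the pipeline could report a failure rate with an uncertainty range, using the standard Wilson score interval. The counts in the example (12 failures in 300 audited items) are hypothetical and are not figures from the paper.

```python
import math


def wilson_interval(failures: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 gives ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= failures <= n:
        raise ValueError("failures must lie between 0 and n")
    p = failures / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot at the edges.
    return max(0.0, center - half_width), min(1.0, center + half_width)


# Hypothetical audit: 12 extraction failures found in 300 audited items.
low, high = wilson_interval(failures=12, n=300)
print(f"observed failure rate: {12 / 300:.1%}, 95% interval: [{low:.1%}, {high:.1%}]")
```

The Wilson interval is used here rather than the simpler normal approximation because it behaves better when the failure rate is close to zero, which is the regime a well-functioning pipeline would be expected to be in.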
minor comments (3)
- [Results tables] Figure captions and table headers should explicitly state the exact scoring rubric (accuracy + rigor + efficiency) used for the Mock Exam column so readers can interpret the 79-to-53 drop without ambiguity.
- [Abstract] The abstract claims 'both code and dataset are publicly available,' yet the manuscript does not include a direct URL or repository DOI; this should be added for immediate accessibility.
- [Throughout] Minor typographical inconsistencies appear in the listing of the four disciplines (Mathematics vs. Math); standardize terminology throughout.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. The feedback highlights important areas where additional detail will strengthen the manuscript's clarity and reproducibility. We address each major comment below and commit to revisions that directly incorporate the requested information.
Point-by-point responses
- Referee: [Abstract and pipeline description] The automated pipeline that ingests, parses, and verifies examination papers is described only at a high level in the abstract and introduction; no error-rate statistics, human-audit protocol, or quantitative assessment of OCR/layout extraction failures are supplied. Because the headline GPT-5 degradation (79→53) and the claim of faithful transcription of complex diagrams rest on this pipeline, the absence of such verification metrics is load-bearing for the central empirical claim.
  Authors: We agree that the current description of the automated pipeline is insufficiently detailed for a central component of the benchmark. In the revised manuscript we will add a dedicated subsection (Section 3.2) that fully specifies the ingestion, parsing, and verification steps. This will include quantitative error-rate statistics from a human-audit protocol conducted on a stratified sample of 300 examination papers, reporting OCR accuracy, layout extraction failure rates, and diagram transcription fidelity separately for each discipline. The audit protocol (two independent annotators plus adjudication) will be described explicitly.
  Revision: yes
- Referee: [Mock Exam evaluation scheme] The Mock Exam scheme jointly penalizes process rigor and efficiency, yet the manuscript provides no explicit definition or weighting formula for these two components, nor any ablation showing how each contributes to the reported score drop. Without this, it is unclear whether the 26-point decline is driven by the intended factors or by unstated scoring choices.
  Authors: We acknowledge that the definitions and weighting of process rigor and efficiency were not stated with sufficient precision. The revised manuscript will include an explicit formal definition of each component, the precise weighting formula used to compute the final Mock Exam score, and a new ablation table that isolates the contribution of accuracy, rigor, and efficiency to the observed performance drops across models. This will allow readers to verify that the 26-point decline for GPT-5 is attributable to the intended factors.
  Revision: yes
- Referee: [Benchmark construction] The paper states that questions are 'verified' and that the benchmark 'grows over time,' but supplies no protocol for ongoing verification, no inter-annotator agreement figures, and no handling of selection bias when papers contain ambiguous or multi-part items. These omissions directly affect the reproducibility and longitudinal validity of the reported performance gaps.
  Authors: We agree that reproducibility and longitudinal validity require explicit protocols. The revision will add a new subsection (Section 4.3) describing the ongoing verification protocol, including inter-annotator agreement statistics (Cohen's kappa) computed on a held-out verification set, and the procedures used to handle ambiguous or multi-part items (e.g., exclusion criteria and documentation of selection decisions). These additions will directly address concerns about selection bias and future growth of the benchmark.
  Revision: yes
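For readers unfamiliar with the agreement statistic named in the last response, the sketch below computes Cohen's kappa for two annotators who each assign one label per item. The label values and the toy data are invented for illustration; the paper's actual annotation scheme is not described in the material reviewed here.

```python
from collections import Counter


def cohens_kappa(labels_a: list[str], labels_b: list[str]) -> float:
    """Cohen's kappa: chance-corrected agreement between two annotators."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items on which the annotators match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if both labeled independently at their own base rates.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    categories = freq_a.keys() | freq_b.keys()
    expected = sum(freq_a[c] * freq_b[c] for c in categories) / (n * n)
    if expected == 1.0:
        # Both annotators used one identical label throughout; kappa is
        # formally undefined, and this sketch reports full agreement.
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy verification set: each annotator marks a parsed question "ok" or "bad".
annotator_1 = ["ok", "ok", "bad", "ok"]
annotator_2 = ["ok", "bad", "bad", "ok"]
print(cohens_kappa(annotator_1, annotator_2))
```

In the toy data the annotators agree on three of four items (0.75 observed) while 0.5 agreement is expected by chance, so kappa comes out at 0.5. Raw percent agreement alone would overstate reliability when one label dominates.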
Circularity Check
No circularity: purely empirical benchmark with no derivations or self-referential reductions.
Full rationale
The paper introduces LiveK12Bench as a dynamic dataset and Mock Exam evaluation protocol. All reported results (e.g., GPT-5 score drop from 79 to 53) are direct empirical measurements on collected questions; no equations, fitted parameters, uniqueness theorems, or predictions are claimed that reduce to the paper's own inputs by construction. The automated pipeline is described as an engineering contribution but is not used to derive any quantitative result that loops back to itself. This is a standard empirical benchmark paper whose central claims rest on external model evaluations rather than internal definitional closure.
Axiom & Free-Parameter Ledger