LiveK12Bench: Have Large Multimodal Models Truly Conquered High School-level Examinations?

Dian Li; Gang Liu; Mingze Yin; Xiaohan Wang; Yilin Zhao

arxiv: 2605.26781 · v1 · pith:FX43HMIZ · submitted 2026-05-26 · 💻 cs.AI · cs.MM

Xiaohan Wang, Mingze Yin, Yilin Zhao, Gang Liu, Dian Li

Pith reviewed 2026-06-29 17:29 UTC · model grok-4.3

keywords: LiveK12Bench · large multimodal models · K-12 examinations · mock exam evaluation · reasoning process · data contamination · multimodal reasoning · educational AI

The pith

Large multimodal models lose substantial ground on high school exams when both final answers and reasoning process are scored.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces LiveK12Bench, a growing collection of over 2000 verified questions drawn from recent real exam papers in mathematics, physics, chemistry, and biology. It tests whether models can complete entire exams autonomously while producing accurate and efficient reasoning paths rather than isolated answers. Experiments show clear drops in performance once these constraints are applied, including GPT-5's score falling from 79 to 53 out of 100. A sympathetic reader would care because the result questions whether current models can function as reliable tutors inside actual testing environments.

Core claim

The paper claims that advanced LMMs suffer substantial performance degradation under exam-realistic constraints, with GPT-5's score dropping from 79 to 53 out of 100 when process rigor and efficiency are jointly evaluated through a Mock Exam scheme on a dynamic set of 2K+ verified questions from the latest examination papers.

What carries the argument

The Mock Exam evaluation scheme that requires models to finish complete end-to-end exams autonomously while maintaining both accuracy and efficient reasoning paths, supported by an automated pipeline that continuously ingests fresh exam papers.

If this is right

Models remain sensitive to complex visual layouts that appear in authentic exam papers.
Static benchmarks overestimate readiness for real educational use.
True educational readiness requires both correct answers and efficient reasoning processes.
The benchmark can expand over time to track whether future models close the gap.
Vulnerabilities in current models limit their deployment as autonomous tutors.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training regimens that reward process efficiency separately from accuracy might reduce the observed degradation.
Similar dynamic ingestion pipelines could be applied to other domains that suffer from data contamination.
The gap between isolated-question performance and full-exam performance may appear in non-educational reasoning tasks as well.
Periodic re-testing on newly ingested papers would provide ongoing evidence of progress or stagnation.

Load-bearing premise

The automated pipeline successfully ingests, parses, and verifies the latest examination papers without introducing selection bias or verification errors that would affect the reported performance gaps.

What would settle it

An advanced model that achieves nearly the same score under the full Mock Exam scheme with joint rigor and efficiency scoring as it does on isolated static questions would falsify the degradation claim.

Original abstract

Advanced Large Multimodal Models (LMMs) have demonstrated impressive performance in K-12 reasoning tasks, exhibiting great promise as intelligent tutors. Realizing this potential requires models to navigate real-world examinations effectively, yet most existing benchmarks fail to capture the complexity of authentic testing environments. Specifically, most datasets are static, prone to data contamination, and are often confined to restricted modalities, disciplines, and evaluation criteria. To address these issues, we introduce LiveK12Bench, a dynamic, holistic, multi-disciplinary benchmark designed to evaluate the reasoning abilities of LMMs in realistic examination scenarios. LiveK12Bench comprises 2K+ verified questions spanning Mathematics, Physics, Chemistry, and Biology, sourced from the latest real-world exam papers and designed to grow over time. Our framework features several core innovations: 1) featuring an automated pipeline that continuously ingests and parses the latest examination papers to mitigate data leakage; and 2) proposing a novel 'Mock Exam' evaluation scheme, which assesses the ability to complete end-to-end exams autonomously with accurate and efficient reasoning paths. Extensive experiments on 12 LMMs reveal that advanced models suffer substantial performance degradation under exam-realistic constraints: GPT-5's score drops from 79 to 53 (out of 100) when process rigor and efficiency are jointly evaluated. Our findings expose critical vulnerabilities, such as sensitivity to complex visual layouts, highlighting the gap between idealized reasoning capabilities and true educational readiness. Both code and dataset are publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

LiveK12Bench brings a dynamic real-exam pipeline and mock scheme that flags LMM gaps, but the lack of pipeline error stats undercuts the GPT-5 drop claim.

The core takeaway is that this paper builds LiveK12Bench from fresh high school papers in math, physics, chemistry, and biology, using an automated ingestion pipeline plus a mock exam format that scores not only final answers but also reasoning process and efficiency. The authors report GPT-5 falling from 79 to 53 out of 100 under those rules.

The dynamic pipeline and the end-to-end mock exam approach are the actual novelties relative to static K12 sets. Pulling latest papers reduces contamination risk, the four-subject coverage is practical, and releasing code plus data helps others check or extend it. Those choices line up with real needs in education-focused multimodal work.

The soft spot sits in the pipeline validation. The abstract calls the questions verified, yet supplies no error rates, audit counts, or handling details for diagrams and layouts—the same elements the paper says trip up the models. Without those numbers, parsing mistakes could widen the measured gap between idealized and exam-realistic scores. The degradation result therefore rests on an unquantified assumption.

The work is aimed at groups testing LMMs for tutoring or assessment tasks. Readers who care about benchmark realism will find the setup useful even if the current numbers need tighter support.

Send it to peer review. The idea and the public artifacts are worth referee time; the methods section just needs the missing verification data to stand up.

Referee Report

3 major / 3 minor

Summary. The paper introduces LiveK12Bench, a dynamic, multi-disciplinary benchmark of 2K+ verified questions drawn from recent real high-school examinations in Mathematics, Physics, Chemistry, and Biology. It features an automated ingestion pipeline intended to prevent data leakage and a Mock Exam protocol that jointly scores answer accuracy, process rigor, and efficiency. Experiments on 12 LMMs report substantial degradation under these constraints, with GPT-5 falling from 79 to 53 out of 100; the work also highlights sensitivity to complex visual layouts and releases code and data publicly.

Significance. If the benchmark construction and scoring are reliable, the results would usefully document a gap between static benchmark performance and end-to-end exam competence, providing a reproducible, growing testbed that could steer LMM development toward educationally realistic capabilities. Public release of code and dataset strengthens the contribution by enabling direct replication and extension.

major comments (3)

[Abstract and pipeline description] The automated pipeline that ingests, parses, and verifies examination papers is described only at a high level in the abstract and introduction; no error-rate statistics, human-audit protocol, or quantitative assessment of OCR/layout extraction failures are supplied. Because the headline GPT-5 degradation (79→53) and the claim of faithful transcription of complex diagrams rest on this pipeline, the absence of such verification metrics is load-bearing for the central empirical claim.
[Mock Exam evaluation scheme] The Mock Exam scheme jointly penalizes process rigor and efficiency, yet the manuscript provides no explicit definition or weighting formula for these two components, nor any ablation showing how each contributes to the reported score drop. Without this, it is unclear whether the 26-point decline is driven by the intended factors or by unstated scoring choices.
[Benchmark construction] The paper states that questions are 'verified' and that the benchmark 'grows over time,' but supplies no protocol for ongoing verification, no inter-annotator agreement figures, and no handling of selection bias when papers contain ambiguous or multi-part items. These omissions directly affect the reproducibility and longitudinal validity of the reported performance gaps.
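
The second major comment asks for an explicit weighting formula and a per-component ablation. As a purely illustrative sketch of what such a specification could look like, the Python below assumes a weighted mean of three normalised components rescaled to 0-100. The class `Components`, the functions `composite_score` and `ablation`, the weights, and the example values are all hypothetical; nothing here is taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Components:
    """Per-exam component scores, each normalised to [0, 1] (hypothetical)."""
    accuracy: float    # share of marks earned for final answers
    rigor: float       # share of reasoning steps judged sound
    efficiency: float  # 1.0 = no wasted reasoning, 0.0 = maximally wasteful


# Hypothetical weights; the paper's actual weighting is not given in the abstract.
DEFAULT_WEIGHTS = {"accuracy": 0.5, "rigor": 0.25, "efficiency": 0.25}


def composite_score(c: Components, weights: dict[str, float] = DEFAULT_WEIGHTS) -> float:
    """Weighted mean of the components, rescaled to a 0-100 exam score."""
    total = sum(weights.values())
    if total <= 0:
        raise ValueError("at least one weight must be positive")
    weighted = sum(w * getattr(c, name) for name, w in weights.items())
    return 100.0 * weighted / total


def ablation(c: Components, weights: dict[str, float] = DEFAULT_WEIGHTS) -> dict[str, float]:
    """Score change when each component is dropped (its weight set to zero)."""
    full = composite_score(c, weights)
    deltas = {}
    for name in weights:
        reduced = {k: (0.0 if k == name else w) for k, w in weights.items()}
        deltas[name] = composite_score(c, reduced) - full
    return deltas
```

Under this assumed formula, a positive ablation delta for a component means that dropping it raises the score, so that component is pulling the composite down. An ablation table of this kind is what would let a reader attribute a drop such as 79 to 53 to rigor, to efficiency, or to both.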

minor comments (3)

[Results tables] Figure captions and table headers should explicitly state the exact scoring rubric (accuracy + rigor + efficiency) used for the Mock Exam column so readers can interpret the 79-to-53 drop without ambiguity.
[Abstract] The abstract claims 'both code and dataset are publicly available,' yet the manuscript does not include a direct URL or repository DOI; this should be added for immediate accessibility.
[Throughout] Minor typographical inconsistencies appear in the listing of the four disciplines (Mathematics vs. Math); standardize terminology throughout.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. The feedback highlights important areas where additional detail will strengthen the manuscript's clarity and reproducibility. We address each major comment below and commit to revisions that directly incorporate the requested information.

Referee: [Abstract and pipeline description] The automated pipeline that ingests, parses, and verifies examination papers is described only at a high level in the abstract and introduction; no error-rate statistics, human-audit protocol, or quantitative assessment of OCR/layout extraction failures are supplied. Because the headline GPT-5 degradation (79→53) and the claim of faithful transcription of complex diagrams rest on this pipeline, the absence of such verification metrics is load-bearing for the central empirical claim.

Authors: We agree that the current description of the automated pipeline is insufficiently detailed for a central component of the benchmark. In the revised manuscript we will add a dedicated subsection (Section 3.2) that fully specifies the ingestion, parsing, and verification steps. This will include quantitative error-rate statistics from a human-audit protocol conducted on a stratified sample of 300 examination papers, reporting OCR accuracy, layout extraction failure rates, and diagram transcription fidelity separately for each discipline. The audit protocol (two independent annotators plus adjudication) will be described explicitly.
Revision: yes
Referee: [Mock Exam evaluation scheme] The Mock Exam scheme jointly penalizes process rigor and efficiency, yet the manuscript provides no explicit definition or weighting formula for these two components, nor any ablation showing how each contributes to the reported score drop. Without this, it is unclear whether the 26-point decline is driven by the intended factors or by unstated scoring choices.

Authors: We acknowledge that the definitions and weighting of process rigor and efficiency were not stated with sufficient precision. The revised manuscript will include an explicit formal definition of each component, the precise weighting formula used to compute the final Mock Exam score, and a new ablation table that isolates the contribution of accuracy, rigor, and efficiency to the observed performance drops across models. This will allow readers to verify that the 26-point decline for GPT-5 is attributable to the intended factors.
Revision: yes
Referee: [Benchmark construction] The paper states that questions are 'verified' and that the benchmark 'grows over time,' but supplies no protocol for ongoing verification, no inter-annotator agreement figures, and no handling of selection bias when papers contain ambiguous or multi-part items. These omissions directly affect the reproducibility and longitudinal validity of the reported performance gaps.

Authors: We agree that reproducibility and longitudinal validity require explicit protocols. The revision will add a new subsection (Section 4.3) describing the ongoing verification protocol, including inter-annotator agreement statistics (Cohen's kappa) computed on a held-out verification set, and the procedures used to handle ambiguous or multi-part items (e.g., exclusion criteria and documentation of selection decisions). These additions will directly address concerns about selection bias and future growth of the benchmark.
Revision: yes
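
The inter-annotator statistic named in the last response, Cohen's kappa, is defined as kappa = (p_o - p_e) / (1 - p_e), where p_o is the observed agreement rate and p_e is the agreement expected by chance from each annotator's label frequencies. The sketch below computes it for two annotators. The function name `cohens_kappa` and the example labels are illustrative assumptions, not the authors' verification code.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or len(labels_a) == 0:
        raise ValueError("both annotators must label the same non-empty item set")
    n = len(labels_a)

    # p_o: fraction of items on which the two annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n

    # p_e: chance agreement, from the product of per-label marginal frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    categories = count_a.keys() | count_b.keys()
    expected = sum(count_a[k] * count_b[k] for k in categories) / (n * n)

    if expected == 1.0:
        # Both annotators used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)
```

For example, with annotator A giving `["ok", "ok", "bad", "bad"]` and annotator B giving `["ok", "bad", "bad", "bad"]`, observed agreement is 0.75 and chance agreement is 0.5, so kappa is 0.5.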

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with no derivations or self-referential reductions.

The paper introduces LiveK12Bench as a dynamic dataset and Mock Exam evaluation protocol. All reported results (e.g., GPT-5 score drop from 79 to 53) are direct empirical measurements on collected questions; no equations, fitted parameters, uniqueness theorems, or predictions are claimed that reduce to the paper's own inputs by construction. The automated pipeline is described as an engineering contribution but is not used to derive any quantitative result that loops back to itself. This is a standard empirical benchmark paper whose central claims rest on external model evaluations rather than internal definitional closure.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Benchmark construction relies on standard assumptions in AI evaluation such as question validity and model access; no free parameters, axioms beyond standard math, or invented entities are introduced in the abstract.

Reference graph

Works this paper leans on

50 extracted references · 17 canonical work pages · 10 internal anchors

[1] Dan Hendrycks, Collin Burns, Saurav Kadavath, et al. Measuring mathematical problem solving with the MATH dataset. In NeurIPS, 2021.
[2] Yifan Zhang and Team Math-AI. American Invitational Mathematics Examination (AIME) 2025, 2025.
[3] Yixin Ye, Yang Xiao, Tiantian Mi, and Pengfei Liu. AIME-Preview: A rigorous and immediate evaluation framework for advanced mathematical reasoning. https://github.com/GAIR-NLP/AIME-Preview, 2025. GitHub repository.
[4] Wenxuan Zhang, Mahani Aljunied, Chang Gao, et al. M3Exam: A multilingual, multimodal, multilevel benchmark for examining large language models. In NeurIPS, 2023.
[5] Xinwu Ye, Chengfan Li, Siming Chen, Wei Wei, and Robert Tang. MMSciBench: Benchmarking language models on Chinese multimodal scientific problems. In Findings of the Association for Computational Linguistics: ACL 2025, pages 14621–14663, 2025.
[6] Rocktim Das, Simeon Hristov, Haonan Li, Dimitar Dimitrov, Ivan Koychev, and Preslav Nakov. EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7768–7791, 2024.
[7] Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, et al. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. arXiv preprint arXiv:2310.18018, 2023.
[8] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024.
[9] Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, et al. We-Math 2.0: A versatile mathbook system for incentivizing visual mathematical reasoning. arXiv preprint arXiv:2508.10433, 2025.
[10] Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024.
[11] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-VL Technical Report. arXiv, 2025.
[12] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023.
[13] Yi Zong and Xipeng Qiu. GAOKAO-MM: A Chinese human-level benchmark for multimodal models evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8817–8825, 2024.
[14] Chong Li, Chenglin Zhu, Tao Zhang, Mingan Lin, Zenan Zhou, and Jian Xie. K12Vista: Exploring the boundaries of MLLMs in K-12 education. arXiv preprint arXiv:2506.01676, 2025.
[15] Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, et al. MDK12-Bench: A multi-discipline benchmark for evaluating reasoning in multimodal large language models. arXiv preprint arXiv:2504.05782, 2025.
[16] OpenAI. Introducing GPT-5, August 2025.
[17] OpenAI. GPT-5-mini, August 2025.
[18] Google DeepMind. Gemini 3 Pro, November 2025.
[19] Google DeepMind. Gemini 3 Flash, November 2025.
[20] Anthropic. Claude 4.6 Opus, February 2026.
[21] Anthropic. Claude 4.6 Sonnet, February 2026.
[22] Josh Achiam, Steven Adler, Sandhini Agarwal, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.
[23] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.
[24] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.
[25] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.
[26] Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. GRASP: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. arXiv preprint arXiv:2311.09048, 2023.
[27] Chengfan Li, Xinwu Ye, Siming Chen, Wei Wei, and Xiangru Tang. MMSciBench: Benchmarking language models on Chinese multimodal scientific problems. In Findings of ACL, 2025.
[28] Michael Evans, Dominik Soós, Ethan Landers, and Jian Wu. MSVEC: A multidomain testing dataset for scientific claim verification. In Proceedings of the Twenty-fourth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 504–509, 2023.
[29] Tim Tarsi, Heike Adel, Jan Hendrik Metzen, Dan Zhang, Matteo Finco, and Annemarie Friedrich. SciOL and MuLMS-Img: Introducing a large-scale multimodal scientific dataset and models for image-text tasks in the scientific domain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4560–4571, 2024.
[30] Xiaotian Zhang, Chunyang Li, Yi Zong, et al. Evaluating the performance of large language models on GAOKAO benchmark. arXiv preprint arXiv:2305.12474, 2023.
[31] Chaoqun He, Renjie Luo, Yuzhuo Bai, et al. OlympiadBench: A challenging benchmark for promoting AGI with olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.
[32] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, 2023.
[33] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.
[34] Pan Lu, Hritik Bansal, Tony Xia, et al. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024.
[35] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with MATH-Vision dataset. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.
[36] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186, 2024.
[37] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma GongQue, Shanglin Lei, Yifan Zhang, et al. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...
[38] Liu et al. A survey on efficient reasoning for large language models, 2025. URL https://arxiv.org/abs/2504.10903
[39] Luo et al. O1-Pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning, 2025. URL https://arxiv.org/abs/2501.12570
A. Supplementary Results. Tables 5, 6, and 7 supplement more complete results of LMMs on three challenging subsets. Overall, the relative perfo...
[40] Question types include fill-in-the-blank and open-ended questions. Some questions may contain multiple sub-questions, such as fill-in-the-blank questions with multiple blanks or open-ended questions with multiple parts. You must judge each sub-question independently.
[41] Answers may be expressed in different forms, for example as a mathematical expression or a textual description. As long as the semantic meaning is equivalent, the answer is considered correct. Equivalent formulas expressed in different notations are also accepted. If equivalence cannot be determined, mark the student's answer as incorrect.
[42] You do not need to re-derive the answer, as the standard answer is already provided. Simply compare the student's answer with the standard answer based on the question format to determine correctness.
[43] For questions with definitive results, compare the student's final answer (typically enclosed in \boxed{}) with the standard answer; the solution process need not be evaluated. For questions without definitive results (e.g., proofs), evaluate whether the solution approach is correct (a sound reasoning approach suffices). Based on the above criteria, output...
[44] Do not modify any document content (including text content and LaTeX formula content); only extract and organize the corresponding information.
[45] Ensure all questions in the document are parsed without omission.
[46] Output only a result that can be directly parsed as valid JSON without errors.
[47] If there is no more content to extract, output only [DONE]. Target Schema (target_items): question_types = ["multiple-choice", "fill-in-the-blank", "open-ended"]. type: A string value. Select one from question_types based on the section header and question content. Multiple-choice questions typically contain options A/B/C/D; fill-in-the-blank questions featur...
[48] In the normalized case $l = 1$, this gives $r = \frac{1}{4}$. Since the distance from the center to the chord subtending angle $2$ equals (outer radius) $\cdot \cos 1$, we obtain $OF = (2r)\cos 1 = \frac{1}{2}\cos 1$, and hence the distance from the point $E$ on the outer circle along $OE$ to the chord $BC$ is $EF = (2r) - OF = \frac{1}{2} - \frac{1}{2}\cos 1 = \frac{1}{2}(1 - \cos 1)$. (2) From (1) we have $r = \frac{2l}{3\alpha + 2}$. The area of the annula...
[49] Therefore, $S_{\text{annulus}}$ attains its maximum value $\frac{l^2}{4}$ precisely when $\alpha = \frac{2}{3}$.
[50] Table 13 | Overall Exam Scores on the 2025-06 split of LiveK12Bench. Subscripts report the score increase relative to the 2026-03 split (Table 3). The consistently positive gap $\Delta$ supports the hypothesis that earlier exams are more susceptible to contamination, and that ingesting newer papers reduces this risk. Table columns: Model, Math, Physics, Chemistry, Biology ...
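
The grading instructions extracted in entries [40] to [43] describe judging each sub-question independently by comparing the student's final answer, typically enclosed in \boxed{}, with the standard answer. The sketch below illustrates only the mechanical part of that rule, under stated assumptions: `extract_boxed`, `normalise`, and `grade_sub_questions` are hypothetical names, and the whitespace-insensitive exact match is a stand-in for the semantic-equivalence judgment that the prompt delegates to a judge model.

```python
from typing import Optional, Sequence


def extract_boxed(text: str) -> Optional[str]:
    """Return the contents of the last \\boxed{...} in text, or None if absent."""
    marker = "\\boxed{"
    start = text.rfind(marker)
    if start == -1:
        return None
    depth = 1          # brace depth, counting the opening brace of \boxed{
    chars = []
    for ch in text[start + len(marker):]:
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return "".join(chars)
        chars.append(ch)
    return None        # braces never closed


def normalise(answer: str) -> str:
    """Drop all whitespace so that '3 / 16' and '3/16' compare equal."""
    return "".join(answer.split())


def grade_sub_questions(solutions: Sequence[str], standards: Sequence[str]) -> list[bool]:
    """Judge each sub-question independently; an undeterminable answer is incorrect."""
    if len(solutions) != len(standards):
        raise ValueError("one solution is required per standard answer")
    results = []
    for solution, standard in zip(solutions, standards):
        boxed = extract_boxed(solution)
        results.append(boxed is not None and normalise(boxed) == normalise(standard))
    return results
```

Marking a missing or unparseable boxed answer as incorrect follows the fallback in entry [41], which says an answer whose equivalence cannot be determined is marked incorrect.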

[1] [1]

Measuring mathematical problem solving with the math dataset

Dan Hendrycks, Collin Burns, Saurav Kadavath, et al. Measuring mathematical problem solving with the math dataset. InNeurIPS, 2021

2021

[2] [2]

American invitational mathematics examination (aime) 2025, 2025

Yifan Zhang and Team Math-AI. American invitational mathematics examination (aime) 2025, 2025

2025

[3] [3]

Aime-preview: A rigorous and immediate evaluation framework for advanced mathematical reasoning.https://github.com/GAIR-NLP /AIME-Preview, 2025

Yixin Ye, Yang Xiao, Tiantian Mi, and Pengfei Liu. Aime-preview: A rigorous and immediate evaluation framework for advanced mathematical reasoning.https://github.com/GAIR-NLP /AIME-Preview, 2025. GitHub repository

2025

[4] [4]

M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models

Wenxuan Zhang, Mahani Aljunied, Chang Gao, et al. M3exam: A multilingual, multimodal, multilevel benchmark for examining large language models. InNeurIPS, 2023

2023

[5] [5]

Mmscibench: Benchmarking language models on chinese multimodal scientific problems

Xinwu Ye, Chengfan Li, Siming Chen, Wei Wei, and Robert Tang. Mmscibench: Benchmarking language models on chinese multimodal scientific problems. InFindings of the Association for Computational Linguistics: ACL 2025, pages 14621–14663, 2025

2025

[6] Rocktim Das, Simeon Hristov, Haonan Li, Dimitar Dimitrov, Ivan Koychev, and Preslav Nakov. EXAMS-V: A multi-discipline multilingual multimodal exam benchmark for evaluating vision language models. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 7768–7791, 2024.

[7] Oscar Sainz, Jon Ander Campos, Iker García-Ferrero, et al. NLP evaluation in trouble: On the need to measure LLM data contamination for each benchmark. arXiv preprint arXiv:2310.18018, 2023.

[8] Pan Lu, Hritik Bansal, Tony Xia, Jiacheng Liu, Chunyuan Li, Hannaneh Hajishirzi, Hao Cheng, Kai-Wei Chang, Michel Galley, and Jianfeng Gao. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In The Twelfth International Conference on Learning Representations, 2024.

[9] Runqi Qiao, Qiuna Tan, Peiqing Yang, Yanzi Wang, Xiaowan Wang, Enhui Wan, Sitong Zhou, Guanting Dong, Yuchen Zeng, Yida Xu, et al. We-Math 2.0: A versatile MathBook system for incentivizing visual mathematical reasoning. arXiv preprint arXiv:2508.10433, 2025.

[10] Bin Wang, Chao Xu, Xiaomeng Zhao, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Rui Xu, Kaiwen Liu, Yuan Qu, Fukai Shang, et al. MinerU: An open-source solution for precise document content extraction. arXiv preprint arXiv:2409.18839, 2024.

[11] Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, Wenbin Ge, Zhifang Guo, Qidong Huang, Jie Huang, Fei Huang, Binyuan Hui, Shutong Jiang, Zhaohai Li, Mingsheng Li, Mei Li, Kaixin Li, Zicheng Lin, Junyang Lin, Xuejing Liu, Jiawei Liu, Chenglong Liu, Yang Liu, Dayiheng Liu, Shixuan ... Qwen3-VL Technical Report, 2025.

[12] Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. SciBench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023.

[13] Yi Zong and Xipeng Qiu. GAOKAO-MM: A Chinese human-level benchmark for multimodal models evaluation. In Findings of the Association for Computational Linguistics: ACL 2024, pages 8817–8825, 2024.

[14] Chong Li, Chenglin Zhu, Tao Zhang, Mingan Lin, Zenan Zhou, and Jian Xie. K12Vista: Exploring the boundaries of MLLMs in K-12 education. arXiv preprint arXiv:2506.01676, 2025.

[15] Pengfei Zhou, Fanrui Zhang, Xiaopeng Peng, Zhaopan Xu, Jiaxin Ai, Yansheng Qiu, Chuanhao Li, Zhen Li, Ming Li, Yukang Feng, et al. MDK12-Bench: A multi-discipline benchmark for evaluating reasoning in multimodal large language models. arXiv preprint arXiv:2504.05782, 2025.

[16] OpenAI. Introducing GPT-5, August 2025.

[17] OpenAI. GPT-5-mini, August 2025.

[18] Google DeepMind. Gemini 3 Pro, November 2025.

[19] Google DeepMind. Gemini 3 Flash, November 2025.

[20] Anthropic. Claude 4.6 Opus, February 2026.

[21] Anthropic. Claude 4.6 Sonnet, February 2026.

[22] Josh Achiam, Steven Adler, Sandhini Agarwal, et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774, 2023.

[23] Aohan Zeng, Xin Lv, Zhenyu Hou, Zhengxiao Du, Qinkai Zheng, Bin Chen, Da Yin, Chendi Ge, Chengxing Xie, Cunxiang Wang, et al. GLM-5: From vibe coding to agentic engineering. arXiv preprint arXiv:2602.15763, 2026.

[24] Kimi Team, Tongtong Bai, Yifan Bai, Yiping Bao, SH Cai, Yuan Cao, Y Charles, HS Che, Cheng Chen, Guanduo Chen, et al. Kimi K2.5: Visual agentic intelligence. arXiv preprint arXiv:2602.02276, 2026.

[25] DeepSeek-AI. DeepSeek-V3 technical report. arXiv preprint arXiv:2412.19437, 2024.

[26] Serwan Jassim, Mario Holubar, Annika Richter, Cornelius Wolff, Xenia Ohmer, and Elia Bruni. GRASP: A novel benchmark for evaluating language grounding and situated physics understanding in multimodal language models. arXiv preprint arXiv:2311.09048, 2023.

[27] Chengfan Li, Xinwu Ye, Siming Chen, Wei Wei, and Xiangru Tang. MMSciBench: Benchmarking language models on Chinese multimodal scientific problems. In Findings of ACL, 2025.

[28] Michael Evans, Dominik Soós, Ethan Landers, and Jian Wu. MSVEC: A multidomain testing dataset for scientific claim verification. In Proceedings of the Twenty-fourth International Symposium on Theory, Algorithmic Foundations, and Protocol Design for Mobile Networks and Mobile Computing, pages 504–509, 2023.

[29] Tim Tarsi, Heike Adel, Jan Hendrik Metzen, Dan Zhang, Matteo Finco, and Annemarie Friedrich. SciOL and MuLMS-Img: Introducing a large-scale multimodal scientific dataset and models for image-text tasks in the scientific domain. In Proceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, pages 4560–4571, 2024.

[30] Xiaotian Zhang, Chunyang Li, Yi Zong, et al. Evaluating the performance of large language models on GAOKAO benchmark. arXiv preprint arXiv:2305.12474, 2023.

[31] Chaoqun He, Renjie Luo, Yuzhuo Bai, et al. OlympiadBench: A challenging benchmark for promoting AGI with Olympiad-level bilingual multimodal scientific problems. arXiv preprint arXiv:2402.14008, 2024.

[32] Enkelejda Kasneci, Kathrin Seßler, Stefan Küchemann, Maria Bannert, Daryna Dementieva, Frank Fischer, Urs Gasser, Georg Groh, Stephan Günnemann, Eyke Hüllermeier, et al. ChatGPT for good? On opportunities and challenges of large language models for education. Learning and Individual Differences, 103:102274, 2023.

[33] Karl Cobbe, Vineet Kosaraju, Mohammad Bavarian, et al. Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168, 2021.

[34] Pan Lu, Hritik Bansal, Tony Xia, et al. MathVista: Evaluating mathematical reasoning of foundation models in visual contexts. In International Conference on Learning Representations (ICLR), 2024.

[35] Ke Wang, Junting Pan, Weikang Shi, Zimu Lu, Houxing Ren, Aojun Zhou, Mingjie Zhan, and Hongsheng Li. Measuring multimodal mathematical reasoning with MATH-Vision dataset. In The Thirty-eight Conference on Neural Information Processing Systems Datasets and Benchmarks Track, 2024.

[36] Renrui Zhang, Dongzhi Jiang, Yichi Zhang, Haokun Lin, Ziyu Guo, Pengshuo Qiu, Aojun Zhou, Pan Lu, Kai-Wei Chang, Yu Qiao, et al. MathVerse: Does your multi-modal LLM truly see the diagrams in visual math problems? In European Conference on Computer Vision, pages 169–186, 2024.

[37] Runqi Qiao, Qiuna Tan, Guanting Dong, Minhui Wu, Chong Sun, Xiaoshuai Song, Jiapeng Wang, Zhuoma GongQue, Shanglin Lei, Yifan Zhang, et al. We-Math: Does your large multimodal model achieve human-like mathematical reasoning? In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 200...

[38] Liu et al. A survey on efficient reasoning for large language models, 2025. URL https://arxiv.org/abs/2504.10903.

[39] Luo et al. O1-Pruner: Length-harmonizing fine-tuning for o1-like reasoning pruning, 2025. URL https://arxiv.org/abs/2501.12570.

A. Supplementary Results

Tables 5, 6, and 7 provide more complete results of LMMs on three challenging subsets. Overall, the relative perfo...

Question types include fill-in-the-blank and open-ended questions. Some questions may contain multiple sub-questions, such as fill-in-the-blank questions with multiple blanks or open-ended questions with multiple parts. You must judge each sub-question independently.

Answers may be expressed in different forms, for example as a mathematical expression or a textual description. As long as the semantic meaning is equivalent, the answer is considered correct. Equivalent formulas expressed in different notations are also accepted. If equivalence cannot be determined, mark the student's answer as incorrect.

You do not need to re-derive the answer, as the standard answer is already provided. Simply compare the student's answer with the standard answer based on the question format to determine correctness.

For questions with definitive results, compare the student's final answer (typically enclosed in \boxed{}) with the standard answer; the solution process need not be evaluated. For questions without definitive results (e.g., proofs), evaluate whether the solution approach is correct (a sound reasoning approach suffices). Based on the above criteria, output...
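The criteria above say the final answer is typically enclosed in \boxed{}. As a rough illustration of how a grading harness might isolate that answer before handing it to a judge, here is a minimal Python sketch. The function name `extract_last_boxed` and the choice to take the last boxed expression are assumptions made for this example; the paper does not describe its extraction step.

```python
from typing import Optional


def extract_last_boxed(solution: str) -> Optional[str]:
    """Return the contents of the last boxed expression, or None if absent.

    Hypothetical helper: braces are matched by counting depth so that
    nested LaTeX such as \\frac{1}{2} inside the box is kept intact.
    """
    marker = "\\boxed{"
    start = solution.rfind(marker)
    if start == -1:
        return None  # no boxed answer in the solution text
    depth = 1
    begin = start + len(marker)
    for i in range(begin, len(solution)):
        ch = solution[i]
        if ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
            if depth == 0:
                return solution[begin:i].strip()
    return None  # unbalanced braces: treat as no usable answer
```

Returning `None` for a missing or malformed box leaves the decision to the caller, which fits the rule above that an answer whose equivalence cannot be determined is marked incorrect.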

Do not modify any document content (including text content and LaTeX formula content); only extract and organize the corresponding information.

Ensure all questions in the document are parsed without omission.

Output only a result that can be directly parsed as valid JSON without errors.

If there is no more content to extract, output only [DONE].

Target Schema (target_items):

question_types = ["multiple-choice", "fill-in-the-blank", "open-ended"]

• type: A string value. Select one from question_types based on the section header and question content. Multiple-choice questions typically contain options A/B/C/D; fill-in-the-blank questions featur...
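The parsing instructions above require output that is either valid JSON or the literal [DONE]. A minimal Python sketch of a consumer for that output follows. It assumes the model returns a JSON array of objects and checks only the `type` field, since the rest of the schema is truncated in this text; `parse_extraction_output` and `DONE_SENTINEL` are hypothetical names, not part of the paper's pipeline.

```python
import json
from typing import Any, Dict, List, Optional

QUESTION_TYPES = ["multiple-choice", "fill-in-the-blank", "open-ended"]
DONE_SENTINEL = "[DONE]"  # hypothetical constant name for the stop marker


def parse_extraction_output(raw: str) -> Optional[List[Dict[str, Any]]]:
    """Parse one model response; return None when extraction is finished."""
    text = raw.strip()
    # Check the sentinel first: "[DONE]" is not valid JSON on its own.
    if text == DONE_SENTINEL:
        return None
    items = json.loads(text)  # raises json.JSONDecodeError on invalid JSON
    if not isinstance(items, list):
        raise ValueError("expected a JSON array of question items")
    for index, item in enumerate(items):
        if not isinstance(item, dict):
            raise ValueError(f"item {index} is not a JSON object")
        if item.get("type") not in QUESTION_TYPES:
            raise ValueError(f"item {index} has an unknown question type")
    return items
```

Raising on malformed output rather than repairing it keeps the check in line with the instruction that document content must not be modified during extraction.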

In the normalized case $l = 1$, this gives $r = \tfrac{1}{4}$. Since the distance from the center to the chord subtending angle $2$ equals (outer radius) $\cdot \cos 1$, we obtain $OF = (2r)\cos 1 = \tfrac{1}{2}\cos 1$, and hence the distance from the point $E$ on the outer circle along $OE$ to the chord $BC$ is

$$EF = 2r - OF = \tfrac{1}{2} - \tfrac{1}{2}\cos 1 = \tfrac{1}{2}(1 - \cos 1).$$

(2) From (1) we have $r = \dfrac{2l}{3\alpha + 2}$. The area of the annula...

Therefore, $S_{\text{annulus}}$ attains its maximum value $\tfrac{l^2}{4}$ precisely when $\alpha = \tfrac{2}{3}$.
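The area expression itself is cut off in this text, so the following Python sketch is only a numeric sanity check under a stated assumption: that the region is an annular sector with inner radius $r$, outer radius $2r$ and central angle $\alpha$, so that $S = \tfrac{1}{2}\alpha\,((2r)^2 - r^2) = \tfrac{3}{2}\alpha r^2$. The relation $r = 2l/(3\alpha + 2)$ and the value $EF = \tfrac{1}{2}(1 - \cos 1)$ are taken from the solution text; the function name `annulus_area` and the grid search are illustrative choices.

```python
import math


def annulus_area(alpha: float, l: float = 1.0) -> float:
    """Area of the assumed annular sector for central angle alpha."""
    r = 2.0 * l / (3.0 * alpha + 2.0)  # relation quoted in the solution
    return 1.5 * alpha * r * r         # assumed: (1/2) * alpha * ((2r)^2 - r^2)


# Part (1): distance EF in the normalized case l = 1, where 2r = 1/2.
ef = 0.5 * (1.0 - math.cos(1.0))

# Part (2): coarse grid search for the maximizing angle on (0, 2*pi).
alphas = [k / 10000 for k in range(1, 62832)]
best_alpha = max(alphas, key=annulus_area)
best_area = annulus_area(best_alpha)

print(f"EF = {ef:.6f}")
print(f"argmax alpha = {best_alpha:.4f}, max area = {best_area:.6f}")
```

Under the assumed area formula, the maximum area found by the search can be compared directly with the stated value $l^2/4$ for $l = 1$.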
