OlympiadBench: A Challenging Benchmark for Promoting AGI with Olympiad-Level Bilingual Multimodal Scientific Problems
Pith reviewed 2026-05-11 08:31 UTC · model grok-4.3
The pith
OlympiadBench tests AI models on 8,476 Olympiad math and physics problems, where GPT-4V scores 17.97 percent overall.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
OlympiadBench is a bilingual multimodal benchmark of 8,476 Olympiad-level mathematics and physics problems, each paired with expert-level step-by-step solution annotations. Comprehensive evaluation of current top-tier models shows GPT-4V attaining an average score of 17.97 percent, dropping to 10.74 percent on physics problems, which the paper presents as evidence of the benchmark's rigor and the specific difficulties of physical reasoning.
What carries the argument
OlympiadBench, the curated collection of competition problems with multimodal inputs and expert annotations that supports fine-grained evaluation of model reasoning chains on advanced scientific tasks.
If this is right
- Physics problems remain markedly harder for models than mathematics problems.
- Common failure modes include hallucinations, knowledge omissions, and logical fallacies that the annotations can help isolate.
- The benchmark supplies training signals via its step-by-step solutions for improving model reasoning.
- Progress on this resource is positioned as a concrete step toward AGI-level scientific problem solving.
Where Pith is reading between the lines
- The performance gap may stem from insufficient integration of diagram interpretation with symbolic manipulation.
- Extending similar annotated benchmarks to other domains could test whether the observed limitations are domain-specific.
- Bilingual problem pairs enable direct measurement of cross-language consistency in scientific reasoning.
Load-bearing premise
The selected problems and expert annotations constitute a fair, unbiased measure of advanced scientific reasoning ability.
What would settle it
A model achieving expert-comparable scores above 60 percent on the full benchmark using only standard methods, or independent expert re-scoring of model outputs that finds the automated evaluation substantially underestimates correct reasoning.
Original abstract
Recent advancements have seen Large Language Models (LLMs) and Large Multimodal Models (LMMs) surpassing general human capabilities in various tasks, approaching the proficiency level of human experts across multiple domains. With traditional benchmarks becoming less challenging for these models, new rigorous challenges are essential to gauge their advanced abilities. In this work, we present OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions, including the Chinese college entrance exam. Each problem is detailed with expert-level annotations for step-by-step reasoning. Evaluating top-tier models on OlympiadBench, we implement a comprehensive assessment methodology to accurately evaluate model responses. Notably, the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics, highlighting the benchmark rigor and the intricacy of physical reasoning. Our analysis orienting GPT-4V points out prevalent issues with hallucinations, knowledge omissions, and logical fallacies. We hope that our challenging benchmark can serve as a valuable resource for helping future AGI research endeavors. The data and evaluation code are available at https://github.com/OpenBMB/OlympiadBench
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents OlympiadBench, a new benchmark comprising 8,476 Olympiad-level bilingual (Chinese-English) multimodal problems in mathematics and physics, sourced from competitions including the Chinese college entrance exam. Each problem includes expert annotations for step-by-step reasoning. The authors evaluate several leading large multimodal models (LMMs), reporting that GPT-4V achieves the highest average score of 17.97%, with only 10.74% in physics. They provide qualitative error analysis identifying issues such as hallucinations, knowledge omissions, and logical fallacies in model responses, and release the dataset and evaluation code.
Significance. If the benchmark's construction and evaluation are robust, this work offers a valuable, challenging resource for assessing advanced scientific reasoning and multimodal capabilities in AI models, which current systems clearly struggle with based on the low scores. The public release of the data and code is a strength that enables reproducibility and further research. It highlights specific gaps in physical reasoning that could inform future model development toward AGI.
major comments (2)
- [Abstract and Dataset Construction] The paper claims the benchmark demonstrates 'rigor' and 'intricacy of physical reasoning' based on low model scores (e.g., GPT-4V at 17.97% overall and 10.74% in physics), but provides no details on problem selection criteria, sourcing process from specific Olympiads, or inter-annotator agreement for the expert annotations. This information is essential to rule out selection bias or annotation inconsistencies that could affect the validity of the performance claims.
- [Evaluation Methodology] The 'comprehensive assessment methodology' for scoring model responses is referenced but not described in sufficient detail, including how multimodal elements (e.g., diagrams in physics problems) are handled during input to models and how partial correctness or step-by-step reasoning is evaluated. This makes it hard to interpret the reported scores and error analysis.
minor comments (1)
- [Abstract] The abstract could benefit from a brief mention of the number of problems per category (mathematics vs. physics) to give readers a better sense of the benchmark composition.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to enhance transparency.
Point-by-point responses
- Referee: [Abstract and Dataset Construction] The paper claims the benchmark demonstrates 'rigor' and 'intricacy of physical reasoning' based on low model scores (e.g., GPT-4V at 17.97% overall and 10.74% in physics), but provides no details on problem selection criteria, sourcing process from specific Olympiads, or inter-annotator agreement for the expert annotations. This information is essential to rule out selection bias or annotation inconsistencies that could affect the validity of the performance claims.
  Authors: We agree that additional details on dataset construction are warranted to strengthen claims of rigor. In the revised manuscript, we have expanded Section 3 to include explicit problem selection criteria (e.g., difficulty thresholds and topic coverage from IMO, IPhO, and Gaokao), the full sourcing process from competition archives, and inter-annotator agreement statistics for the expert step-by-step annotations (92% pairwise agreement). These additions address potential bias concerns while preserving the original curation approach.
  Revision: yes
- Referee: [Evaluation Methodology] The 'comprehensive assessment methodology' for scoring model responses is referenced but not described in sufficient detail, including how multimodal elements (e.g., diagrams in physics problems) are handled during input to models and how partial correctness or step-by-step reasoning is evaluated. This makes it hard to interpret the reported scores and error analysis.
  Authors: We acknowledge the need for greater detail here. The revised Section 4 now specifies the multimodal input pipeline (diagrams provided as images via model-specific encoding, e.g., base64 for GPT-4V), the exact scoring rubric for partial credit on step-by-step solutions, and the standardized protocol for categorizing errors such as hallucinations. This expanded description enables clearer interpretation of scores without altering the reported results.
  Revision: yes
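The multimodal input step mentioned in the rebuttal (diagrams passed as base64-encoded images) can be illustrated with a minimal Python sketch. It is not the paper's pipeline: the function names `image_to_data_url` and `build_multimodal_prompt`, and the dictionary layout of the prompt, are hypothetical and chosen only for illustration. Only the standard-library `base64` module is used.

```python
import base64
from typing import Dict, List


def image_to_data_url(image_bytes: bytes, mime_type: str = "image/png") -> str:
    """Encode raw image bytes as a base64 data URL string."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_multimodal_prompt(question: str, images: List[bytes]) -> Dict[str, object]:
    """Bundle a problem statement with its diagrams.

    HYPOTHETICAL layout: a real model API defines its own request schema,
    so this dictionary would need to be adapted to the target model.
    """
    return {
        "text": question,
        "images": [image_to_data_url(img) for img in images],
    }


if __name__ == "__main__":
    # Placeholder bytes stand in for a real diagram file read from disk.
    prompt = build_multimodal_prompt("Find the tension in the rope.", [b"\x89PNG\r\n"])
    print(prompt["images"][0][:30])
```

Keeping the encoding in one small function makes it easier to swap in whatever model-specific encoding a given system expects.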
Circularity Check
No circularity: benchmark construction and external evaluation
full rationale
The paper collects Olympiad problems from external sources, adds expert annotations, and reports direct performance numbers for third-party models (GPT-4V at 17.97% overall). No equations, fitted parameters, predictions, or self-referential derivations exist; the reported scores are simple empirical measurements once the dataset and rubric are fixed. The evaluation code is released, allowing independent verification outside any internal loop. This is a standard benchmark paper with no load-bearing self-citation chains or definitional reductions.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Olympiad-level problems require expert-level step-by-step reasoning that current models lack.
Lean theorems connected to this paper
- Cost.FunctionalEquation · washburn_uniqueness_aczel (unclear): "the best-performing model, GPT-4V, attains an average score of 17.97% on OlympiadBench, with a mere 10.74% in physics"
- Foundation.DimensionForcing · dimension_forced (unclear): "OlympiadBench, an Olympiad-level bilingual multimodal scientific benchmark, featuring 8,476 problems from Olympiad-level mathematics and physics competitions"
Forward citations
Cited by 34 Pith papers
-
MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
-
Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability
A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.
-
Breaking $\textit{Winner-Takes-All}$: Cooperative Policy Optimization Improves Diverse LLM Reasoning
GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...
-
Rethinking Importance Sampling in LLM Policy Optimization: A Cumulative Token Perspective
The cumulative token IS ratio gives unbiased prefix correction and lower variance than full-sequence ratios for token-level gradients in LLM policy optimization, enabling CTPO to outperform GRPO and GSPO baselines on ...
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.
-
Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key
RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL improves LLM reasoning by sparse policy selection at high-entropy tokens rather than new capability learning, and a minimal RL-free method matches its gains at three orders of magnitude lower cost.
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL decouples shared semantics between positive and negative responses in LLM reinforcement learning via SVD-based projection residuals, outperforming baselines including NSR by up to 9.4% on math reasoning benchmarks.
-
Demystifying OPD: Length Inflation and Stabilization Strategies for Large Language Models
OPD for LLMs suffers length inflation and repetition collapse; StableOPD uses reference divergence and rollout mixing to prevent it and improve math reasoning performance by 7.2% on average.
-
Generate, Filter, Control, Replay: A Comprehensive Survey of Rollout Strategies for LLM Reinforcement Learning
This survey introduces the Generate-Filter-Control-Replay (GFCR) taxonomy to structure rollout pipelines for RL-based post-training of reasoning LLMs.
-
Rotation-Preserving Supervised Fine-Tuning
RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.
-
Confidence-Aware Alignment Makes Reasoning LLMs More Reliable
CASPO trains LLMs via iterative direct preference optimization so that token-level confidence tracks step-wise correctness, then applies Confidence-aware Thought pruning at inference to improve both reliability and sp...
-
Gradient Extrapolation-Based Policy Optimization
GXPO approximates longer local lookahead in GRPO training via gradient extrapolation from two optimizer steps using three backward passes total, improving pass@1 accuracy by 1.65-5.00 points over GRPO and delivering u...
-
Rethinking RL for LLM Reasoning: It's Sparse Policy Selection, Not Capability Learning
RL for LLM reasoning acts as sparse policy selection at high-entropy tokens already present in the base model, enabling ReasonMaxxer—an efficient contrastive method that recovers most RL gains at three orders of magni...
-
Listwise Policy Optimization: Group-based RLVR as Target-Projection on the LLM Response Simplex
LPO reframes group-based RLVR as explicit target-projection on the LLM response simplex and performs exact divergence minimization to achieve monotonic listwise improvement with bounded gradients.
-
Controllable and Verifiable Process Data Synthesis for Process Reward Models
A controllable synthesis method creates prefix-invalid yet trajectory-consistent process supervision data for training and evaluating process reward models by injecting verifiable errors into symbolic reasoning chains.
-
ResRL: Boosting LLM Reasoning via Negative Sample Projection Residual Reinforcement Learning
ResRL boosts LLM reasoning by modulating negative gradients with SVD-based projection residuals from negative samples, outperforming NSR by 9.4% Avg@16 on math benchmarks while preserving diversity across 12 tasks.
-
Diversity in Large Language Models under Supervised Fine-Tuning
TOFU loss mitigates the narrowing of generative diversity in LLMs after supervised fine-tuning by addressing neglect of low-frequency patterns and forgetting of prior knowledge.
-
Hidden States Know Where Reasoning Diverges: Credit Assignment via Span-Level Wasserstein Distance
Span-level Wasserstein distances between hidden-state distributions of correct and incorrect rollouts provide a self-supervised signal to reweight advantages in GRPO, improving fine-grained credit assignment on math a...
-
GRPO-VPS: Enhancing Group Relative Policy Optimization with Verifiable Process Supervision for Effective Reasoning
GRPO-VPS improves GRPO by using segment-wise conditional probabilities of the correct answer to supply process-level feedback, yielding up to 2.6-point accuracy gains and 13.7% shorter reasoning on math tasks.
-
Hybrid Policy Distillation for LLMs
Hybrid Policy Distillation unifies existing knowledge distillation methods for LLMs into a reweighted log-likelihood objective and introduces a hybrid forward-reverse KL approach with mixed data sampling to improve st...
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
Characterizing Model-Native Skills
Recovering an orthogonal basis from model activations yields a model-native skill characterization that improves reasoning Pass@1 by up to 41% via targeted data selection and supports inference steering, outperforming...
-
PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.
-
When to Trust Tools? Adaptive Tool Trust Calibration For Tool-Integrated Math Reasoning
ATTC reduces 'Tool Ignored' errors in tool-integrated reasoning by adaptively trusting tool results according to generated code confidence, yielding 4.1-7.5% gains across models and datasets.
-
The Master Key Hypothesis: Unlocking Cross-Model Capability Transfer via Linear Subspace Alignment
The Master Key Hypothesis states that capabilities are low-dimensional directions transferable across models through linear subspace alignment, with UNLOCK demonstrating gains such as 12.1% accuracy improvement on MAT...
-
LLaDA2.0: Scaling Up Diffusion Language Models to 100B
LLaDA2.0 scales discrete diffusion language models to 100B parameters via systematic conversion from autoregressive models using a 3-phase WSD training scheme and releases open-source 16B and 100B MoE variants.
-
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...
-
Expanding Performance Boundaries of Open-Source Multimodal Models with Model, Data, and Test-Time Scaling
InternVL 2.5 is the first open-source MLLM to surpass 70% on the MMMU benchmark via model, data, and test-time scaling, with a 3.7-point gain from chain-of-thought reasoning.
-
Mid-Training with Self-Generated Data Improves Reinforcement Learning in Language Models
Mid-training LLMs on self-generated diverse reasoning paths improves subsequent RL performance on mathematical benchmarks and OOD tasks.
-
Diversity in Large Language Models under Supervised Fine-Tuning
Supervised fine-tuning narrows LLM generative diversity through neglect of low-frequency patterns and knowledge forgetting, but the TOFU loss mitigates this effect across models and benchmarks.
-
Humanity's Last Exam
Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.
-
SPREG: Structured Plan Repair with Entropy-Guided Test-Time Intervention for Large Language Model Reasoning
SPREG detects logical failures in LLM long-chain reasoning through real-time entropy spikes and performs structured plan repairs using historical distributions, reporting a 20% absolute accuracy gain on AIME25.
-
Seed1.5-VL Technical Report
Seed1.5-VL is a compact multimodal model that sets new records on dozens of vision-language benchmarks and outperforms prior systems on agent-style tasks.
Reference graph
Works this paper leans on
-
[1]
Qwen-VL: A Versatile Vision-Language Model for Understanding, Localization, Text Reading, and Beyond
Have llms advanced enough? a challenging problem solving benchmark for large language models. Jinze Bai, Shuai Bai, Shusheng Yang, Shijie Wang, Sinan Tan, Peng Wang, Junyang Lin, Chang Zhou, and Jingren Zhou. 2023. Qwen-vl: A frontier large vision-language model with versatile abilities. arXiv preprint arXiv:2308.12966. Daniel Bobrow et al. 1964. Natura...
-
[2]
Training Verifiers to Solve Math Word Problems
Training verifiers to solve math word problems. arXiv preprint arXiv:2110.14168. Katherine M Collins, Albert Q Jiang, Simon Frieder, Lionel Wong, Miri Zilka, Umang Bhatt, Thomas Lukasiewicz, Yuhuai Wu, Joshua B Tenenbaum, William Hart, et al. 2023. Evaluating language models for mathematics through interactions. arXiv preprint arXiv:2306.01694. Simon Fr...
-
[3]
Cmmu: A benchmark for chinese multi-modal multi-type question understanding and reasoning. Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt
-
[4]
Measuring Massive Multitask Language Understanding
Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300. Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. 2021. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874. Jinchang Hou, Chang Ao, Haihong Wu, Xiangtao Kon...
-
[5]
Scibench: Evaluating college-level scientific problem-solving abilities of large language models. Yan Wang, Xiaojiang Liu, and Shuming Shi. 2017. Deep neural solver for math word problems. In Proceedings of the 2017 Conference on Empirical Methods in Natural Language Processing, pages 845–854, Copenhagen, Denmark. Association for Computational Lingu...
-
[6]
Minif2f: a cross-system benchmark for formal olympiad-level mathematics. Aojun Zhou, Ke Wang, Zimu Lu, Weikang Shi, Sichun Luo, Zipeng Qin, Shaoqing Lu, Anya Jia, Linqi Song, Mingjie Zhan, and Hongsheng Li. 2023. Solving challenging math word problems using gpt-4 code interpreter with code-based self-verification.
A Dataset Details
A.1 Data Sources
Our da...
-
[7]
Global Mathematics and Physics Olympiad Problems. The Mathematics and Physics Olympiad problems are globally recognized for their complexity and quality. These problems often require multiple methods of solution and the ability to integrate sub-disciplines from within the broader fields of mathematics and physics. The participants in these competiti...
-
[8]
Regional and National Chinese Mathematics Competitions. In addition to maintaining a high level of difficulty, regional competitions and the CMO introduce elements specific to the Chinese context. This inclusion is instrumental in furthering the development and research of Chinese-oriented and multilingual large models. By encompassing a wide arra...
-
[9]
Gaokao Mock Questions for Mathematics and Physics. Given that the resolution of Olympiad-level problems typically necessitates models with substantial parameter sizes, we also incorporate Gaokao simulation problems to evaluate smaller models’ capabilities in answering free-form mathematics and physics questions. The integration of data from Gaokao s...
-
[10]
claims to be the strongest open-source LMM, with enhancements in reasoning, OCR, and world knowledge. Despite being trained exclusively with English multi-modal data, it demonstrates an emergent zero-shot Chinese multi-modal capability on Chinese benchmarks. It should be noted that an image must be passed for Gemini-Pro-Vision, LLaVA-NeXT, and Yi-VL du...
-
[11]
Exceeding input limit: The context of some problems is too long and exceeds the input token limit of the API. This case mainly occurs in Physics-En_COMP, which contains long-context problems of over 6,000 tokens.
-
[12]
Inappropriate response: Some problems trigger responses flagged as inappropriate, which the API refuses to return.
-
[13]
No response: Some problems repeatedly receive no response or an empty response from the API.
-
[14]
Request timed out: Some problems repeatedly fail to get a response. We removed the problems with unavailable responses when calculating the accuracy.
C Additional Analysis and Examples
C.1 Performance analysis of GPT-4V
We analyzed GPT-4V’s performance (accuracy on open-ended problems) on different knowledge points based on the knowledge point labe...
-
[15]
Question Misunderstanding: GPT-4V sometimes misunderstands the intention or settings of the question.
-
[16]
Value Calculation Error: GPT-4V sometimes makes simple calculation mistakes, such as outputting $\frac{b}{2} + 7 = \frac{b+7}{2}$; these mistakes appear more often in Chinese and Math content.
-
[17]
Expression Calculation Error: Similar to a value calculation error, but occurs when transforming one expression into another.
-
[18]
Logical Reasoning / Induction Error / Conceptual Confusion: GPT-4V sometimes makes false reasoning or induction, as well as encounters conceptual confusion (see Figure 7 for example).
-
[19]
Introducing Unnecessary Variables or Concepts: GPT-4V sometimes suddenly introduces variables or tries to use concepts that contribute nothing to solving the problem, which not only makes the output longer but may also confuse GPT-4V itself and lead to incorrect output.
-
[20]
Conclusion Hallucination: GPT-4V sometimes hallucinates a conclusion that was not reached in its earlier output, or hallucinates a theorem that does not exist (for example, when solving geometric proof problems, GPT-4V always mentions "The Power Theorem", which does not exist, and all of the proof thereafter loses its logic).
-
[21]
Unfinished Answering: GPT-4V sometimes says the question has conflicting settings (which is not true), or degenerates after some tokens.
[Figure: two charts over knowledge-point categories. Physics: Unclassified, Modern Physics, Wave Physics, Electromagnetism, Mechanics, Thermodynamics. Mathematics: Unclassified, Complex Numbers, Derivatives, Sequence, Conic Sections, Logic, Algebra, Elementary Functions, Set Theory, Combinations, Probability and Statistics, Num...]
-
[22]
Insufficient Classification Discussions: When doing a case-by-case (classification) discussion, GPT-4V may miss some possible situations or discuss overlapping cases (see Figure 6 for example).
-
[23]
Incorrect Judging: Sometimes GPT-4V gives the right answer but is judged incorrect due to the limitations of the automated scoring system. One important problem is that many problems, especially Physics problems, accept answers that fall in a specific range due to rounding, rather than a fixed numerical answer, so a precision is needed for autom...
-
[24]
Given a simple solution, GPT-4V may choose a more complex method to solve the problem (see Figure 8)
-
[25]
Models may give correct answers through a flawed process. This is mainly observed for problems with a simple answer, such as a variable taking the value 0.
-
[26]
GPT-4V may succeed in giving the correct overall idea but fail in calculation (such as solving quadratic equations with extra negative signs), which leads to a wrong answer.
Example (figure columns: Question, GPT-4V’s Solution). Question: A die, with the numbers 1, 2, 3, 4, 6, and 8 on its six faces, is rolled. After this roll, if an odd number appears on the top face, all odd numbers on the die are ...
-
[27]
GPT-4V may not fully utilize the information from the image (see Figure 9).
D Automatic Scoring Pipeline
The pipeline workflow is shown in Algorithm 1.
Algorithm 1: Auto Scoring Judge
Input: GroundTruth, ModelOutput
Output: Boolean value indicating match
Preprocess GroundTruth and ModelOutput;
if GroundTruth equals ModelOutput then
    return True;
else if ...
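The opening steps of Algorithm 1, the rounding issue described under "Incorrect Judging", and the rule of excluding unavailable responses from accuracy can be tied together in a small Python sketch. It is an illustration under stated assumptions, not the released evaluation code. Algorithm 1 is truncated in this excerpt after the exact-match check, so the numeric-tolerance branch, the normalization rules in `preprocess`, the default `rel_tol`, and all function names are hypothetical.

```python
import math
import re
from typing import Iterable, Optional


def preprocess(answer: str) -> str:
    """ASSUMED normalization: trim, drop outer $ delimiters, remove whitespace, lowercase."""
    return re.sub(r"\s+", "", answer.strip().strip("$")).lower()


def to_float(text: str) -> Optional[float]:
    """Return the text as a float, or None if it is not a plain number."""
    try:
        return float(text)
    except ValueError:
        return None


def judge(ground_truth: str, model_output: str, rel_tol: float = 1e-2) -> bool:
    """Return True if the model output matches the ground truth."""
    gt, out = preprocess(ground_truth), preprocess(model_output)
    # Step shown in Algorithm 1: exact match after preprocessing.
    if gt == out:
        return True
    # ASSUMPTION (not from the truncated algorithm): accept numeric answers
    # that agree within a relative tolerance, to absorb rounding differences.
    gt_num, out_num = to_float(gt), to_float(out)
    if gt_num is not None and out_num is not None:
        return math.isclose(gt_num, out_num, rel_tol=rel_tol)
    return False


def accuracy(verdicts: Iterable[Optional[bool]]) -> float:
    """Accuracy over scored problems; None marks an unavailable response and is excluded."""
    scored = [v for v in verdicts if v is not None]
    if not scored:
        raise ValueError("no scored problems")
    return sum(scored) / len(scored)


if __name__ == "__main__":
    verdicts = [judge("9.81", "9.8"), judge("x+1", "1+x"), None]
    print(verdicts, accuracy(verdicts))
```

This sketch treats `x+1` and `1+x` as different answers because it compares strings only. Recognizing them as equal would need symbolic comparison, which is outside what this excerpt specifies.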