MathArena: Evaluating LLMs on Uncontaminated Math Competitions
Pith reviewed 2026-05-15 00:05 UTC · model grok-4.3
The pith
MathArena evaluates LLMs on math competition problems released after their training data cutoffs to eliminate contamination.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
MathArena is a benchmark that uses problems from recurring math competitions released after model training cutoffs. This produces contamination-free evaluations across more than 50 models and 162 problems from seven contests. Results show contamination in AIME 2024, strong reasoning on harder contests such as CMIMC 2025, and a clear gap in proof-writing with top models scoring slightly less than 40 percent on IMO 2025.
What carries the argument
The central mechanism is real-time evaluation on newly released problems from recurring competitions, which supplies a continuous stream of fresh test items.
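As a rough illustration of this mechanism, the sketch below keeps only problems whose release date falls after a model's training cutoff. The `Problem` record, its field names, and all dates are hypothetical placeholders; none of them come from the paper or from the MathArena codebase.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Problem:
    contest: str    # label only, e.g. a contest name
    released: date  # hypothetical official release date


def eligible_problems(problems, training_cutoff):
    """Return problems released strictly after the model's training cutoff."""
    return [p for p in problems if p.released > training_cutoff]


# Placeholder dates for illustration only; these are not real release dates.
pool = [
    Problem("older contest", date(2024, 2, 1)),
    Problem("newer contest", date(2025, 2, 6)),
]
fresh = eligible_problems(pool, training_cutoff=date(2024, 10, 1))
print([p.contest for p in fresh])
```

The strict inequality is a deliberately conservative choice: a problem released on the cutoff date itself is treated as potentially seen.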
Where Pith is reading between the lines
- The approach could extend to other fields that release regular high-quality challenges, such as programming or physics contests.
- Models may need targeted training on formal proof structures to close the observed gap.
- Ongoing updates to the benchmark could become a standard practice for keeping AI evaluations current and fair.
Load-bearing premise
Newly released competition problems have never appeared in any training corpus or web scrape used by the evaluated models.
What would settle it
Locating any of the 2025 contest problems used in MathArena inside the training data of a top-performing model would disprove the contamination-free claim.
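One way such a search could be approached, assuming access to a candidate corpus (which is usually not available for proprietary models), is a word n-gram overlap check. The sketch below is a generic illustration and not a procedure described in the paper; the function names, the 8-gram window, and the toy strings are all assumptions.

```python
import re


def word_ngrams(text, n=8):
    """Return the set of lowercase word n-grams in `text`."""
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}


def overlap_fraction(problem, document, n=8):
    """Fraction of the problem's n-grams that also occur in `document`."""
    grams = word_ngrams(problem, n)
    if not grams:
        return 0.0  # problem shorter than n words: nothing to compare
    return len(grams & word_ngrams(document, n)) / len(grams)


# Toy strings invented for illustration; not real contest problems.
problem = "Find the number of ordered pairs of positive integers whose sum is twelve"
leaked = "forum post: find the number of ordered pairs of positive integers whose sum is twelve"
unrelated = "a recipe for bread that needs flour water salt and a little bit of yeast"

print(overlap_fraction(problem, leaked))
print(overlap_fraction(problem, unrelated))
```

A high overlap fraction would only suggest that the text was present in the searched corpus; a low one does not rule out paraphrased or translated copies.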
Original abstract
The rapid advancement of reasoning capabilities in large language models (LLMs) has led to notable improvements on mathematical benchmarks. However, many of the most commonly used evaluation datasets (e.g., AIME 2024) are widely available online, making it difficult to disentangle genuine reasoning from potential memorization. Furthermore, these benchmarks do not evaluate proof-writing capabilities, which are crucial for many mathematical tasks. To address this, we introduce MathArena, a new benchmark based on the following key insight: recurring math competitions provide a stream of high-quality, challenging problems that can be used for real-time evaluation of LLMs. By evaluating models as soon as new problems are released, we effectively eliminate the risk of contamination. Using this framework, we find strong signs of contamination in AIME 2024. Nonetheless, evaluations on harder competitions, such as CMIMC 2025, demonstrate impressive reasoning capabilities in top-performing models. MathArena is also the first benchmark for proof-writing capabilities. On IMO 2025, top models achieve slightly less than 40%, demonstrating both notable progress and significant room for improvement. So far, we have evaluated over 50 models across seven competitions, totaling 162 problems. As an evolving benchmark, MathArena will continue to track the progress of LLMs on newly released competitions, ensuring rigorous and up-to-date evaluation of mathematical reasoning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces MathArena, a benchmark that evaluates LLMs on newly released problems from math competitions (AIME, CMIMC, IMO, etc.) to avoid contamination from training data. It reports strong evidence of contamination in AIME 2024, impressive reasoning on harder contests such as CMIMC 2025, and the first systematic results on proof-writing, with top models scoring slightly below 40% on IMO 2025. Over 50 models were tested on 162 problems total, with the benchmark positioned as an evolving, real-time evaluation framework.
Significance. If the no-contamination premise is substantiated, MathArena supplies a valuable, extensible resource for measuring genuine generalization in LLM mathematical reasoning, especially proof generation, which existing benchmarks largely omit. The empirical contrast between contaminated and post-cutoff contests, together with the ongoing release pipeline, could set a precedent for contamination-resistant evaluation in AI.
major comments (3)
- [Introduction and §3] The central methodological claim (Introduction and §3) that immediate post-release evaluation eliminates contamination risk is load-bearing yet rests on an unverified assumption; no archive searches, web-probe experiments, or training-data overlap checks are described for the 162 problems, leaving open the possibility that unofficial leaks or forum posts reached training corpora before model cutoffs.
- [§4.3] §4.3 (IMO 2025 evaluation): the reported top-model score of slightly less than 40% on proof-writing lacks detail on grading protocol, including rubric, whether grading was automated or human, number of graders, and inter-rater agreement; without these, the quantitative claim cannot be assessed for reliability.
- [§4.1] §4.1 (contamination detection for AIME 2024): the exact procedure, metrics, and thresholds used to identify 'strong signs of contamination' are not specified, preventing readers from determining whether analogous undetected leakage could affect the harder-contest results.
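The inter-rater agreement requested in the §4.3 comment could be reported with a chance-corrected statistic such as Cohen's kappa. The sketch below assumes exactly two graders assigning one categorical score per proof; the 0-7 scale and the scores themselves are invented for illustration and do not come from the paper.

```python
from collections import Counter


def cohens_kappa(scores_a, scores_b):
    """Cohen's kappa for two graders who scored the same items."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two equal-length, non-empty score lists")
    n = len(scores_a)
    # Observed agreement: share of items with identical scores.
    observed = sum(a == b for a, b in zip(scores_a, scores_b)) / n
    # Chance agreement: product of each grader's marginal frequencies.
    freq_a, freq_b = Counter(scores_a), Counter(scores_b)
    expected = sum(freq_a[k] * freq_b[k] for k in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both graders gave one identical score to every item
    return (observed - expected) / (1 - expected)


# Invented scores on a 0-7 scale, for illustration only.
grader_1 = [7, 0, 1, 7, 0, 2]
grader_2 = [7, 0, 0, 7, 0, 2]
print(round(cohens_kappa(grader_1, grader_2), 3))
```

Plain kappa treats a 6-versus-7 disagreement the same as a 0-versus-7 one; for ordinal scores a weighted variant may be the more informative choice.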
minor comments (2)
- [Table 1] A consolidated table listing all seven competitions, their release dates, and problem counts would improve readability and allow quick cross-reference with the reported scores.
- [Throughout] Model names and abbreviations are introduced inconsistently; a single nomenclature table or footnote list would reduce ambiguity across figures and tables.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below, providing clarifications and indicating where revisions will be made to improve the manuscript.
- Referee: [Introduction and §3] The central methodological claim (Introduction and §3) that immediate post-release evaluation eliminates contamination risk is load-bearing yet rests on an unverified assumption; no archive searches, web-probe experiments, or training-data overlap checks are described for the 162 problems, leaving open the possibility that unofficial leaks or forum posts reached training corpora before model cutoffs.
  Authors: We agree that the manuscript would benefit from greater transparency on this point. While immediate post-release evaluation inherently limits the opportunity for contamination relative to static benchmarks, we recognize that unofficial leaks remain a theoretical possibility. In the revised version we will expand the Introduction and §3 to include explicit timelines of each contest's official release dates, our evaluation dates, and any checks we performed for public availability on official sites and major forums before model cutoffs. We will also add a limitations paragraph acknowledging that absolute verification of training-data absence is infeasible and explaining why the approach still offers stronger protection than existing benchmarks.
  revision: partial
- Referee: [§4.3] §4.3 (IMO 2025 evaluation): the reported top-model score of slightly less than 40% on proof-writing lacks detail on grading protocol, including rubric, whether grading was automated or human, number of graders, and inter-rater agreement; without these, the quantitative claim cannot be assessed for reliability.
  Authors: We accept this criticism and will substantially expand §4.3. The revised text will state that all proofs were graded manually by expert mathematicians using a rubric adapted from official IMO scoring guidelines, with emphasis on mathematical correctness, completeness, and clarity. Grading was performed independently by two graders, with a third expert resolving any disagreements; we will report the resulting inter-rater agreement. The full rubric will be included in the appendix.
  revision: yes
- Referee: [§4.1] §4.1 (contamination detection for AIME 2024): the exact procedure, metrics, and thresholds used to identify 'strong signs of contamination' are not specified, preventing readers from determining whether analogous undetected leakage could affect the harder-contest results.
  Authors: We will revise §4.1 to describe the detection procedure in full. The method compared model accuracy on AIME 2024 against expected performance derived from similar problems in prior uncontaminated contests, using quantitative metrics such as accuracy deviation and qualitative inspection of solution patterns for signs of memorization. Thresholds were defined via statistical outliers relative to baseline models. The revised section will specify the exact metrics and thresholds so readers can evaluate the strength of the evidence and apply analogous reasoning to other contests.
  revision: yes
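The accuracy-deviation idea in the last response can be made concrete with a small sketch. Everything here is an assumption made for illustration: the z-score rule, the threshold of 2.0, the model names, and the accuracies are invented, and the simulated rebuttal does not state the actual metrics or thresholds.

```python
from statistics import mean, stdev


def flag_outliers(observed, expected, baselines, z_threshold=2.0):
    """Return non-baseline models whose accuracy deviation is a high outlier.

    Deviation = observed accuracy - expected accuracy. The z-score is taken
    against the baseline models' deviations; z_threshold is an arbitrary choice.
    """
    deviation = {m: observed[m] - expected[m] for m in observed}
    base = [deviation[m] for m in baselines]
    mu, sigma = mean(base), stdev(base)
    if sigma == 0:
        raise ValueError("baseline deviations have zero spread")
    return sorted(
        m for m in deviation
        if m not in baselines and (deviation[m] - mu) / sigma > z_threshold
    )


# Invented accuracies for hypothetical models; not results from the paper.
observed = {"base_1": 0.42, "base_2": 0.29, "base_3": 0.53, "base_4": 0.35,
            "model_x": 0.85, "model_y": 0.62}
expected = {"base_1": 0.40, "base_2": 0.30, "base_3": 0.50, "base_4": 0.35,
            "model_x": 0.60, "model_y": 0.60}
baselines = {"base_1", "base_2", "base_3", "base_4"}
print(flag_outliers(observed, expected, baselines))
```

Only positive deviations are flagged, since contamination would be expected to inflate accuracy rather than depress it.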
Circularity Check
No circularity: empirical benchmark scores on external contests with no derived predictions or self-referential reductions
full rationale
The paper presents an empirical benchmark (MathArena) that scores LLMs on newly released competition problems (AIME 2024, CMIMC 2025, IMO 2025, etc.). Central results are raw performance percentages across 162 problems and 50+ models. No equations, fitted parameters, or first-principles derivations exist that could reduce to inputs by construction. The key methodological claim (evaluating 'as soon as new problems are released' eliminates contamination) is an unverified assumption about external data, not a self-definitional or fitted prediction inside the paper. No self-citations are load-bearing for any quantitative result. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Newly released math competition problems have not appeared in any training data or web scrape used by the evaluated LLMs.
Forward citations
Cited by 23 Pith papers
- MathConstraint: Automated Generation of Verified Combinatorial Reasoning Instances for LLMs
  MathConstraint generates scalable, automatically verifiable combinatorial problems where LLMs achieve 18.5-66.9% accuracy without tools but roughly double that with solver access.
- MathNet: a Global Multimodal Benchmark for Mathematical Reasoning and Retrieval
  MathNet delivers the largest multilingual Olympiad math dataset and benchmarks where models like Gemini-3.1-Pro reach 78% on solving but embedding models struggle on equivalent problem retrieval, with retrieval augmen...
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD enforces teacher consistency by precomputing log-probabilities over SFT rollouts, matching standard OPD performance with bounded gradient discrepancy and achieving 4x speedup on math and code reasoning tasks.
- Formal Conjectures: An Open and Evolving Benchmark for Verified Discovery in Mathematics
  Formal Conjectures is a Lean 4 benchmark containing 2615 formalized problems with 1029 open conjectures, designed to evaluate automated mathematical reasoning and proof discovery.
- Not All Proofs Are Equal: Evaluating LLM Proof Quality Beyond Correctness
  LLM proofs for hard math problems show large differences in quality metrics like conciseness and cognitive simplicity that correctness-only tests miss, along with trade-offs between quality and correctness.
- Beyond Negative Rollouts: Positive-Only Policy Optimization with Implicit Negative Gradients
  POPO uses bounded importance sampling on positive rollouts and a siamese policy network to achieve implicit negative gradients and stable optimization, matching or exceeding GRPO on math benchmarks such as 36.67% on A...
- Math Education Digital Shadows for facilitating learning with LLMs: Math performance, anxiety and confidence in simulated students and AIs
  MEDS is a dataset of 28,000 LLM personas performing high-school math tasks alongside psychometric tests and cognitive networks that capture math anxiety, self-efficacy, and confidence to support safer AI tutors.
- Rethinking the Comparison Unit in Sequence-Level Reinforcement Learning: An Equal-Length Paired Training Framework from Loss Correction to Sample Construction
  EqLen is a sample-construction framework that builds equal-length paired segments via dual-track generation and masking for stable group-relative RL in sequences, reframing the length problem as a comparison-unit issu...
- Prefix Teach, Suffix Fade: Local Teachability Collapse in Strong-to-Weak On-Policy Distillation
  Local teachability collapse in trajectory suffixes makes uniform dense supervision suboptimal in strong-to-weak OPD; truncating at BIC-style change points on teacher margin improves performance.
- DataMaster: Data-Centric Autonomous AI Research
  DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GP...
- DataMaster: Data-Centric Autonomous AI Research
  DataMaster autonomously optimizes data via tree search and shared memory, raising medal rate 32.27% on MLE-Bench Lite and beating the base instruct model on GPQA.
- Implicit Compression Regularization: Concise Reasoning via Internal Shorter Distributions in RL Post-Training
  ICR creates a virtual shorter distribution from shortest correct on-policy responses to regularize RL post-training toward concise yet accurate reasoning, improving the accuracy-length Pareto frontier on math and know...
- An Interpretable and Scalable Framework for Evaluating Large Language Models
  A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
- AutoPyVerifier: Learning Compact Executable Verifiers for Large Language Model Outputs
  AutoPyVerifier learns compact sets of executable Python verifiers from labeled LLM outputs via LLM synthesis and DAG search, improving objective prediction by up to 55 F1 points and downstream LLM accuracy by up to 17 points.
- Internalizing Outcome Supervision into Process Supervision: A New Paradigm for Reinforcement Learning for Reasoning
  A new RL paradigm for reasoning where models generate their own internal process supervision from outcome feedback by recycling failed trajectories.
- Rethinking On-Policy Distillation of Large Language Models: Phenomenology, Mechanism, and Recipe
  On-policy distillation works when student and teacher models share thinking patterns and the teacher adds new capabilities, with success tied to alignment on a small set of high-probability tokens.
- Lightning OPD: Efficient Post-Training for Large Reasoning Models with Offline On-Policy Distillation
  Lightning OPD is an offline on-policy distillation method that matches standard OPD performance at 4x efficiency by enforcing teacher consistency between SFT and distillation phases.
- M2A: Synergizing Mathematical and Agentic Reasoning in Large Language Models
  M2A uses null-space model merging to combine mathematical and agentic reasoning in LLMs, raising SWE-Bench Verified performance from 44.0% to 51.2% on Qwen3-8B without retraining.
- On-Policy Distillation with Best-of-N Teacher Rollout Selection
  BRTS improves on-policy distillation by selecting the highest-quality teacher trajectory from a small pool of samples based on correctness and alignment with the student, yielding gains on AIME and AMC math benchmarks.
- On-Policy Distillation with Best-of-N Teacher Rollout Selection
  BRTS improves on-policy distillation by sampling multiple teacher rollouts and selecting the best one via a correctness-first then alignment priority rule, yielding gains on AIME and AMC math benchmarks.
- Beyond Benchmarks: MathArena as an Evaluation Platform for Mathematics with LLMs
  MathArena is a maintained platform evaluating LLMs across olympiad problems, proofs, research questions, and formal proofs, with GPT-5.5 reaching 98% on 2026 USAMO and 74% on research-level tasks.
- Beyond Distribution Sharpening: The Importance of Task Rewards
  Task-reward reinforcement learning yields robust gains on math benchmarks for models like Llama-3.2-3B while distribution sharpening alone delivers only limited and unstable improvements.
- MiMo-V2-Flash Technical Report
  MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...