Humanity's Last Exam
Pith reviewed 2026-05-10 18:36 UTC · model grok-4.3
The pith
A benchmark of 2500 expert-level questions shows state-of-the-art LLMs still perform poorly on hard academic problems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors assembled 2500 multi-modal questions across dozens of subjects, each carrying a known, unambiguous solution that is easily verified yet not quickly retrievable from the internet. State-of-the-art LLMs achieve low accuracy and poor calibration on this collection, in contrast to their near-ceiling performance on saturated earlier benchmarks, thereby exposing a measurable distance between present model abilities and the expert human frontier on closed-ended academic questions.
What carries the argument
The Humanity's Last Exam benchmark itself, a fixed set of 2500 expert-developed questions with verifiable answers that resist rapid retrieval.
If this is right
- The benchmark supplies a durable yardstick for measuring gains in reasoning and knowledge on genuinely difficult problems.
- Model developers gain a concrete signal that current approaches leave substantial headroom before expert-level closed-ended performance.
- Policymakers receive a clearer view of the distance between deployed systems and human-expert capability on academic tasks.
- Subsequent evaluation efforts can adopt the same global-expert, verifiable-answer design for other domains.
Where Pith is reading between the lines
- Strong performance on this set may correlate with competence on complex real-world expert workflows that mix facts and reasoning.
- The multi-modal format points to a need for joint advances in text and visual understanding at frontier difficulty.
- Repeated use of the same questions over time will let researchers quantify whether gains are genuine or partly due to data leakage.
- Similar coordinated expert efforts could produce parallel tests for fields where knowledge moves faster than static benchmarks allow.
Load-bearing premise
The questions have clear solutions that cannot be quickly found through internet searches and sit at the current edge of what human experts know.
What would settle it
An independent check that shows many of the questions can be answered correctly by standard web search or that top LLMs reach above 60 percent accuracy on the full set without additional training.
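The 60 percent threshold above can be checked with a small amount of arithmetic. The following is a minimal sketch, not anything taken from the paper: it assumes each of the 2500 questions is graded as simply correct or incorrect, and it uses a Wilson score interval (a standard choice, but an assumption here) to ask whether an observed accuracy is clearly above the threshold. The function names and the default threshold parameter are hypothetical.

```python
import math


def wilson_interval(correct: int, total: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (about 95% when z = 1.96)."""
    if total <= 0:
        raise ValueError("total must be positive")
    p = correct / total
    denom = 1 + z**2 / total
    centre = (p + z**2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z**2 / (4 * total**2)) / denom
    return centre - half, centre + half


def exceeds_threshold(correct: int, total: int, threshold: float = 0.60) -> bool:
    """True only if the lower end of the interval lies above the threshold."""
    low, _ = wilson_interval(correct, total)
    return low > threshold


if __name__ == "__main__":
    # Hypothetical counts, chosen only to illustrate the check.
    for correct in (1500, 1600):
        low, high = wilson_interval(correct, 2500)
        print(correct, round(low, 3), round(high, 3), exceeds_threshold(correct, 2500))
```

Requiring the lower bound, rather than the point estimate, to clear the threshold means that a score of exactly 60 percent on 2500 questions would not count as settling the matter.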
Original abstract
Benchmarks are important tools for tracking the rapid advancements in large language model (LLM) capabilities. However, benchmarks are not keeping pace in difficulty: LLMs now achieve over 90% accuracy on popular benchmarks like MMLU, limiting informed measurement of state-of-the-art LLM capabilities. In response, we introduce Humanity's Last Exam (HLE), a multi-modal benchmark at the frontier of human knowledge, designed to be the final closed-ended academic benchmark of its kind with broad subject coverage. HLE consists of 2,500 questions across dozens of subjects, including mathematics, humanities, and the natural sciences. HLE is developed globally by subject-matter experts and consists of multiple-choice and short-answer questions suitable for automated grading. Each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval. State-of-the-art LLMs demonstrate low accuracy and calibration on HLE, highlighting a significant gap between current LLM capabilities and the expert human frontier on closed-ended academic questions. To inform research and policymaking upon a clear understanding of model capabilities, we publicly release HLE at https://lastexam.ai.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Humanity's Last Exam (HLE), a multi-modal benchmark of 2,500 closed-ended questions (multiple-choice and short-answer) spanning mathematics, humanities, and natural sciences. Questions were developed globally by subject-matter experts and are asserted to have unambiguous, verifiable solutions that cannot be quickly answered via internet retrieval. The paper claims that existing benchmarks like MMLU are saturated (>90% LLM accuracy) and positions HLE as a frontier benchmark on which state-of-the-art LLMs exhibit low accuracy and poor calibration, revealing a substantial gap to expert human performance. The benchmark is released publicly at lastexam.ai.
Significance. If the questions are rigorously validated as non-retrievable and frontier-level, HLE would be a valuable contribution by supplying a non-saturated, broad-coverage benchmark for tracking LLM progress on expert academic tasks. The global expert curation and multi-modal design are strengths, and the public release supports reproducibility. However, the claimed significance of the LLM capability gap rests on unshown validation evidence, limiting its current impact for research and policy.
major comments (2)
- [Abstract and question development section] The assertion that 'each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval' is load-bearing for interpreting low LLM accuracy as evidence of a true capability frontier rather than training-data gaps or leakage. Yet no concrete methodology that would directly support this central claim is supplied (e.g., expert search audits, originality checks, or quantitative retrievability tests).
- [Results and evaluation sections] The abstract states that SOTA LLMs 'demonstrate low accuracy and calibration' on HLE, yet the provided information contains no quantitative results, specific model accuracies, baselines, calibration metrics, or statistical details. This absence makes it impossible to assess the magnitude or robustness of the reported gap.
minor comments (1)
- [Abstract] Abstract: Including one or two concrete accuracy figures (with model names) would make the 'low accuracy' claim more precise and informative for readers.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback on our manuscript introducing Humanity's Last Exam. We address each major comment point by point below, with clear indications of planned revisions to improve clarity and rigor.
Point-by-point responses
- Referee: [Abstract and question development section] The assertion that 'each question has a known solution that is unambiguous and easily verifiable, but cannot be quickly answered via internet retrieval' is load-bearing for interpreting low LLM accuracy as evidence of a true capability frontier rather than training-data gaps or leakage. Yet no concrete methodology that would directly support this central claim is supplied (e.g., expert search audits, originality checks, or quantitative retrievability tests).
  Authors: We agree that explicit validation details are essential to support the non-retrievability claim and to distinguish capability gaps from data leakage. The manuscript describes global expert curation and the requirement for verifiable solutions, but we acknowledge the need for greater specificity. In the revised version, we will add a dedicated subsection under question development that outlines the concrete procedures: expert-conducted web searches for each question, checks against academic databases and prior benchmarks for originality, and any quantitative thresholds or audit logs used to confirm that solutions cannot be quickly retrieved. Examples of such checks for representative questions will be included where feasible without compromising the benchmark.
  Revision: yes
- Referee: [Results and evaluation sections] The abstract states that SOTA LLMs 'demonstrate low accuracy and calibration' on HLE, yet the provided information contains no quantitative results, specific model accuracies, baselines, calibration metrics, or statistical details. This absence makes it impossible to assess the magnitude or robustness of the reported gap.
  Authors: We apologize that the quantitative results were not presented with sufficient prominence or completeness in the version under review. The manuscript does contain an evaluation section reporting model performance, but we will revise it to include explicit tables with per-model accuracies (e.g., for GPT-4o, Claude 3.5 Sonnet, and others), direct comparisons to human expert baselines, calibration metrics such as expected calibration error, and basic statistical details including confidence intervals or variance across question subsets. This will enable readers to evaluate the scale and reliability of the observed gap.
  Revision: yes
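To make the calibration metric named in the response concrete, here is a minimal sketch of expected calibration error (ECE). It assumes the model reports a confidence in [0, 1] alongside each answer and that each answer is graded correct or incorrect. The equal-width binning, the default of 10 bins, and the function name are illustrative assumptions, not the paper's evaluation code.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Weighted average gap between mean confidence and accuracy per bin.

    confidences: stated confidences in [0, 1], one per question.
    correct: booleans, True where the graded answer was right.
    """
    if len(confidences) != len(correct):
        raise ValueError("confidences and correct must have the same length")
    n = len(confidences)
    if n == 0:
        raise ValueError("at least one prediction is required")

    # Equal-width bins over [0, 1]; a confidence of exactly 1.0 goes in the last bin.
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)
        bins[idx].append((conf, 1.0 if ok else 0.0))

    ece = 0.0
    for members in bins:
        if not members:
            continue
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(o for _, o in members) / len(members)
        ece += (len(members) / n) * abs(avg_conf - accuracy)
    return ece


if __name__ == "__main__":
    # Hypothetical data: a model that is confident but mostly wrong.
    print(expected_calibration_error([0.9, 0.9, 0.9, 0.9], [True, False, False, False]))
```

A model that states 90 percent confidence while answering one question in four correctly has a large ECE, which is the pattern that "low accuracy and poor calibration" describes.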
Circularity Check
No circularity: benchmark dataset release without derivations or fits
The paper introduces Humanity's Last Exam as a new multi-modal benchmark consisting of 2,500 expert-authored questions. It contains no mathematical derivations, model equations, parameter fittings, or predictions derived from internal computations. The central claims—that questions are unambiguous, verifiable, and not quickly retrievable via internet, and that current LLMs show low accuracy—rest on the empirical construction and release of the dataset itself rather than any self-referential reduction of outputs to inputs. No self-citation chains, ansatzes, or renamings of known results are used to justify load-bearing steps. The work is therefore self-contained as a benchmark contribution with no derivation chain to inspect for circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Questions have known, unambiguous, and easily verifiable solutions that cannot be quickly answered via internet retrieval.
Forward citations
Cited by 54 Pith papers
-
Soohak: A Mathematician-Curated Benchmark for Evaluating Research-level Math Capabilities of LLMs
Soohak is a new 439-problem mathematician-authored benchmark showing frontier LLMs reach only 30% on research math and fail to exceed 50% on refusing ill-posed questions.
-
neuralCAD-Edit: An Expert Benchmark for Multimodal-Instructed 3D CAD Model Editing
neuralCAD-Edit benchmark shows even the best foundation model (GPT 5.2) scores 53% lower than human CAD experts in acceptance trials for multimodal-instructed 3D model edits.
-
PolyBench: Benchmarking LLM Forecasting and Trading Capabilities on Live Prediction Market Data
Only two of seven LLMs produce positive returns on live Polymarket data, with MiMo-V2-Flash at 17.6% CWR and Gemini-3-Flash at 6.2% CWR while the other five lose money.
-
TRIAGE: Evaluating Prospective Metacognitive Control in LLMs under Resource Constraints
TRIAGE evaluates LLMs on prospective metacognitive control by requiring a single plan for task selection, sequencing, and token allocation under a calibrated budget, revealing substantial gaps in current models across...
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
Towards On-Policy Data Evolution for Visual-Native Multimodal Deep Search Agents
A new image-bank harness and closed-loop on-policy data evolution method raises multimodal agent performance on visual search benchmarks from 24.9% to 39.0% for an 8B model and from 30.6% to 41.5% for a 30B model.
-
MaD Physics: Evaluating information seeking under constraints in physical environments
MaD Physics is a new benchmark for evaluating AI agents on constrained information-seeking, model inference, and prediction in three physical environments with altered laws to avoid knowledge contamination.
-
LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs
TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.
-
DiagnosticIQ: A Benchmark for LLM-Based Industrial Maintenance Action Recommendation from Symbolic Rules
DiagnosticIQ benchmark shows frontier LLMs perform similarly on standard rule-to-action tasks but lose substantial accuracy under distractor expansion and condition inversion, pointing to calibration as the key deploy...
-
AcademiClaw: When Students Set Challenges for AI Agents
AcademiClaw is a new benchmark of 80 student-sourced academic tasks where the best frontier AI agents achieve only a 55% pass rate.
-
Reward Hacking Benchmark: Measuring Exploits in LLM Agents with Tool Use
The Reward Hacking Benchmark shows RL post-training raises exploit rates in tool-using LLM agents from 0.6% to 13.9%, with environmental hardening cutting exploits by 87.7% relative without lowering task success.
-
Super Apriel: One Checkpoint, Many Speeds
A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.
-
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Stargazer benchmarks AI agents on physics-constrained model fitting for astrophysical data, revealing that agents achieve statistical fits but often fail to recover correct physical parameters.
-
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Stargazer benchmark shows frontier AI agents achieve statistical fits to radial velocity data but frequently fail to recover correct physical planetary system parameters.
-
PokeGym: A Visually-Driven Long-Horizon Benchmark for Vision-Language Models
PokeGym is a new benchmark that tests VLMs on long-horizon tasks in a complex 3D game using only visual observations, identifying deadlock recovery as the primary failure mode.
-
GeoBrowse: A Geolocation Benchmark for Agentic Tool Use with Expert-Annotated Reasoning Traces
GeoBrowse is a two-level geolocation benchmark combining visual cue composition with knowledge-intensive multi-hop queries, paired with the GATE agent workflow that outperforms no-tool, search-only, and image-only baselines.
-
The limits of bio-molecular modeling with large language models: a cross-scale evaluation
LLMs perform adequately on bio-molecular classification tasks but remain weak on regression, with hybrid architectures outperforming others on long sequences and fine-tuning hurting generalization.
-
Measuring Five-Nines Reliability: Sample-Efficient LLM Evaluation in Saturated Benchmarks
Cross-entropy method sampling reduces inferences needed to estimate five-nines LLM reliability by up to 156x on parameterized GSM8K templates, revealing reliability differences hidden by saturated accuracy scores.
-
The Generalized Turing Test: A Foundation for Comparing Intelligence
The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.
-
EvoMAS: Learning Execution-Time Workflows for Multi-Agent Systems
EvoMAS trains a workflow adapter with policy gradients to dynamically instantiate stage-specific multi-agent workflows from a fixed agent pool, using explicit task-state construction and terminal success signals, and ...
-
A Semantic-Sampling Framework for Evaluating Calibration in Open-Ended Question Answering
Sem-ECE is an asymptotically unbiased calibration error estimator for open-ended QA that uses semantic sampling of answers to derive confidence from class frequencies, with two variants that diverge on hard questions.
-
Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models
MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...
-
Learning Agent Routing From Early Experience
BoundaryRouter routes queries to LLM or agent using early experience memory from a seed set, cutting inference time 60.6% versus always using agents and raising performance 28.6% versus always using direct LLM inference.
-
Cripping AI: Reimagining AI Through Lived Disability Experiences
Cripping AI is a proposed framework that dismantles ableist assumptions in AI by centering disabled ways of knowing and respecting disabled labor in co-creation.
-
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning
SciResearcher automates creation of diverse scientific reasoning tasks from academic evidence to train an 8B model that sets new SOTA at 19.46% on HLE-Bio/Chem-Gold and gains 13-15% on SuperGPQA-Hard-Biology and TRQA-...
-
ProEval: Proactive Failure Discovery and Efficient Performance Estimation for Generative AI Evaluation
ProEval is a proactive framework using pre-trained GPs, Bayesian quadrature, and superlevel set sampling to estimate performance and find failures in generative AI with 8-65x fewer samples than baselines.
-
Superminds Test: Actively Evaluating Collective Intelligence of Agent Society via Probing Agents
Large-scale experiments on two million agents reveal that collective intelligence does not emerge from scale alone due to sparse and shallow interactions.
-
Large Language Models Decide Early and Explain Later
LLMs settle on their answer after a minority of CoT tokens and produce an average 760 more as post-decision explanation, enabling early stopping that saves 500 tokens per query at a 2% accuracy cost.
-
ActuBench: A Multi-Agent LLM Pipeline for Generation and Evaluation of Actuarial Reasoning Tasks
ActuBench is a multi-agent LLM pipeline for generating and evaluating actuarial reasoning tasks, with evaluations of 50 models showing effective verification, competitive local open-weights models, and differing ranki...
-
Agent-World: Scaling Real-World Environment Synthesis for Evolving General Agent Intelligence
Agent-World autonomously synthesizes verifiable real-world tasks and uses continuous self-evolution to train 8B and 14B agents that outperform proprietary models on 23 benchmarks.
-
PRL-Bench: A Comprehensive Benchmark Evaluating LLMs' Capabilities in Frontier Physics Research
PRL-Bench evaluates frontier LLMs on 100 real physics research tasks and finds the best models score below 50, exposing a gap to autonomous discovery.
-
Frontier-Eng: Benchmarking Self-Evolving Agents on Real-World Engineering Tasks with Generative Optimization
Frontier-Eng is a new benchmark for generative optimization in engineering where agents iteratively improve designs under fixed interaction budgets using executable verifiers, with top models like GPT 5.4 showing limi...
-
Towards Knowledgeable Deep Research: Framework and Benchmark
The paper introduces the KDR task, HKA multi-agent framework, and KDR-Bench to enable LLM agents to integrate structured knowledge into deep research reports, with experiments showing outperformance over prior agents.
-
Select-then-Solve: Paradigm Routing as Inference-Time Optimization for LLM Agents
A learned embedding-based router selecting among six reasoning paradigms improves LLM agent accuracy from 47.6% to 53.1% on average, beating the best fixed paradigm by 2.8pp.
-
Representational Collapse in Multi-Agent LLM Committees: Measurement and Diversity-Aware Consensus
LLM agent committees exhibit representational collapse with mean cosine similarity of 0.888, and diversity-aware consensus reaches 87% accuracy on GSM8K versus 84% for self-consistency at lower cost.
-
MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention
MiniMax-M1 is a 456B parameter hybrid-attention MoE model trained with CISPO RL that achieves performance comparable or superior to DeepSeek-R1 and Qwen3-235B on reasoning and software engineering tasks while training...
-
LLMs Get Lost In Multi-Turn Conversation
LLMs drop 39% in performance during multi-turn conversations due to premature assumptions and inability to recover from early errors.
-
Instructions Shape Production of Language, not Processing
Instructions primarily shape the production stage of language models rather than the processing stage, with task-specific information and causal effects stronger in output tokens than input tokens.
-
Personalized Deep Research: A User-Centric Framework, Dataset, and Hybrid Evaluation for Knowledge Discovery
PDR is a user-context-aware framework for LLM research agents that improves report relevance over static baselines, supported by a new dataset and hybrid evaluation.
-
pAI/MSc: ML Theory Research with Humans on the Loop
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...
-
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale
EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.
-
Toward Human-AI Complementarity Across Diverse Tasks
Human-AI hybrids achieve only +0.4pp over AI alone on diverse tasks because confidence routing fails to identify the small set of cases where humans can correct AI errors.
-
COMPOSITE-Stem
COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.
-
GLM-5: from Vibe Coding to Agentic Engineering
GLM-5 is a foundation model that claims state-of-the-art results on coding benchmarks and superior performance on end-to-end software engineering tasks via new asynchronous RL methods and cost-saving DSA.
-
Kimi K2.5: Visual Agentic Intelligence
Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.
-
MiMo-V2-Flash Technical Report
MiMo-V2-Flash is a 309B/15B MoE model trained on 27T tokens with hybrid attention and multi-teacher on-policy distillation that matches larger models like DeepSeek-V3.2 while enabling 2.6x faster decoding via repurpos...
-
DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models
DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.
-
gpt-oss-120b & gpt-oss-20b Model Card
OpenAI releases two open-weight reasoning models, gpt-oss-120b and gpt-oss-20b, trained via distillation and RL with claimed strong results on math, coding, and safety benchmarks.
-
Kimi K2: Open Agentic Intelligence
Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.
-
Measuring AI Reasoning: A Guide for Researchers
Reasoning in language models should be measured by the faithfulness and validity of their multi-step search processes and intermediate traces, not final-answer accuracy.
-
Supplement Generation Training for Enhancing Agentic Task Performance
SGT trains a lightweight model to generate task-specific supplemental text that improves performance of a larger frozen LLM on agentic tasks without modifying the large model.
-
GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models
GLM-4.5, a 355B-parameter MoE model with hybrid reasoning, scores 70.1% on TAU-Bench, 91.0% on AIME 24, and 64.2% on SWE-bench Verified while ranking 3rd overall and 2nd on agentic benchmarks.
-
Benchmarking EngGPT2-16B-A3B against Comparable Italian and International Open-source LLMs
EngGPT2MoE-16B-A3B matches or beats other Italian models on most international benchmarks but trails top international models such as GPT-5 nano and Qwen3-8B.
-
Wan-Image: Pushing the Boundaries of Generative Visual Intelligence
Wan-Image is a unified multi-modal system that integrates LLMs and diffusion transformers to deliver professional-grade image generation features including complex typography, multi-subject consistency, and precise ed...
Reference graph
Works this paper leans on
-
[1]
C. Alberti, K. Lee, and M. Collins. A bert baseline for the natural questions, 2019. URL https: //arxiv.org/abs/1901.08634
-
[2]
AgentHarm: A Benchmark for Measuring Harmfulness of LLM Agents
M. Andriushchenko, A. Souly, M. Dziemian, D. Duenas, M. Lin, J. Wang, D. Hendrycks, A. Zou, Z. Kolter, M. Fredrikson, E. Winsor, J. Wynne, Y . Gal, and X. Davies. Agentharm: A benchmark for measuring harmfulness of llm agents, 2024. URLhttps://arxiv.org/abs/2410.09024
work page internal anchor Pith review arXiv 2024
-
[3]
The claude 3 model family: Opus, sonnet, haiku, 2024
Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://api. semanticscholar.org/CorpusID:268232499
2024
-
[4]
Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 son- net, 2024
Anthropic. Model card addendum: Claude 3.5 haiku and upgraded claude 3.5 son- net, 2024. URL https://assets.anthropic.com/m/1cd9d098ac3e6467/original/ Claude-3-Model-Card-October-Addendum.pdf
2024
-
[5]
Responsible scaling policy updates, 2024
Anthropic. Responsible scaling policy updates, 2024. URL https://www.anthropic.com/ rsp-updates
2024
-
[6]
R. K. Arora, J. Wei, R. S. Hicks, P. Bowman, J. Quiñonero-Candela, F. Tsimpourlas, M. Sharman, M. Shah, A. Vallone, A. Beutel, J. Heidecke, and K. Singhal. Healthbench: Evaluating large language models towards improved human health, 2025. URLhttps://arxiv.org/abs/2505.08775
work page internal anchor Pith review arXiv 2025
-
[7]
Austin, A
J. Austin, A. Odena, M. Nye, M. Bosma, H. Michalewski, D. Dohan, E. Jiang, C. Cai, M. Terry, Q. Le, and C. Sutton. Program synthesis with large language models, 2021. URL https://arxiv.org/abs/2108. 07732
2021
-
[8]
Y . Bai, A. Jones, K. Ndousse, A. Askell, A. Chen, N. DasSarma, D. Drain, S. Fort, D. Ganguli, T. Henighan, N. Joseph, S. Kadavath, J. Kernion, T. Conerly, S. El-Showk, N. Elhage, Z. Hatfield-Dodds, D. Hernandez, T. Hume, S. Johnston, S. Kravec, L. Lovitt, N. Nanda, C. Olsson, D. Amodei, T. Brown, J. Clark, S. Mc- Candlish, C. Olah, B. Mann, and J. Kaplan...
work page internal anchor Pith review Pith/arXiv arXiv 2022
-
[9]
MS MARCO: A Human Generated MAchine Reading COmprehension Dataset
P. Bajaj, D. Campos, N. Craswell, L. Deng, J. Gao, X. Liu, R. Majumder, A. McNamara, B. Mitra, T. Nguyen, M. Rosenberg, X. Song, A. Stoica, S. Tiwary, and T. Wang. Ms marco: A human generated machine reading comprehension dataset, 2018. URLhttps://arxiv.org/abs/1611.09268
work page internal anchor Pith review arXiv 2018
-
[10]
Purple llama CyberSecEval : A secure coding benchmark for language models
M. Bhatt, S. Chennabasappa, C. Nikolaidis, S. Wan, I. Evtimov, D. Gabi, D. Song, F. Ahmad, C. Ascher- mann, L. Fontana, S. Frolov, R. P. Giri, D. Kapil, Y . Kozyrakis, D. LeBlanc, J. Milazzo, A. Straumann, G. Synnaeve, V . V ontimitta, S. Whitman, and J. Saxe. Purple llama cyberseceval: A secure coding benchmark for language models, 2023. URLhttps://arxiv...
- [11]
-
[12]
M. Chen, J. Tworek, H. Jun, Q. Yuan, H. P. de Oliveira Pinto, J. Kaplan, H. Edwards, Y . Burda, N. Joseph, G. Brockman, A. Ray, R. Puri, G. Krueger, M. Petrov, H. Khlaaf, G. Sastry, P. Mishkin, B. Chan, S. Gray, N. Ryder, M. Pavlov, A. Power, L. Kaiser, M. Bavarian, C. Winter, P. Tillet, F. P. Such, D. Cummings, M. Plappert, F. Chantzis, E. Barnes, A. Her...
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[13]
Arc prize 2024: Technical report
F. Chollet, M. Knoop, G. Kamradt, and B. Landers. Arc prize 2024: Technical report, 2024. URL https://arxiv.org/abs/2412.04604
-
[14]
Training Verifiers to Solve Math Word Problems
K. Cobbe, V . Kosaraju, M. Bavarian, M. Chen, H. Jun, L. Kaiser, M. Plappert, J. Tworek, J. Hilton, R. Nakano, C. Hesse, and J. Schulman. Training verifiers to solve math word problems, 2021. URL https://arxiv.org/abs/2110.14168
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[15]
Deepseek-v3 technical report, 2024
DeepSeek-AI. Deepseek-v3 technical report, 2024. URL https://github.com/deepseek-ai/ DeepSeek-V3/blob/main/DeepSeek_V3.pdf
2024
-
[16]
D. Dua, Y . Wang, P. Dasigi, G. Stanovsky, S. Singh, and M. Gardner. Drop: A reading comprehension benchmark requiring discrete reasoning over paragraphs, 2019. URL https://arxiv.org/abs/1903. 00161. 10
2019
-
[17]
A. Dubey et al. The llama 3 herd of models, 2024. URLhttps://arxiv.org/abs/2407.21783
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[18]
B. Gao, F. Song, Z. Yang, Z. Cai, Y . Miao, Q. Dong, L. Li, C. Ma, L. Chen, R. Xu, Z. Tang, B. Wang, D. Zan, S. Quan, G. Zhang, L. Sha, Y . Zhang, X. Ren, T. Liu, and B. Chang. Omni-math: A universal olympiad level mathematic benchmark for large language models, 2024. URL https://arxiv.org/abs/2410.07985
-
[19]
E. Glazer, E. Erdil, T. Besiroglu, D. Chicharro, E. Chen, A. Gunning, C. F. Olsson, J.-S. Denain, A. Ho, E. de Oliveira Santos, O. Järviniemi, M. Barnett, R. Sandler, J. Sevilla, Q. Ren, E. Pratt, L. Levine, G. Barkley, N. Stewart, B. Grechuk, T. Grechuk, and S. V . Enugandla. Frontiermath: A benchmark for evaluating advanced mathematical reasoning in ai,...
-
[20]
C. He, R. Luo, Y . Bai, S. Hu, Z. L. Thai, J. Shen, J. Hu, X. Han, Y . Huang, Y . Zhang, J. Liu, L. Qi, Z. Liu, and M. Sun. Olympiadbench: A challenging benchmark for promoting agi with olympiad-level bilingual multimodal scientific problems, 2024. URLhttps://arxiv.org/abs/2402.14008
work page internal anchor Pith review arXiv 2024
-
[21]
Measuring Coding Challenge Competence With APPS
D. Hendrycks, S. Basart, S. Kadavath, M. Mazeika, A. Arora, E. Guo, C. Burns, S. Puranik, H. He, D. Song, and J. Steinhardt. Measuring coding challenge competence with apps, 2021. URL https: //arxiv.org/abs/2105.09938
work page internal anchor Pith review arXiv 2021
-
[22]
Measuring Massive Multitask Language Understanding
D. Hendrycks, C. Burns, S. Basart, A. Zou, M. Mazeika, D. Song, and J. Steinhardt. Measuring massive multitask language understanding, 2021. URLhttps://arxiv.org/abs/2009.03300
work page internal anchor Pith review Pith/arXiv arXiv 2021
-
[23]
Hendrycks, C
D. Hendrycks, C. Burns, S. Kadavath, A. Arora, S. Basart, E. Tang, D. Song, and J. Steinhardt. Measuring mathematical problem solving with the math dataset, 2021. URL https://arxiv.org/abs/2103. 03874
2021
-
[24]
D. Hendrycks, A. Zou, M. Mazeika, L. Tang, B. Li, D. Song, and J. Steinhardt. Pixmix: Dreamlike pictures comprehensively improve safety measures, 2022. URLhttps://arxiv.org/abs/2112.05135
-
[25]
Hosseini, A
A. Hosseini, A. Sordoni, D. Toyama, A. Courville, and R. Agarwal. Not all llm reasoners are created equal,
- [26]
-
[27]
Jacovi, A
A. Jacovi, A. Wang, C. Alberti, C. Tao, J. Lipovetz, K. Olszewska, L. Haas, M. Liu, N. Keating, A. Bloniarz, C. Saroufim, C. Fry, D. Marcus, D. Kukliansky, G. S. Tomar, J. Swirhun, J. Xing, L. W. andMadhu Gurumurthy, M. Aaron, M. Ambar, R. Fellinger, R. Wang, R. Sims, Z. Zhang, S. Goldshtein, and D. Das. Facts leaderboard. https://kaggle.com/facts-leaderb...
2024
-
[28]
C. E. Jimenez, J. Yang, A. Wettig, S. Yao, K. Pei, O. Press, and K. Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URLhttps://arxiv.org/abs/2310.06770
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[29]
arXiv:2104.14337 (2021), https://arxiv.org/abs/2104.14337
D. Kiela, M. Bartolo, Y . Nie, D. Kaushik, A. Geiger, Z. Wu, B. Vidgen, G. Prasad, A. Singh, P. Ringshia, Z. Ma, T. Thrush, S. Riedel, Z. Waseem, P. Stenetorp, R. Jia, M. Bansal, C. Potts, and A. Williams. Dynabench: Rethinking benchmarking in nlp, 2021. URLhttps://arxiv.org/abs/2104.14337
-
[30]
P. Kumar, E. Lau, S. Vijayakumar, T. Trinh, S. R. Team, E. Chang, V. Robinson, S. Hendryx, S. Zhou, M. Fredrikson, S. Yue, and Z. Wang. Refusal-trained llms are easily jailbroken as browser agents, 2024. URL https://arxiv.org/abs/2410.13886
-
[31]
J. M. Laurent, J. D. Janizek, M. Ruzo, M. M. Hinks, M. J. Hammerling, S. Narayanan, M. Ponnapati, A. D. White, and S. G. Rodriques. Lab-bench: Measuring capabilities of language models for biology research,
- [32]
-
[33]
N. Li, A. Pan, A. Gopal, S. Yue, D. Berrios, A. Gatti, J. D. Li, A.-K. Dombrowski, S. Goel, L. Phan, G. Mukobi, N. Helm-Burger, R. Lababidi, L. Justen, A. B. Liu, M. Chen, I. Barrass, O. Zhang, X. Zhu, R. Tamirisa, B. Bharathi, A. Khoja, Z. Zhao, A. Herbert-Voss, C. B. Breuer, S. Marks, O. Patel, A. Zou, M. Mazeika, Z. Wang, P. Oswal, W. Lin, A. A. Hunt,...
-
[34]
P. Lu, H. Bansal, T. Xia, J. Liu, C. Li, H. Hajishirzi, H. Cheng, K.-W. Chang, M. Galley, and J. Gao. Mathvista: Evaluating mathematical reasoning of foundation models in visual contexts, 2024. URL https://arxiv.org/abs/2310.02255
- [35]
- [36]
-
[37]
OpenAI. OpenAI o1 system card, 2024. URL https://cdn.openai.com/o1-system-card-20240917.pdf
-
[38]
OpenAI. OpenAI and Los Alamos National Laboratory announce bioscience research partnership, 2024. URL https://openai.com/index/openai-and-los-alamos-national-laboratory-work-together/
-
[39]
OpenAI. Introducing swe-bench verified, 2024. URL https://openai.com/index/introducing-swe-bench-verified/
-
[40]
OpenAI et al. Gpt-4 technical report, 2024. URL https://arxiv.org/abs/2303.08774
-
[41]
S. Ott, A. Barbosa-Silva, K. Blagec, J. Brauner, and M. Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications, 13(1):6793, 2022
- [42]
-
[43]
arXiv preprint arXiv:2212.09251
E. Perez, S. Ringer, K. Lukošiūtė, K. Nguyen, E. Chen, S. Heiner, C. Pettit, C. Olsson, S. Kundu, S. Kadavath, A. Jones, A. Chen, B. Mann, B. Israel, B. Seethor, C. McKinnon, C. Olah, D. Yan, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, G. Khundadze, J. Kernion, J. Landis, J. Kerr, J. Mueller, J. Hyun, J. Landau, K. Ndousse, L. Goldberg, L. ...
-
[44]
M. Phuong, M. Aitchison, E. Catt, S. Cogan, A. Kaskasoli, V. Krakovna, D. Lindner, M. Rahtz, Y. Assael, S. Hodkinson, H. Howard, T. Lieberum, R. Kumar, M. A. Raad, A. Webson, L. Ho, S. Lin, S. Farquhar, M. Hutter, G. Deletang, A. Ruoss, S. El-Sayed, S. Brown, A. Dragan, R. Shah, A. Dafoe, and T. Shevlane. Evaluating frontier models for dangerous capabil...
2024
-
[45]
P. Rajpurkar, J. Zhang, K. Lopyrev, and P. Liang. Squad: 100,000+ questions for machine comprehension of text, 2016. URL https://arxiv.org/abs/1606.05250
-
[46]
P. Rajpurkar, R. Jia, and P. Liang. Know what you don’t know: Unanswerable questions for squad, 2018. URL https://arxiv.org/abs/1806.03822
-
[47]
D. Rein, B. L. Hou, A. C. Stickland, J. Petty, R. Y. Pang, J. Dirani, J. Michael, and S. R. Bowman. Gpqa: A graduate-level google-proof q&a benchmark, 2023. URL https://arxiv.org/abs/2311.12022
-
[48]
K. Singhal, S. Azizi, T. Tu, S. S. Mahdavi, J. Wei, H. W. Chung, N. Scales, A. Tanwani, H. Cole-Lewis, S. Pfohl, et al. Large language models encode clinical knowledge. Nature, 620(7972):172–180, 2023
-
[49]
M. Skarlinski, J. Laurent, A. Bou, and A. White. About 30% of Humanity’s Last Exam chemistry/biology answers are likely wrong, July 2025. URL https://www.futurehouse.org/research-announcements/hle-exam
-
[50]
V. K. Srinivasan, Z. Dong, B. Zhu, B. Yu, H. Mao, D. Mosk-Aoyama, K. Keutzer, J. Jiao, and J. Zhang. Nexusraven: A commercially-permissive language model for function calling. In NeurIPS 2023 Foundation Models for Decision Making Workshop, 2023. URL https://openreview.net/forum?id=5lcPe6DqfI
-
[51]
Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models
A. Srivastava, A. Rastogi, A. Rao, A. A. M. Shoeb, A. Abid, A. Fisch, A. R. Brown, A. Santoro, A. Gupta, A. Garriga-Alonso, A. Kluska, A. Lewkowycz, A. Agarwal, A. Power, A. Ray, A. Warstadt, A. W. Kocurek, A. Safaya, A. Tazarv, A. Xiang, A. Parrish, A. Nie, A. Hussain, A. Askell, A. Dsouza, A. Slone, A. Rahane, A. S. Iyer, A. Andreassen, A. Madotto, A. S...
2023
- [52]
-
[53]
G. Team et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context,
-
[54]
URL https://arxiv.org/abs/2403.05530
-
[55]
G. Tsoukalas, J. Lee, J. Jennings, J. Xin, M. Ding, M. Jennings, A. Thakur, and S. Chaudhuri. Putnambench: Evaluating neural theorem-provers on the putnam mathematical competition, 2024. URL https://arxiv.org/abs/2407.11214
-
[56]
A. Wang, A. Singh, J. Michael, F. Hill, O. Levy, and S. R. Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding, 2019. URL https://arxiv.org/abs/1804.07461
- [57]
-
[58]
Y. Wang, X. Ma, G. Zhang, Y. Ni, A. Chandra, S. Guo, W. Ren, A. Arulraj, X. He, Z. Jiang, T. Li, M. Ku, K. Wang, A. Zhuang, R. Fan, X. Yue, and W. Chen. Mmlu-pro: A more robust and challenging multi-task language understanding benchmark (published at neurips 2024 track datasets and benchmarks), 2024. URL https://arxiv.org/abs/2406.01574
- [59]
-
[60]
H. Wijk, T. Lin, J. Becker, S. Jawhar, N. Parikh, T. Broadley, L. Chan, M. Chen, J. Clymer, J. Dhyani, E. Ericheva, K. Garcia, B. Goodrich, N. Jurkovic, M. Kinniment, A. Lajko, S. Nix, L. Sato, W. Saunders, M. Taran, B. West, and E. Barnes. Re-bench: Evaluating frontier ai r&d capabilities of language model agents against human experts, 2024. URL https://a...
-
[61]
xAI. Grok-2 beta release, 2024. URL https://x.ai/blog/grok-2
-
[62]
F. Yan, H. Mao, C. C.-J. Ji, T. Zhang, S. G. Patil, I. Stoica, and J. E. Gonzalez. Berkeley function calling leaderboard. https://gorilla.cs.berkeley.edu/blogs/8_berkeley_function_calling_leaderboard.html, 2024
-
[63]
Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. W. Cohen, R. Salakhutdinov, and C. D. Manning. Hotpotqa: A dataset for diverse, explainable multi-hop question answering, 2018. URL https://arxiv.org/abs/1809.09600
-
[64]
S. Yao, N. Shinn, P. Razavi, and K. Narasimhan. τ-bench: A benchmark for tool-agent-user interaction in real-world domains, 2024. URL https://arxiv.org/abs/2406.12045
-
[65]
A. K. Zhang, N. Perry, R. Dulepet, J. Ji, J. W. Lin, E. Jones, C. Menders, G. Hussein, S. Liu, D. Jasper, P. Peetathawatchai, A. Glenn, V. Sivashankar, D. Zamoshchin, L. Glikbarg, D. Askaryar, M. Yang, T. Zhang, R. Alluri, N. Tran, R. Sangpisit, P. Yiorkadjis, K. Osele, G. Raghupathi, D. Boneh, D. E. Ho, and P. Liang. Cybench: A framework for evaluating ...
-
[66]
W. Zhong, R. Cui, Y. Guo, Y. Liang, S. Lu, Y. Wang, A. Saied, W. Chen, and N. Duan. Agieval: A human-centric benchmark for evaluating foundation models, 2023. URL https://arxiv.org/abs/2304.06364.
A Authors
We offered optional co-authorship to all question submitters with an accepted question in HUMANITY’S LAST EXAM (including both public and private...
-
[67]
Independent Researcher
-
[68]
University of California, Berkeley
-
[69]
Massachusetts Institute of Technology
-
[70]
University of Cambridge
-
[71]
University of Oxford
-
[72]
Princeton University
-
[73]
Carnegie Mellon University
-
[74]
University of Chicago
-
[75]
University of Michigan
-
[76]
École Polytechnique Fédérale de Lausanne
-
[77]
University of Toronto
-
[78]
University of Illinois Urbana-Champaign
-
[79]
Washington University
-
[80]
University of Wisconsin-Madison