pith. machine review for the scientific record.

arxiv: 2406.01574 · v6 · submitted 2024-06-03 · 💻 cs.CL

Recognition: 3 theorem links · Lean Theorem

MMLU-Pro: A More Robust and Challenging Multi-Task Language Understanding Benchmark

Aaran Arulraj, Abhranil Chandra, Alex Zhuang, Ge Zhang, Kai Wang, Max Ku, Rongqi Fan, Shiguang Guo, Tianle Li, Weiming Ren, Wenhu Chen, Xiang Yue, Xuan He, Xueguang Ma, Yuansheng Ni, Yubo Wang, Ziyan Jiang

Pith reviewed 2026-05-11 15:45 UTC · model grok-4.3

classification 💻 cs.CL
keywords MMLU · benchmark · large language models · reasoning · multi-task understanding · prompt sensitivity · evaluation

The pith

MMLU-Pro creates a tougher benchmark that drops model accuracy by 16 to 33 percent and cuts prompt sensitivity to 2 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces MMLU-Pro to address plateauing model performance on the original MMLU benchmark. The authors add more reasoning-intensive questions, expand answer choices from four to ten, and remove easy or noisy items; the resulting benchmark yields substantially lower accuracies and more consistent results across prompts. A reader would care because it allows clearer comparisons between models as overall performance keeps improving. The changes also make chain-of-thought prompting more useful than it was on the original benchmark.

Core claim

The central discovery is that MMLU-Pro, by integrating challenging reasoning questions and ten-option choices while removing trivial ones from MMLU, causes accuracy to fall 16% to 33% across models and reduces score variation from prompt changes to only 2%. This setup makes the benchmark more stable and better at revealing differences in capabilities, with chain-of-thought methods now outperforming direct answers unlike in the original MMLU.
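
The stability half of this claim is a spread of scores across prompt templates (the paper tests 24 styles). Below is a minimal sketch of how such a sensitivity figure could be computed, under the assumption that it is the spread of one model's accuracy across prompts; the toy values are invented and the paper's exact metric may differ.

    from statistics import mean, pstdev

    def prompt_sensitivity(per_prompt_accuracies):
        """Return (mean accuracy, max-min spread, std dev) across prompt styles."""
        accs = list(per_prompt_accuracies)
        return mean(accs), max(accs) - min(accs), pstdev(accs)

    # Toy values (not from the paper): roughly a 5-point spread vs a 2-point spread.
    mmlu_like = [0.71, 0.69, 0.73, 0.68, 0.72]
    mmlu_pro_like = [0.55, 0.56, 0.54, 0.55, 0.56]
    print(prompt_sensitivity(mmlu_like))
    print(prompt_sensitivity(mmlu_pro_like))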

What carries the argument

MMLU-Pro itself, the revised dataset built by curating harder questions and increasing the number of answer choices to ten.

If this is right

  • Models must demonstrate stronger reasoning to achieve high performance.
  • Chain-of-thought prompting provides a clearer advantage on this benchmark.
  • Evaluations become less dependent on the specific wording of instructions.
  • Progress in the field can be tracked more accurately without saturation effects.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The original benchmark may have overstated model abilities by including too many straightforward questions.
  • The contrast in chain-of-thought effectiveness suggests MMLU-Pro better isolates reasoning demands from recall.
  • Similar filtering and expansion could be applied to other multi-task tests to improve their stability.

Load-bearing premise

That the added questions genuinely test reasoning skills and the removed ones were not essential to the benchmark's coverage.

What would settle it

If a new model scores as high on MMLU-Pro as on MMLU while using only direct answering without reasoning steps, or if different prompts still cause large score swings.
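
Read literally, that test can be written down as a small check; the thresholds below are illustrative choices, not values from the paper.

    # Hedged sketch of the falsification check described above.
    def claims_undermined(mmlu_direct, pro_direct, pro_prompt_scores,
                          gap_threshold=0.05, swing_threshold=0.04):
        """True if MMLU-Pro looks no harder than MMLU under direct answering,
        or if scores still swing widely across prompts."""
        no_gap = (mmlu_direct - pro_direct) < gap_threshold
        big_swing = (max(pro_prompt_scores) - min(pro_prompt_scores)) > swing_threshold
        return no_gap or big_swing

    # Toy example: a 20-point gap and a 2-point prompt swing leave the claims standing.
    print(claims_undermined(0.78, 0.58, [0.57, 0.58, 0.59, 0.58]))  # False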

read the original abstract

In the age of large-scale language models, benchmarks like the Massive Multitask Language Understanding (MMLU) have been pivotal in pushing the boundaries of what AI can achieve in language comprehension and reasoning across diverse domains. However, as models continue to improve, their performance on these benchmarks has begun to plateau, making it increasingly difficult to discern differences in model capabilities. This paper introduces MMLU-Pro, an enhanced dataset designed to extend the mostly knowledge-driven MMLU benchmark by integrating more challenging, reasoning-focused questions and expanding the choice set from four to ten options. Additionally, MMLU-Pro eliminates the trivial and noisy questions in MMLU. Our experimental results show that MMLU-Pro not only raises the challenge, causing a significant drop in accuracy by 16% to 33% compared to MMLU but also demonstrates greater stability under varying prompts. With 24 different prompt styles tested, the sensitivity of model scores to prompt variations decreased from 4-5% in MMLU to just 2% in MMLU-Pro. Additionally, we found that models utilizing Chain of Thought (CoT) reasoning achieved better performance on MMLU-Pro compared to direct answering, which is in stark contrast to the findings on the original MMLU, indicating that MMLU-Pro includes more complex reasoning questions. Our assessments confirm that MMLU-Pro is a more discriminative benchmark to better track progress in the field.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MMLU-Pro, an enhanced version of the MMLU benchmark that adds more challenging reasoning-focused questions, expands answer options from 4 to 10, and removes trivial or noisy items. Experiments across multiple LLMs report accuracy drops of 16-33%, reduced prompt sensitivity (from 4-5% to 2% across 24 styles), and greater gains from Chain-of-Thought prompting relative to direct answering; on this basis the paper claims MMLU-Pro is a more discriminative benchmark for tracking model progress.

Significance. If the curation and empirical claims hold, MMLU-Pro would provide a timely, more robust benchmark that addresses performance plateaus on MMLU by better distinguishing capabilities through harder reasoning items and improved stability. The direct empirical testing on prompt variations and CoT effects across models is a strength, offering falsifiable comparisons that could guide future benchmark design.

major comments (3)
  1. [§2] §2 (Dataset Construction): The criteria for identifying and removing 'trivial and noisy' questions from MMLU and for sourcing/validating the added 'more challenging, reasoning-focused' questions are not specified with objective, reproducible metrics such as difficulty calibration, inter-annotator agreement, or controls for knowledge distribution shifts. This directly undermines the central claim that observed accuracy drops and reduced prompt sensitivity reflect intrinsic improvements rather than curator-induced selection bias.
  2. [§4] §4 (Experiments and Results): The claim of prompt sensitivity reduced to 2% (versus 4-5% on MMLU) across 24 styles lacks reported statistical significance, confidence intervals, or variance estimates, making it impossible to assess whether the stability improvement is robust or could be due to sampling variability in the tested models and prompts.
  3. [§3] §3 (Evaluation Setup): No analysis is provided on whether the domain/topic distribution of MMLU-Pro matches the original MMLU after removals and additions, or on how the expansion to 10 options interacts with the reasoning-focused claim; without this, the 16-33% accuracy drop cannot be confidently attributed to increased reasoning demands rather than distributional shifts.
minor comments (2)
  1. [Abstract] The abstract and introduction use 'reasoning-focused' without a formal definition or concrete examples of what distinguishes these questions from knowledge-driven ones in MMLU.
  2. [Figures/Tables] Figure captions and table headers could more explicitly note the number of models and prompt styles used in each comparison to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address each major comment point-by-point below, providing the strongest honest defense of our work while indicating where we will revise the paper to improve clarity, transparency, and rigor.

read point-by-point responses
  1. Referee: [§2] §2 (Dataset Construction): The criteria for identifying and removing 'trivial and noisy' questions from MMLU and for sourcing/validating the added 'more challenging, reasoning-focused' questions are not specified with objective, reproducible metrics such as difficulty calibration, inter-annotator agreement, or controls for knowledge distribution shifts. This directly undermines the central claim that observed accuracy drops and reduced prompt sensitivity reflect intrinsic improvements rather than curator-induced selection bias.

    Authors: We agree that the manuscript would benefit from greater explicitness regarding curation criteria to enhance reproducibility and address potential bias concerns. Trivial questions were identified via pilot evaluations on high-performing models showing near-ceiling performance without reasoning, while noisy items were removed through expert manual review for ambiguities or factual issues. Added questions were sourced from advanced materials requiring multi-step reasoning. In revision, we will expand §2 with a dedicated subsection providing concrete examples of removed and added questions, along with the heuristics and thresholds applied. Although formal inter-annotator agreement was not computed in the initial process (relying on domain-expert consensus), we will report agreement on a re-reviewed sample. This added transparency will allow better assessment of whether improvements stem from intrinsic properties rather than selection. revision: partial

  2. Referee: [§4] §4 (Experiments and Results): The claim of prompt sensitivity reduced to 2% (versus 4-5% on MMLU) across 24 styles lacks reported statistical significance, confidence intervals, or variance estimates, making it impossible to assess whether the stability improvement is robust or could be due to sampling variability in the tested models and prompts.

    Authors: We concur that statistical rigor would strengthen this empirical claim. The reported 2% figure represents the observed range of variation across the 24 prompt styles, but we did not include variance or significance tests in the initial submission. In the revised version, we will add standard deviation and variance estimates for accuracies under different prompts, 95% confidence intervals, and statistical tests (e.g., paired comparisons) to evaluate whether the reduction in sensitivity is significant. These will be integrated into §4 to permit readers to assess robustness against sampling variability. revision: yes

  3. Referee: [§3] §3 (Evaluation Setup): No analysis is provided on whether the domain/topic distribution of MMLU-Pro matches the original MMLU after removals and additions, or on how the expansion to 10 options interacts with the reasoning-focused claim; without this, the 16-33% accuracy drop cannot be confidently attributed to increased reasoning demands rather than distributional shifts.

    Authors: We recognize the value of distributional analysis to support attribution of the accuracy drop. We will revise §3 to include a direct comparison of subject and topic proportions between MMLU and MMLU-Pro, confirming that the overall distribution remains largely consistent with only targeted adjustments. On the 10-option expansion, we will add discussion clarifying that it primarily enhances option discrimination and lowers random guessing (from 25% to 10%), which works in tandem with the reasoning emphasis; the larger CoT gains on MMLU-Pro provide supporting evidence for increased reasoning demands. While a full factorial ablation isolating every factor is beyond the current scope, the added analysis will help readers evaluate the contribution of distributional shifts versus reasoning complexity. revision: yes
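
Responses 2 and 3 above promise variance and confidence-interval estimates over per-prompt accuracies and lean on the lower guessing floor of ten options. A standard-library sketch of both follows, with invented toy numbers rather than results from the paper.

    from math import sqrt
    from statistics import mean, stdev

    def ci95(values):
        """Normal-approximation 95% confidence interval for the mean."""
        m = mean(values)
        se = stdev(values) / sqrt(len(values))
        return m - 1.96 * se, m + 1.96 * se

    # Per-prompt accuracies for one model (toy values; the paper uses 24 styles).
    mmlu_accs = [0.71, 0.69, 0.73, 0.68, 0.72, 0.70]
    pro_accs = [0.55, 0.56, 0.54, 0.55, 0.56, 0.55]
    print("MMLU spread:", max(mmlu_accs) - min(mmlu_accs), "CI:", ci95(mmlu_accs))
    print("Pro spread:", max(pro_accs) - min(pro_accs), "CI:", ci95(pro_accs))

    # Guessing-floor arithmetic from response 3: with k uniform options a random
    # guesser scores 1/k, so the floor drops from 25% (4 options) to 10% (10 options).
    for k in (4, 10):
        print(f"{k} options -> random-guess accuracy = {1/k:.0%}")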

Circularity Check

0 steps flagged

No circularity; empirical benchmark evaluation is self-contained

full rationale

The paper introduces MMLU-Pro through dataset curation (adding reasoning-focused questions, removing trivial ones, expanding options to 10) and reports direct empirical results from testing multiple models under varied prompts and CoT conditions. No mathematical derivations, equations, predictions, or first-principles claims exist that could reduce to inputs by construction. All metrics (16-33% accuracy drop, 2% prompt sensitivity) are measured outcomes, not fitted or self-defined quantities. No self-citations are load-bearing for the central claims, and the work is externally falsifiable via replication on the released dataset.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the premise that expanding options and filtering for reasoning questions produces a superior benchmark; no free parameters are fitted, and no new entities are postulated.

axioms (1)
  • domain assumption Harder reasoning questions plus more answer options yield a more discriminative and stable benchmark than the original MMLU.
    Invoked throughout the abstract when interpreting accuracy drops and prompt stability as evidence of improvement.

pith-pipeline@v0.9.0 · 5619 in / 1289 out tokens · 103178 ms · 2026-05-11T15:45:21.647581+00:00 · methodology

discussion (0)


Forward citations

Cited by 39 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. MMMU-Pro: A More Robust Multi-discipline Multimodal Understanding Benchmark

    cs.CL 2024-09 accept novelty 8.0

    MMMU-Pro is a stricter multimodal benchmark that removes text-only solvable questions, augments options, and requires reading text from images, yielding substantially lower model scores of 16.8-26.9%.

  2. Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

    cs.CL 2026-05 unverdicted novelty 7.0

    A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

  3. LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

    cs.AI 2026-05 unverdicted novelty 7.0

    TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.

  4. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  5. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 7.0

    AOPD modifies on-policy distillation by using localized divergence minimization for non-positive advantages instead of negative reinforcement, yielding average gains of 4.09/8.34 over standard OPD on math reasoning be...

  6. CXR-ContraBench: Benchmarking Negated-Option Attraction in Medical VLMs

    cs.CV 2026-05 conditional novelty 7.0

    Medical VLMs frequently select negated options that contradict visible chest X-ray findings, achieving only ~30% accuracy on direct presence probes, but a post-hoc consistency verifier raises accuracy above 95%.

  7. MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

    eess.IV 2026-05 unverdicted novelty 7.0

    MRI-Eval benchmark shows frontier LLMs scoring 93-97% on MRI MCQs but falling to 37-61% on stem-only questions, with GE scanner operations as the weakest category for all models.

  8. Instruction Complexity Induces Positional Collapse in Adversarial LLM Evaluation

    cs.CL 2026-04 unverdicted novelty 7.0

    Complex adversarial instructions induce positional collapse in LLMs, with extreme cases showing 99.9% concentration on a single response position and zero content sensitivity.

  9. Analysis and Explainability of LLMs Via Evolutionary Methods

    cs.NE 2026-04 unverdicted novelty 7.0

    Evolutionary trees from LLM weights recover ground-truth training topologies and identify key datasets and layers through phenotypic analysis.

  10. Efficient Test-Time Inference via Deterministic Exploration of Truncated Decoding Trees

    cs.LG 2026-04 unverdicted novelty 7.0

    Distinct Leaf Enumeration (DLE) replaces stochastic self-consistency sampling with deterministic traversal of a truncated decoding tree to enumerate distinct leaves, increasing coverage and reducing redundant computat...

  11. Super Apriel: One Checkpoint, Many Speeds

    cs.LG 2026-04 unverdicted novelty 7.0

    A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.

  12. Remask, Don't Replace: Token-to-Mask Refinement in Diffusion Large Language Models

    cs.CL 2026-04 unverdicted novelty 7.0

    Token-to-Mask remasking improves self-correction in diffusion LLMs by resetting erroneous commitments to masks rather than overwriting them, yielding +13.33 points on AIME 2025 and +8.56 on CMATH.

  13. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  14. MARS: Enabling Autoregressive Models Multi-Token Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.

  15. Video-MMMU: Evaluating Knowledge Acquisition from Multi-Discipline Professional Videos

    cs.CV 2025-01 unverdicted novelty 7.0

    Video-MMMU benchmark shows large multimodal models exhibit steep performance drops on higher cognitive tasks when learning from professional videos and lag significantly behind humans in knowledge acquisition.

  16. SlimQwen: Exploring the Pruning and Distillation in Large MoE Model Pre-training

    cs.LG 2026-05 unverdicted novelty 6.0

    Pruning pretrained MoE models outperforms training from scratch, different compression methods converge after continued pretraining, and combining KD with language modeling loss plus progressive schedules yields a com...

  17. Rotation-Preserving Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

  18. An Interpretable and Scalable Framework for Evaluating Large Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.

  19. Option-Order Randomisation Reveals a Distributional Position Attractor in Prompted Sandbagging

    cs.CL 2026-04 unverdicted novelty 6.0

    Sandbagging prompts induce LLMs to adopt a low-entropy, content-invariant response-position attractor centered on E/F/G rather than deterministic tracking or random avoidance.

  20. Large Language Models Decide Early and Explain Later

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs settle on their answer after a minority of CoT tokens and produce an average 760 more as post-decision explanation, enabling early stopping that saves 500 tokens per query at a 2% accuracy cost.

  21. Decoupled DiLoCo for Resilient Distributed Pre-training

    cs.CL 2026-04 unverdicted novelty 6.0

    Decoupled DiLoCo enables asynchronous distributed pre-training with zero global downtime under simulated failures while preserving competitive performance on text and vision tasks.

  22. COMPASS: COntinual Multilingual PEFT with Adaptive Semantic Sampling

    cs.LG 2026-04 unverdicted novelty 6.0

    COMPASS uses semantic clustering on multilingual embeddings to select auxiliary data for PEFT adapters, outperforming linguistic-similarity baselines on multilingual benchmarks while supporting continual adaptation.

  23. Towards Disentangled Preference Optimization Dynamics: Suppress the Loser, Preserve the Winner

    cs.LG 2026-04 unverdicted novelty 6.0

    A unified incentive-score decomposition of preference optimization reveals the disentanglement band condition and reward calibration method that enables suppressing losers while preserving winners in LLM training.

  24. Rethinking Generalization in Reasoning SFT: A Conditional Analysis on Optimization, Data, and Model Capability

    cs.AI 2026-04 unverdicted novelty 6.0

    Reasoning SFT generalizes cross-domain conditionally on sufficient optimization, high-quality long-CoT data, and strong base models, while degrading safety.

  25. Can LLMs Learn to Reason Robustly under Noisy Supervision?

    cs.LG 2026-04 conditional novelty 6.0

    Online Label Refinement lets LLMs learn robust reasoning from noisy supervision by correcting labels when majority answers show rising rollout success and stable history, delivering 3-4% gains on math and reasoning be...

  26. Structured Multi-Criteria Evaluation of Large Language Models with Fuzzy Analytic Hierarchy Process and DualJudge

    cs.AI 2026-04 unverdicted novelty 6.0

    Fuzzy AHP and DualJudge deliver more stable and calibrated LLM evaluations than direct scoring by breaking assessments into explicit criteria and adaptively fusing intuitive and deliberative judgments.

  27. Rubrics to Tokens: Bridging Response-level Rubrics and Token-level Rewards in Instruction Following Tasks

    cs.CL 2026-04 unverdicted novelty 6.0

    RTT bridges response-level rubrics to token-level rewards via a relevance discriminator and intra-sample group normalization, yielding higher instruction and rubric accuracy than baselines.

  28. InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    cs.CV 2025-08 unverdicted novelty 6.0

    InternVL3.5 advances open-source multimodal models with Cascade RL for +16% reasoning gains and ViR for 4x inference speedup, with the 241B model reaching SOTA among open-source MLLMs on multimodal, reasoning, and age...

  29. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 5.0

    Asymmetric On-Policy Distillation improves on-policy distillation by using divergence minimization instead of negative reinforcement in low-advantage regions, yielding 4-8 point gains on math reasoning benchmarks whil...

  30. Asymmetric On-Policy Distillation: Bridging Exploitation and Imitation at the Token Level

    cs.LG 2026-05 unverdicted novelty 5.0

    Asymmetric On-Policy Distillation replaces ineffective negative reinforcement with localized divergence minimization in low-advantage regions, yielding 4.09-8.34 point gains over standard OPD on math reasoning benchmarks.

  31. Cooperative Profiles Predict Multi-Agent LLM Team Performance in AI for Science Workflows

    cs.CL 2026-04 unverdicted novelty 5.0

    Cooperative profiles from behavioral economics games predict LLM team performance in AI-for-science workflows.

  32. Kimi K2.5: Visual Agentic Intelligence

    cs.CL 2026-02 unverdicted novelty 5.0

    Kimi K2.5 combines joint text-vision training with an Agent Swarm parallel orchestration framework to reach claimed state-of-the-art results on coding, vision, reasoning, and agent tasks while cutting latency up to 4.5 times.

  33. DeepSeek-V3.2: Pushing the Frontier of Open Large Language Models

    cs.CL 2025-12 unverdicted novelty 5.0

    DeepSeek-V3.2 adds sparse attention, scaled RL post-training, and large-scale agentic data synthesis to reach GPT-5-level performance and gold medals in 2025 IMO and IOI with its high-compute variant.

  34. Kimi K2: Open Agentic Intelligence

    cs.LG 2025-07 unverdicted novelty 5.0

    Kimi K2 is a 1-trillion-parameter MoE model that leads open-source non-thinking models on agentic benchmarks including 65.8 on SWE-Bench Verified and 66.1 on Tau2-Bench.

  35. Qwen3 Technical Report

    cs.CL 2025-05 unverdicted novelty 5.0

    Pith review generated a malformed one-line summary.

  36. SmolLM2: When Smol Goes Big -- Data-Centric Training of a Small Language Model

    cs.CL 2025-02 unverdicted novelty 5.0

    SmolLM2 is a 1.7B-parameter language model that outperforms Qwen2.5-1.5B and Llama3.2-1B after overtraining on 11 trillion tokens using custom FineMath, Stack-Edu, and SmolTalk datasets in a multi-stage pipeline.

  37. Humanity's Last Exam

    cs.LG 2025-01 unverdicted novelty 5.0

    Humanity's Last Exam is a new 2,500-question benchmark at the frontier of human knowledge where state-of-the-art LLMs show low accuracy.

  38. On the Privacy of LLMs: An Ablation Study

    cs.CR 2026-05 unverdicted novelty 4.0

    Privacy attacks on LLMs show strong signals for membership inference and backdoors but weaker performance for attribute inference and data extraction, with risks highly dependent on system configuration.

  39. EXAONE 4.5 Technical Report

    cs.CL 2026-04 unverdicted novelty 2.0

    EXAONE 4.5 is a new open-weight multimodal model that matches general benchmarks and outperforms similar-scale models on document understanding and Korean contextual reasoning.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · cited by 37 Pith papers · 20 internal anchors

  1. [1]

    Phi-3 Technical Report: A Highly Capable Language Model Locally on Your Phone

    Marah Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat Behl, et al. Phi-3 technical report: A highly capable language model locally on your phone. arXiv preprint arXiv:2404.14219, 2024

  2. [2]

    GPT-4 Technical Report

    Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Aleman, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, et al. Gpt-4 technical report. arXiv preprint arXiv:2303.08774, 2023

  3. [3]

    When benchmarks are targets: Revealing the sensitivity of large language model leaderboards

    Norah Alzahrani, Hisham Abdullah Alyahya, Yazeed Alnumay, Sultan Alrashed, Shaykhah Alsubaie, Yusef Almushaykeh, Faisal Mirza, Nouf Alotaibi, Nora Altwairesh, Areeb Alowisheq, et al. When benchmarks are targets: Revealing the sensitivity of large language model leaderboards. arXiv preprint arXiv:2402.01781, 2024

  4. [4]

    Llemma: An Open Language Model For Mathematics

    Zhangir Azerbayev, Hailey Schoelkopf, Keiran Paster, Marco Dos Santos, Stephen McAleer, Albert Q Jiang, Jia Deng, Stella Biderman, and Sean Welleck. Llemma: An open language model for mathematics. arXiv preprint arXiv:2310.10631, 2023

  5. [5]

    Qwen Technical Report

    Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Yang Fan, Wenbin Ge, Yu Han, Fei Huang, Binyuan Hui, Luo Ji, Mei Li, Junyang Lin, Runji Lin, Dayiheng Liu, Gao Liu, Chengqiang Lu, Keming Lu, Jianxin Ma, Rui Men, Xingzhang Ren, Xuancheng Ren, Chuanqi Tan, Sinan Tan, Jianhong Tu, Peng Wang, Shijie Wang, Wei Wang, Shengguang Wu, Benfeng X...

  6. [6]

    Constitutional AI: Harmlessness from AI Feedback

    Yuntao Bai, Saurav Kadavath, Sandipan Kundu, Amanda Askell, Jackson Kernion, Andy Jones, Anna Chen, Anna Goldie, Azalia Mirhoseini, Cameron McKinnon, et al. Constitutional ai: Harmlessness from ai feedback. arXiv preprint arXiv:2212.08073, 2022

  7. [7]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems , 33:1877–1901, 2020

  8. [8]

    C4ai-command-r-v01

    C4AI Command-R v01. URL https://huggingface.co/CohereForAI/c4ai-command-r-v01

  9. [9]

    A survey on evaluation of large language models

    Yupeng Chang, Xu Wang, Jindong Wang, Yuan Wu, Linyi Yang, Kaijie Zhu, Hao Chen, Xiaoyuan Yi, Cunxiang Wang, Yidong Wang, et al. A survey on evaluation of large language models. ACM Transactions on Intelligent Systems and Technology, 15(3):1–45, 2024

  10. [10]

    Theoremqa: A theorem-driven question answering dataset

    Wenhu Chen, Ming Yin, Max Ku, Pan Lu, Yixin Wan, Xueguang Ma, Jianyu Xu, Xinyi Wang, and Tony Xia. Theoremqa: A theorem-driven question answering dataset. In The 2023 Conference on Empirical Methods in Natural Language Processing , 2023

  11. [11]

    Chatbot Arena: An Open Platform for Evaluating LLMs by Human Preference

    Wei-Lin Chiang, Lianmin Zheng, Ying Sheng, Anastasios Nikolas Angelopoulos, Tianle Li, Dacheng Li, Hao Zhang, Banghua Zhu, Michael Jordan, Joseph E Gonzalez, et al. Chatbot arena: An open platform for evaluating llms by human preference. arXiv preprint arXiv:2403.04132, 2024

  12. [12]

    Think you have Solved Question Answering? Try ARC, the AI2 Reasoning Challenge

    Peter Clark, Isaac Cowhey, Oren Etzioni, Tushar Khot, Ashish Sabharwal, Carissa Schoenick, and Oyvind Tafjord. Think you have solved question answering? try arc, the ai2 reasoning challenge. arXiv preprint arXiv:1803.05457, 2018

  13. [13]

    Introducing the Next Generation of Claude

    Anthropic. Introducing the next generation of Claude. URL https://www.anthropic.com/news/claude-3-family

  14. [14]

    Opencompass: A universal evaluation platform for foundation models

    OpenCompass Contributors. Opencompass: A universal evaluation platform for foundation models. https://github.com/open-compass/opencompass, 2023

  15. [15]

    Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

    DeepSeek-AI. Deepseek-v2: A strong, economical, and efficient mixture-of-experts language model, 2024

  16. [16]

    Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance

    Yao Fu, Litu Ou, Mingyu Chen, Yuhao Wan, Hao Peng, and Tushar Khot. Chain-of-thought hub: A continuous effort to measure large language models’ reasoning performance. arXiv preprint arXiv:2305.17306, 2023

  17. [17]

    Hello GPT-4o

    OpenAI. Hello GPT-4o. URL https://openai.com/index/hello-gpt-4o/

  18. [18]

    Measuring Massive Multitask Language Understanding

    Dan Hendrycks, Collin Burns, Steven Basart, Andy Zou, Mantas Mazeika, Dawn Song, and Jacob Steinhardt. Measuring massive multitask language understanding. arXiv preprint arXiv:2009.03300, 2020

  19. [19]

    Measuring Mathematical Problem Solving With the MATH Dataset

    Dan Hendrycks, Collin Burns, Saurav Kadavath, Akul Arora, Steven Basart, Eric Tang, Dawn Song, and Jacob Steinhardt. Measuring mathematical problem solving with the math dataset. arXiv preprint arXiv:2103.03874, 2021

  20. [20]

    Mistral 7B

    Albert Q Jiang, Alexandre Sablayrolles, Arthur Mensch, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Florian Bressand, Gianna Lengyel, Guillaume Lample, Lucile Saulnier, et al. Mistral 7b. arXiv preprint arXiv:2310.06825, 2023

  21. [21]

    Mixtral of Experts

    Albert Q Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, et al. Mixtral of experts. arXiv preprint arXiv:2401.04088, 2024

  22. [22]

    Holistic Evaluation of Language Models

    Percy Liang, Rishi Bommasani, Tony Lee, Dimitris Tsipras, Dilara Soylu, Michihiro Yasunaga, Yian Zhang, Deepak Narayanan, Yuhuai Wu, Ananya Kumar, et al. Holistic evaluation of language models. arXiv preprint arXiv:2211.09110, 2022

  23. [23]

    Lingyiwanwu: Yi-Large

    Lingyiwanwu. Yi-Large. URL https://www.lingyiwanwu.com/en

  24. [24]

    Build the Future of AI with Meta Llama 3

    Meta. Build the future of AI with Meta Llama 3. URL https://llama.meta.com/llama3/

  25. [25]

    Cross-Task Generalization via Natural Language Crowdsourcing Instructions

    Swaroop Mishra, Daniel Khashabi, Chitta Baral, and Hannaneh Hajishirzi. Cross-task generalization via natural language crowdsourcing instructions. arXiv preprint arXiv:2104.08773, 2021

  26. [26]

    Levels of AGI: Operationalizing Progress on the Path to AGI

    Meredith Ringel Morris, Jascha Sohl-dickstein, Noah Fiedel, Tris Warkentin, Allan Dafoe, Aleksandra Faust, Clement Farabet, and Shane Legg. Levels of agi: Operationalizing progress on the path to agi. arXiv preprint arXiv:2311.02462, 2023

  27. [27]

    Open LLM Leaderboard - a Hugging Face Space by open-llm-leaderboard

    Open LLM Leaderboard - a Hugging Face Space by open-llm-leaderboard. URL https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

  28. [28]

    Training language models to follow instructions with human feedback

    Long Ouyang, Jeffrey Wu, Xu Jiang, Diogo Almeida, Carroll Wainwright, Pamela Mishkin, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, et al. Training language models to follow instructions with human feedback. Advances in neural information processing systems , 35:27730–27744, 2022

  29. [29]

    Smaug: Fixing Failure Modes of Preference Optimisation with DPO-Positive

    Arka Pal, Deep Karkhanis, Samuel Dooley, Manley Roberts, Siddartha Naidu, and Colin White. Smaug: Fixing failure modes of preference optimisation with dpo-positive. arXiv preprint arXiv:2402.13228, 2024

  30. [30]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Machel Reid, Nikolay Savinov, Denis Teplyashin, Dmitry Lepikhin, Timothy Lillicrap, Jean-baptiste Alayrac, Radu Soricut, Angeliki Lazaridou, Orhan Firat, Julian Schrittwieser, et al. Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530, 2024

  31. [31]

    Multitask Prompted Training Enables Zero-Shot Task Generalization

    Victor Sanh, Albert Webson, Colin Raffel, Stephen H Bach, Lintang Sutawika, Zaid Alyafeai, Antoine Chaffin, Arnaud Stiegler, Teven Le Scao, Arun Raja, et al. Multitask prompted training enables zero-shot task generalization. arXiv preprint arXiv:2110.08207, 2021

  32. [32]

    Beyond the Imitation Game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, Abu Awal Md Shoeb, Abubakar Abid, Adam Fisch, Adam R Brown, Adam Santoro, Aditya Gupta, Adrià Garriga-Alonso, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. arXiv preprint arXiv:2206.04615, 2022

  33. [33]

    Challenging BIG-Bench Tasks and Whether Chain-of-Thought Can Solve Them

    Mirac Suzgun, Nathan Scales, Nathanael Schärli, Sebastian Gehrmann, Yi Tay, Hyung Won Chung, Aakanksha Chowdhery, Quoc V Le, Ed H Chi, Denny Zhou, et al. Challenging big-bench tasks and whether chain-of-thought can solve them. arXiv preprint arXiv:2210.09261, 2022

  34. [34]

    Gemma: Open Models Based on Gemini Research and Technology

    Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on gemini research and technology. arXiv preprint arXiv:2403.08295, 2024

  35. [35]

    Llama 2: Open Foundation and Fine-Tuned Chat Models

    Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikolay Bashlykov, Soumya Batra, Prajjwal Bhargava, Shruti Bhosale, et al. Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288, 2023

  36. [36]

    Zephyr: Direct Distillation of LM Alignment

    Lewis Tunstall, Edward Beeching, Nathan Lambert, Nazneen Rajani, Kashif Rasul, Younes Belkada, Shengyi Huang, Leandro von Werra, Clémentine Fourrier, Nathan Habib, et al. Zephyr: Direct distillation of lm alignment. arXiv preprint arXiv:2310.16944, 2023

  37. [37]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    Alex Wang, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel R Bowman. Glue: A multi-task benchmark and analysis platform for natural language understanding. arXiv preprint arXiv:1804.07461, 2018

  38. [38]

    Superglue: A stickier benchmark for general-purpose language understanding systems

    Alex Wang, Yada Pruksachatkun, Nikita Nangia, Amanpreet Singh, Julian Michael, Felix Hill, Omer Levy, and Samuel Bowman. Superglue: A stickier benchmark for general-purpose language understanding systems. Advances in neural information processing systems , 32, 2019

  39. [39]

    OpenChat: Advancing Open-Source Language Models with Mixed-Quality Data

    Guan Wang, Sijie Cheng, Xianyuan Zhan, Xiangang Li, Sen Song, and Yang Liu. Openchat: Advancing open-source language models with mixed-quality data. arXiv preprint arXiv:2309.11235, 2023

  40. [40]

    SciBench: Evaluating College-Level Scientific Problem-Solving Abilities of Large Language Models

    Xiaoxuan Wang, Ziniu Hu, Pan Lu, Yanqiao Zhu, Jieyu Zhang, Satyen Subramaniam, Arjun R Loomba, Shichang Zhang, Yizhou Sun, and Wei Wang. Scibench: Evaluating college-level scientific problem-solving abilities of large language models. arXiv preprint arXiv:2307.10635, 2023

  41. [41]

    Chain-of-thought prompting elicits reasoning in large language models

    Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Fei Xia, Ed Chi, Quoc V Le, Denny Zhou, et al. Chain-of-thought prompting elicits reasoning in large language models. Advances in neural information processing systems , 35:24824–24837, 2022

  42. [42]

    Internlm-math: Open math large language models toward verifiable reasoning, 2024

    Huaiyuan Ying, Shuo Zhang, Linyang Li, Zhejian Zhou, Yunfan Shao, Zhaoye Fei, Yichuan Ma, Jiawei Hong, Kuikun Liu, Ziyi Wang, Yudong Wang, Zijian Wu, Shuaibin Li, Fengzhe Zhou, Hongwei Liu, Songyang Zhang, Wenwei Zhang, Hang Yan, Xipeng Qiu, Jiayu Wang, Kai Chen, and Dahua Lin. Internlm-math: Open math large language models toward verifiable reasoning, 2024

  43. [43]

    Yi: Open Foundation Models by 01.AI

    Alex Young, Bei Chen, Chao Li, Chengen Huang, Ge Zhang, Guanwei Zhang, Heng Li, Jiangcheng Zhu, Jianqun Chen, Jing Chang, et al. Yi: Open foundation models by 01.AI. arXiv preprint arXiv:2403.04652, 2024

  44. [44]

    Mammoth2: Scaling instructions from the web

    Xiang Yue, Tuney Zheng, Ge Zhang, and Wenhu Chen. Mammoth2: Scaling instructions from the web. arXiv preprint arXiv:2405.03548, 2024

  45. [45]

    HellaSwag: Can a Machine Really Finish Your Sentence?

    Rowan Zellers, Ari Holtzman, Yonatan Bisk, Ali Farhadi, and Yejin Choi. Hellaswag: Can a machine really finish your sentence? arXiv preprint arXiv:1905.07830, 2019

  46. [46]

    Map-neo: Highly capable and transparent bilingual large language model series, 2024

    Ge Zhang, Scott Qu, Jiaheng Liu, Chenchen Zhang, Chenghua Lin, Chou Leuang Yu, Danny Pan, Esther Cheng, Jie Liu, Qunshu Lin, Raven Yuan, Tuney Zheng, Wei Pang, Xinrun Du, Yiming Liang, Yinghao Ma, Yizhi Li, Ziyang Ma, Bill Lin, Emmanouil Benetos, Huan Yang, Junting Zhou, Kaijing Ma, Minghao Liu, Morry Niu, Noah Wang, Quehry Que, Ruibo Liu, Sine Liu, Shawn...

  47. [47]

    Large language models are not robust multiple choice selectors

    Chujie Zheng, Hao Zhou, Fandong Meng, Jie Zhou, and Minlie Huang. Large language models are not robust multiple choice selectors. In The Twelfth International Conference on Learning Representations, 2023

  48. [48]

    Agieval: A human-centric benchmark for evaluating foundation models

    Wanjun Zhong, Ruixiang Cui, Yiduo Guo, Yaobo Liang, Shuai Lu, Yanlin Wang, Amin Saied, Weizhu Chen, and Nan Duan. Agieval: A human-centric benchmark for evaluating foundation models. arXiv preprint arXiv:2304.06364, 2023

  49. [49]

    Starling-7B: Improving LLM Helpfulness and Harmlessness with RLAIF

    Banghua Zhu, Evan Frick, Tianhao Wu, Hanlin Zhu, and Jiantao Jiao. Starling-7b: Improving llm helpfulness and harmlessness with rlaif, November 2023
