pith. machine review for the scientific record.

arxiv: 2311.12022 · v1 · submitted 2023-11-20 · 💻 cs.AI · cs.CL

Recognition: 2 theorem links

· Lean Theorem

GPQA: A Graduate-Level Google-Proof Q&A Benchmark

Asa Cooper Stickland, Betty Li Hou, David Rein, Jackson Petty, Julian Michael, Julien Dirani, Richard Yuanzhe Pang, Samuel R. Bowman


Pith reviewed 2026-05-11 03:56 UTC · model grok-4.3

classification 💻 cs.AI cs.CL
keywords benchmark dataset · AI evaluation · scalable oversight · graduate-level questions · expert validation · multiple-choice science questions · biology physics chemistry

The pith

On a benchmark of 448 expert-written questions, PhD-level experts reach only 65 percent accuracy, while non-experts and GPT-4 stay below 40 percent.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper creates GPQA as a set of 448 multiple-choice questions in biology, physics, and chemistry, each written and vetted by domain experts to reach graduate difficulty. It shows that experts achieve 65 percent accuracy, which rises to 74 percent after removing clear mistakes they later spotted, while skilled non-experts with full web access average only 34 percent after more than 30 minutes per question. The strongest GPT-4 baseline reaches 39 percent. The authors argue that this persistent gap for both non-experts and current AI creates a practical setting to test methods that let humans supervise AI systems on questions where the AI may eventually exceed human skill.

Core claim

GPQA is a dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. Experts who have or are pursuing PhDs in those fields reach 65 percent accuracy, or 74 percent when clear mistakes are discounted in retrospect. Highly skilled non-expert validators reach only 34 percent despite more than 30 minutes and unrestricted web access. The strongest GPT-4 baseline achieves 39 percent accuracy. The authors position this difficulty as the basis for realistic experiments on scalable oversight of future AI systems that may surpass human capabilities on hard scientific questions.
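As a rough aid for reading these headline numbers, the sketch below computes a binomial standard error for each reported accuracy. It assumes, purely for illustration, that each figure is a proportion over all 448 questions scored independently; the paper's actual per-group sample sizes may differ, and the function and variable names are hypothetical.

```python
import math

N_QUESTIONS = 448  # dataset size reported in the abstract

# Accuracies reported in the abstract, expressed as fractions.
REPORTED = {"expert": 0.65, "non_expert": 0.34, "gpt4_baseline": 0.39}


def binomial_standard_error(p: float, n: int) -> float:
    """Standard error of a proportion p estimated from n independent trials."""
    return math.sqrt(p * (1.0 - p) / n)


def summarize(reported: dict[str, float], n: int) -> dict[str, float]:
    """Map each group to its standard error (illustrative independence assumption)."""
    return {name: binomial_standard_error(p, n) for name, p in reported.items()}


if __name__ == "__main__":
    for name, se in summarize(REPORTED, N_QUESTIONS).items():
        print(f"{name}: {REPORTED[name]:.2f} +/- {se:.3f}")
```

Under that assumption each standard error is a little over two percentage points, so the expert lead over the other two groups is large relative to sampling noise, while the five-point difference between non-experts and the GPT-4 baseline is comparatively small.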

What carries the argument

The GPQA dataset itself, built by having domain experts write the questions, which are then validated by other experts and by skilled non-experts so that they are both high-quality and resistant to quick web lookup.

If this is right

  • The dataset supplies a concrete testbed for experiments on human supervision of AI outputs on questions where AI may already or soon exceed human performance.
  • It allows direct comparison of oversight techniques by measuring whether humans can extract reliable answers from AI systems that otherwise score low on the questions.
  • Future models can be evaluated on whether they improve on GPQA only when paired with human oversight rather than through standalone scaling.
  • Progress on GPQA would support the broader goal of using AI to generate new scientific knowledge under reliable human review.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same expert-writing and validation process could be applied in additional domains such as mathematics or computer science to widen the range of testable oversight scenarios.
  • Researchers could run repeated trials on the same questions with different oversight protocols to measure which techniques most reliably close the gap between AI capability and human-verified correctness.
  • If models eventually solve most GPQA questions at expert level, the dataset could still serve as a calibration point for checking whether oversight methods generalize to even harder, previously unseen questions.

Load-bearing premise

The questions will stay difficult and representative once released publicly, and the performance gaps will continue to signal the need for new oversight methods rather than just reflecting temporary limits in search or models.

What would settle it

A controlled test in which non-expert accuracy on the full set of released questions rises above 50 percent after standard web search would falsify the Google-proof property; alternatively, an AI model reaching 65 percent accuracy on GPQA without any added oversight technique would weaken the claim that the benchmark immediately requires new supervision methods.
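The two conditions above can be written as a small check. This is a minimal sketch, not part of the paper or of Pith's pipeline: the thresholds are taken from the paragraph above, while the `Verdict` type and `settle` function are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Thresholds taken from the paragraph above.
NON_EXPERT_FALSIFY_THRESHOLD = 0.50  # non-expert accuracy must rise above this
EXPERT_LEVEL_ACCURACY = 0.65         # reported expert accuracy on GPQA


@dataclass(frozen=True)
class Verdict:
    google_proof_falsified: bool
    oversight_claim_weakened: bool


def settle(non_expert_accuracy: float, unaided_model_accuracy: float) -> Verdict:
    """Apply the two settling conditions to measured accuracies (fractions in [0, 1])."""
    return Verdict(
        # "rises above 50 percent": strict inequality
        google_proof_falsified=non_expert_accuracy > NON_EXPERT_FALSIFY_THRESHOLD,
        # "reaching 65 percent": inclusive inequality
        oversight_claim_weakened=unaided_model_accuracy >= EXPERT_LEVEL_ACCURACY,
    )
```

With the figures reported in the paper, `settle(0.34, 0.39)` leaves both flags unset.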

read the original abstract

We present GPQA, a challenging dataset of 448 multiple-choice questions written by domain experts in biology, physics, and chemistry. We ensure that the questions are high-quality and extremely difficult: experts who have or are pursuing PhDs in the corresponding domains reach 65% accuracy (74% when discounting clear mistakes the experts identified in retrospect), while highly skilled non-expert validators only reach 34% accuracy, despite spending on average over 30 minutes with unrestricted access to the web (i.e., the questions are "Google-proof"). The questions are also difficult for state-of-the-art AI systems, with our strongest GPT-4 based baseline achieving 39% accuracy. If we are to use future AI systems to help us answer very hard questions, for example, when developing new scientific knowledge, we need to develop scalable oversight methods that enable humans to supervise their outputs, which may be difficult even if the supervisors are themselves skilled and knowledgeable. The difficulty of GPQA both for skilled non-experts and frontier AI systems should enable realistic scalable oversight experiments, which we hope can help devise ways for human experts to reliably get truthful information from AI systems that surpass human capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 2 minor

Summary. The paper introduces GPQA, a benchmark of 448 multiple-choice questions in biology, physics, and chemistry written by domain experts. It reports that PhD-level experts achieve 65% accuracy (74% after discounting clear mistakes identified in retrospect), highly skilled non-experts reach only 34% despite over 30 minutes with unrestricted web access, and the strongest GPT-4 baseline reaches 39%. The authors position the dataset as a resource for realistic experiments on scalable oversight methods to supervise AI systems on hard scientific questions.

Significance. If the measured performance gaps hold under the reported validation protocols, GPQA supplies a concrete, expert-curated benchmark that can accelerate research on human-AI supervision for tasks exceeding current human or model capabilities. The explicit counts, domain breakdowns, and inter-annotator details supplied in the full text constitute reproducible elements that strengthen the work's utility for AI safety and evaluation studies.

minor comments (2)
  1. [Abstract] The summary of accuracy figures is clear, but a single sentence noting the total question count and the existence of inter-annotator agreement statistics would improve standalone readability without lengthening the paragraph.
  2. [Validation protocol section] While the criteria for discounting expert mistakes are described, an explicit statement of how many questions were affected by this adjustment would allow readers to assess the sensitivity of the 74% figure.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive review and recommendation to accept the paper. We appreciate the recognition of GPQA's value as a reproducible benchmark for studying scalable oversight on expert-level scientific questions.

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper presents an empirical benchmark dataset along with directly measured performance metrics from external human validators (PhD experts and skilled non-experts) and an off-the-shelf GPT-4 model. These accuracy figures (65% expert, 34% non-expert, 39% GPT-4) are computed from validation experiments with explicit protocols, question counts, and inter-annotator details rather than any mathematical derivation, fitted parameter, self-referential equation, or load-bearing self-citation. The central claims about question difficulty and the need for scalable oversight rest on these independent measurements and do not reduce to inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The contribution rests on standard benchmark-construction assumptions rather than new free parameters or invented entities.

axioms (2)
  • domain assumption PhD-level experts can serve as reliable upper-bound validators for graduate-level questions in their domains
    Invoked to establish the 65% expert accuracy ceiling.
  • domain assumption Unrestricted web search plus 30+ minutes is sufficient for skilled non-experts to exhaust publicly available information on the questions
    Used to support the Google-proof claim.

pith-pipeline@v0.9.0 · 5525 in / 1349 out tokens · 47475 ms · 2026-05-11T03:56:10.382894+00:00 · methodology



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 conditional novelty 8.0

    AutoTTS discovers width-depth test-time scaling controllers through agentic search in a pre-collected trajectory environment, yielding better accuracy-cost tradeoffs than hand-designed baselines on math reasoning task...

  2. Large Language Diffusion Models

    cs.CL 2025-02 unverdicted novelty 8.0

    LLaDA is a scalable diffusion-based language model that matches autoregressive LLMs like LLaMA3 8B on tasks and surpasses GPT-4o on reversal poem completion.

  3. Inducing Artificial Uncertainty in Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    Inducing artificial uncertainty on trivial tasks allows training probes that achieve higher calibration on hard data than standard approaches while retaining performance on easy data.

  4. Query-Conditioned Test-Time Self-Training for Large Language Models

    cs.CL 2026-05 unverdicted novelty 7.0

    QueST lets LLMs create query-conditioned problem-solution pairs at inference time and use them for parameter-efficient self-training, outperforming prior test-time baselines on math and science benchmarks.

  5. Learning Agentic Policy from Action Guidance

    cs.CL 2026-05 unverdicted novelty 7.0

    ActGuide-RL uses human action data as plan-style guidance in mixed-policy RL to overcome exploration barriers in LLM agents, matching SFT+RL performance on search benchmarks without cold-start training.

  6. Human-Grounded Multimodal Benchmark with 900K-Scale Aggregated Student Response Distributions from Japan's National Assessment of Academic Ability

    cs.CL 2026-05 unverdicted novelty 7.0

    A new benchmark dataset drawn from Japan's National Assessment of Academic Ability supplies real exam layouts, diagrams, Japanese text, and nationwide student response distributions for evaluating multimodal LLMs.

  7. Breaking Winner-Takes-All: Cooperative Policy Optimization Improves Diverse LLM Reasoning

    cs.AI 2026-05 unverdicted novelty 7.0

    GCPO shifts RLVR from rollout competition to team cooperation by assigning advantages via marginal contributions to a determinant-based coverage volume over semantic embeddings, yielding higher accuracy and solution d...

  8. TRACE: Distilling Where It Matters via Token-Routed Self On-Policy Alignment

    cs.AI 2026-05 unverdicted novelty 7.0

    TRACE improves math reasoning by distilling only on annotator-marked critical spans with forward KL on correct key spans, optional reverse KL on errors, and GRPO elsewhere, gaining 2.76 points over GRPO while preservi...

  9. LLM-Guided Monte Carlo Tree Search over Knowledge Graphs: Composing Mechanistic Explanations for Drug-Disease Pairs

    cs.AI 2026-05 unverdicted novelty 7.0

    TESSERA combines LLMs as local policy and evaluator with MCTS on knowledge graphs to compose mechanistic drug-disease explanations.

  10. Test-Time Speculation

    cs.CL 2026-05 unverdicted novelty 7.0

    Test-Time Speculation adapts draft models online via target-model verifications to sustain high acceptance lengths during long LLM generations.

  11. DUET: Optimize Token-Budget Allocation for Reinforcement Learning with Verifiable Rewards

    cs.LG 2026-05 unverdicted novelty 7.0

    DUET improves RLVR by allocating tokens across both prompt selection and rollout length, outperforming full-budget baselines even when using only half the tokens.

  12. Queryable LoRA: Instruction-Regularized Routing Over Shared Low-Rank Update Atoms

    cs.LG 2026-05 unverdicted novelty 7.0

    Queryable LoRA adds dynamic routing over shared low-rank atoms with attention and language-instruction regularization to make parameter-efficient fine-tuning more adaptive across inputs and layers.

  13. LLMs Improving LLMs: Agentic Discovery for Test-Time Scaling

    cs.CL 2026-05 unverdicted novelty 7.0

    AutoTTS discovers superior test-time scaling strategies for LLMs via cheap controller synthesis in a pre-collected trajectory environment, outperforming manual baselines on math benchmarks with low discovery cost.

  14. KL for a KL: On-Policy Distillation with Control Variate Baseline

    cs.LG 2026-05 unverdicted novelty 7.0

    vOPD stabilizes on-policy distillation gradients by subtracting a closed-form per-token negative reverse KL baseline as a detached control variate, preserving unbiasedness while lowering variance and matching expensiv...

  15. Rubric-based On-policy Distillation

    cs.LG 2026-05 unverdicted novelty 7.0

    Rubric-based on-policy distillation allows training student models using only teacher responses by generating scoring rubrics from contrasts and using them for on-policy optimization, achieving superior performance an...

  16. When Are Experts Misrouted? Counterfactual Routing Analysis in Mixture-of-Experts Language Models

    cs.LG 2026-05 unverdicted novelty 7.0

    Standard top-k routers in MoE language models often select suboptimal routes for difficult tokens, and updating only the final router layer raises pass@K on AIME and HMMT benchmarks across multiple models.

  17. Star Elastic: Many-in-One Reasoning LLMs with Efficient Budget Control

    cs.LG 2026-05 unverdicted novelty 7.0

    Star Elastic trains N nested submodels in a single post-training job on a parent reasoning LLM, supporting elastic budget control that matches or exceeds independent baselines while cutting training compute by up to 360x.

  18. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training compute for logical reasoning follows a power law in proof depth whose exponent rises with logic expressiveness, and more expressive training yields larger gains on downstream benchmarks.

  19. Can RL Teach Long-Horizon Reasoning to LLMs? Expressiveness Is Key

    cs.AI 2026-05 unverdicted novelty 7.0

    RL training on more expressive logical tasks follows a steeper power-law scaling with reasoning depth and transfers more efficiently to math and reasoning benchmarks.

  20. MRI-Eval: A Tiered Benchmark for Evaluating LLM Performance on MRI Physics and GE Scanner Operations Knowledge

    eess.IV 2026-05 unverdicted novelty 7.0

    MRI-Eval benchmark shows frontier LLMs scoring 93-97% on MRI MCQs but falling to 37-61% on stem-only questions, with GE scanner operations as the weakest category for all models.

  21. Token Arena: A Continuous Benchmark Unifying Energy and Cognition in AI Inference

    cs.AI 2026-05 unverdicted novelty 7.0

    TokenArena is a continuous benchmark for AI inference endpoints that measures output speed, time to first token, blended price, effective context, quality, and modeled energy to produce composites of joules per correc...

  22. Jailbroken Frontier Models Retain Their Capabilities

    cs.LG 2026-04 unverdicted novelty 7.0

    Jailbreak-induced performance loss shrinks as model capability grows, with the strongest models showing almost no degradation on benchmarks.

  23. Guess-Verify-Refine: Data-Aware Top-K for Sparse-Attention Decoding on Blackwell via Temporal Correlation

    cs.DC 2026-04 unverdicted novelty 7.0

    GVR uses previous-step Top-K predictions, pre-indexed stats, secant counting, and shared-memory verification to deliver 1.88x average speedup over radix-select while preserving bit-exact Top-K on DeepSeek-V3.2 workloads.

  24. IRIS: Interpolative Rényi Iterative Self-play for Large Language Model Fine-Tuning

    cs.LG 2026-04 unverdicted novelty 7.0

    IRIS unifies self-play fine-tuning under an interpolative Rényi objective with adaptive alpha scheduling and reports better benchmark scores than baselines while surpassing full supervised fine-tuning with only 13% of...

  25. Super Apriel: One Checkpoint, Many Speeds

    cs.LG 2026-04 unverdicted novelty 7.0

    A single 15B supernet checkpoint supports runtime switching between attention mixer placements for multiple decode speed presets while retaining 77-96% quality relative to the teacher model.

  26. More Thinking, More Bias: Length-Driven Position Bias in Reasoning Models

    cs.AI 2026-04 unverdicted novelty 7.0

    Position bias scales positively with reasoning trajectory length in CoT models, shown by partial correlations and truncation interventions across multiple benchmarks and model scales.

  27. Low-rank Optimization Trajectories Modeling for LLM RLVR Acceleration

    cs.LG 2026-04 unverdicted novelty 7.0

    NExt accelerates RLVR training for LLMs by nonlinearly extrapolating low-rank parameter trajectories extracted from LoRA runs.

  28. Controllable and Verifiable Tool-Use Data Synthesis for Agentic Reinforcement Learning

    cs.AI 2026-04 unverdicted novelty 7.0

    COVERT generates verifiable synthetic tool-use environments for RL by validated trajectory synthesis and oracle-preserving augmentations, improving tool-use accuracy on BFCL v3 and ACEBench while remaining complementa...

  29. MARS: Enabling Autoregressive Models Multi-Token Generation

    cs.CL 2026-04 unverdicted novelty 7.0

    MARS fine-tunes autoregressive models to predict multiple tokens per step via continued training on instruction data, achieving 1.5-1.7x throughput while matching baseline accuracy and supporting real-time speed adjustment.

  30. FrontierFinance: A Long-Horizon Computer-Use Benchmark of Real-World Financial Tasks

    cs.CL 2026-04 unverdicted novelty 7.0

    FrontierFinance benchmark shows human financial experts outperform state-of-the-art LLMs by achieving higher scores and more client-ready outputs on realistic long-horizon tasks.

  31. Stress-Testing the Reasoning Competence of LLMs With Proofs Under Minimal Formalism

    cs.LO 2026-04 unverdicted novelty 7.0

    ProofGrid is a new benchmark for LLM reasoning that uses machine-checkable proofs in minimal formal notation, revealing progress on basic tasks but major gaps in complex combinatorial and synthesis reasoning.

  32. ThermoQA: A Three-Tier Benchmark for Evaluating Thermodynamic Reasoning in Large Language Models

    cs.AI 2026-03 accept novelty 7.0

    ThermoQA benchmark shows top LLMs reach 92-94% overall on thermodynamics problems but degrade sharply on full cycle analysis, confirming that property knowledge does not equal reasoning ability.

  33. Do NOT Think That Much for 2+3=? On the Overthinking of o1-Like LLMs

    cs.CL 2024-12 unverdicted novelty 7.0

    o1-like models overthink easy tasks; self-training reduces compute use without accuracy loss on GSM8K, MATH500, GPQA, and AIME.

  34. TextGrad: Automatic "Differentiation" via Text

    cs.CL 2024-06 unverdicted novelty 7.0

    TextGrad performs automatic differentiation for compound AI systems by backpropagating natural-language feedback from LLMs to optimize variables ranging from code to molecular structures.

  35. δ-mem: Efficient Online Memory for Large Language Models

    cs.AI 2026-05 unverdicted novelty 6.0

    δ-mem augments frozen LLMs with an 8x8 online memory state updated by delta-rule learning to generate low-rank attention corrections, delivering 1.10x average gains over the backbone and larger improvements on memory-...

  36. Nice Fold or Hero Call: Learning Budget-Efficient Thinking for Adaptive Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    BET reduces reasoning tokens by about 55% on average while improving performance across benchmarks by learning to short-solve easy queries, fold early on unsolvable ones, and preserve budget for hard solvable queries.

  37. Selective Off-Policy Reference Tuning with Plan Guidance

    cs.AI 2026-05 unverdicted novelty 6.0

    SORT turns all-wrong prompts into selective learning signals by weighting tokens more predictable under plan guidance from reference solutions, improving over GRPO on reasoning benchmarks especially for weaker models.

  38. DataMaster: Data-Centric Autonomous AI Research

    cs.LG 2026-05 unverdicted novelty 6.0

    DataMaster autonomously optimizes data via tree search and shared memory, raising medal rate 32.27% on MLE-Bench Lite and beating the base instruct model on GPQA.

  39. DataMaster: Data-Centric Autonomous AI Research

    cs.LG 2026-05 unverdicted novelty 6.0

    DataMaster deploys an AI agent to autonomously engineer data via tree search over external sources, shared candidate pools, and memory of past outcomes, yielding 32% higher medal rates on MLE-Bench Lite and a small GP...

  40. The Generalized Turing Test: A Foundation for Comparing Intelligence

    cs.AI 2026-05 unverdicted novelty 6.0

    The Generalized Turing Test defines relative intelligence as the inability of one agent to distinguish an imitator from the original through interaction.

  41. Active Testing of Large Language Models via Approximate Neyman Allocation

    cs.AI 2026-05 unverdicted novelty 6.0

    Active testing via surrogate semantic entropy stratification and approximate Neyman allocation reduces MSE by up to 28% versus uniform sampling and saves about 23% of the labeling budget on language and multimodal benchmarks.

  42. FocuSFT: Bilevel Optimization for Dilution-Aware Long-Context Fine-Tuning

    cs.CL 2026-05 unverdicted novelty 6.0

    FocuSFT uses an inner optimization loop to adapt fast-weight parameters into a parametric memory that sharpens attention on relevant content, then conditions outer-loop supervised fine-tuning on this representation, y...

  43. Efficient LLM Reasoning via Variational Posterior Guidance with Efficiency Awareness

    cs.LG 2026-05 unverdicted novelty 6.0

    VPG-EA applies variational posterior guidance and efficiency-aware distillation to compress LLM reasoning chains while preserving performance.

  44. DARE: Difficulty-Adaptive Reinforcement Learning with Co-Evolved Difficulty Estimation

    cs.LG 2026-05 unverdicted novelty 6.0

    DARE co-evolves difficulty estimation and policy in RL for LLMs to improve training efficiency, final performance, and inference speed by using tailored strategies for different difficulty levels.

  45. Rotation-Preserving Supervised Fine-Tuning

    cs.LG 2026-05 unverdicted novelty 6.0

    RPSFT improves the in-domain versus out-of-domain performance trade-off during LLM supervised fine-tuning by penalizing rotations in pretrained singular subspaces as a proxy for loss-sensitive directions.

  46. AIPO: Learning to Reason from Active Interaction

    cs.CL 2026-05 unverdicted novelty 6.0

    AIPO trains LLMs to expand their reasoning capability boundary via active multi-agent interaction with Verify, Knowledge, and Reasoning agents during RLVR, using importance sampling and clipping to handle feedback, th...

  47. Rubric-Grounded RL: Structured Judge Rewards for Generalizable Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    Rubric-grounded RL with LLM judges on document-derived criteria raises Llama-3.1-8B normalized reward to 71.7% on held-out rubrics and improves performance on GSM8K, MATH, and GPQA benchmarks.

  48. Memory-Efficient Looped Transformer: Decoupling Compute from Memory in Looped Language Models

    cs.CL 2026-05 unverdicted novelty 6.0

    MELT decouples reasoning depth from memory in looped LLMs by sharing a single gated KV cache per layer and using two-phase chunk-wise distillation from Ouro, delivering constant memory use while matching or beating st...

  49. SEIF: Self-Evolving Reinforcement Learning for Instruction Following

    cs.CL 2026-05 conditional novelty 6.0

    SEIF creates a self-reinforcing loop in which an LLM alternately generates increasingly difficult instructions and learns to follow them better using reinforcement learning signals from its own judgments.

  50. An Interpretable and Scalable Framework for Evaluating Large Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.

  51. Measuring Black-Box Confidence via Reasoning Trajectories: Geometry, Coverage, and Verbalization

    cs.AI 2026-05 unverdicted novelty 6.0

    Trajectory geometry in embedding space fused with coverage and verbalization yields better black-box CoT confidence estimation than self-consistency at lower sample counts across six benchmark-reasoner pairs.

  52. Uno-Orchestra: Parsimonious Agent Routing via Selective Delegation

    cs.AI 2026-05 unverdicted novelty 6.0

    A learned orchestration policy for LLM agents that jointly optimizes task decomposition and selective routing to (model, primitive) pairs, delivering 77% macro pass@1 at 10x lower cost than strong baselines across 13 ...

  53. Counting as a minimal probe of language model reliability

    cs.CL 2026-05 unverdicted novelty 6.0

    Language models have limited stable counting capacity well below context limits and rely on a finite set of count-like internal states, collapsing to guessing once exhausted.

  54. State Stream Transformer (SST) V2: Parallel Training of Nonlinear Recurrence for Latent Space Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    SST V2 introduces parallel-trainable nonlinear recurrence in latent space to let transformers reason continuously across positions, delivering +15 points on GPQA-Diamond and halving remaining GSM8K errors over matched...

  55. Latent-GRPO: Group Relative Policy Optimization for Latent Reasoning

    cs.LG 2026-04 unverdicted novelty 6.0

    Latent-GRPO stabilizes reinforcement learning in latent space, delivering 7.86 Pass@1 gains on low-difficulty tasks over latent baselines and 4.27 points over explicit GRPO on high-difficulty tasks with 3-4x shorter r...

  56. Large Language Models Decide Early and Explain Later

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs settle on their answer after a minority of CoT tokens and produce an average 760 more as post-decision explanation, enabling early stopping that saves 500 tokens per query at a 2% accuracy cost.

  57. How Far Are Video Models from True Multimodal Reasoning?

    cs.CV 2026-04 unverdicted novelty 6.0

    Current video models succeed on basic understanding but achieve under 25% success on logically grounded generation and near 0% on interactive generation, exposing gaps in multimodal reasoning.

  58. Train Separately, Merge Together: Modular Post-Training with Mixture-of-Experts

    cs.LG 2026-04 unverdicted novelty 6.0

    BAR trains independent domain experts via separate mid-training, SFT, and RL pipelines then composes them with a MoE router to match monolithic retraining performance at lower cost and without catastrophic forgetting.

  59. ContraPrompt: Contrastive Prompt Optimization via Dyadic Reasoning Trace Analysis

    cs.AI 2026-04 unverdicted novelty 6.0

    ContraPrompt extracts optimization rules from dyadic differences in reasoning traces on identical inputs and organizes them into input-aware decision trees, outperforming GEPA on four benchmarks with gains up to 8.29 pp.

  60. SPS: Steering Probability Squeezing for Better Exploration in Reinforcement Learning for Large Language Models

    cs.CL 2026-04 unverdicted novelty 6.0

    SPS interleaves RL and IRL to counteract probability squeezing in LLM reasoning trajectories, improving Pass@k on five benchmarks while identifying an empirical upper bound on multi-sample performance.

Reference graph

Works this paper leans on

50 extracted references · 50 canonical work pages · cited by 78 Pith papers · 1 internal anchor

  1. [1]

    and Dahl, George

    Association for Computational Linguistics. doi: 10.18653/v1/2021.naacl-main.385. URL https://aclanthology.org/2021.naacl-main.385. S. R. Bowman, J. Hyun, E. Perez, E. Chen, C. Pettit, S. Heiner, K. Lukošiūtė, A. Askell, A. Jones, A. Chen, A. Goldie, A. Mirhoseini, C. McKinnon, C. Olah, D. Amodei, D. Amodei, D. Drain, D. Li, E. Tran-Johnson, J. Kernion, ...

  2. [2]

    URL https://arxiv.org/abs/1805.00899. A. Jacovi, A. Caciularu, J. Herzig, R. Aharoni, B. Bohnet, and M. Geva. A comprehensive evaluation of tool-assisted generation strategies. arXiv preprint arXiv:2310.10062, 2023. URL https://arxiv.org/abs/2310.10062. M. Joshi, E. Choi, D. Weld, and L. Zettlemoyer. TriviaQA: A large scale distantly supervised challenge d...

  3. [3]

    doi: 10.18653/v1/2020.acl-main.626

    Association for Computational Linguistics. doi: 10.18653/v1/2020.acl-main.626. URL https://aclanthology.org/2020.acl-main.626. P. Sen and A. Saffari. What do models learn from question answering datasets? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2429–2438, Stroudsburg, PA, USA, Nov. 2020. Asso...

  4. [4]

    URL https://arxiv.org/abs/2307.09288. A. Wang, R. Y. Pang, A. Chen, J. Phang, and S. R. Bowman. SQuALITY: Building a long-document summarization dataset the hard way. In Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pages 1139–1156, Abu Dhabi, United Arab Emirates, Dec

  5. [5]

    (a) .... (b) .... (c) .... (d)

    Association for Computational Linguistics. doi: 10.18653/v1/2022.emnlp-main.75. URL https://aclanthology.org/2022.emnlp-main.75. Z. Yang, P. Qi, S. Zhang, Y. Bengio, W. Cohen, R. Salakhutdinov, and C. D. Manning. HotpotQA: A dataset for diverse, explainable multi-hop question answering. In Proceedings of the 2018 Conference on Empirical Methods in Natural ...

  6. [6]

    $20 bonus for each expert validator that answers correctly (two expert validators)

  7. [7]

    $15 bonus for each non-expert validator that answers incorrectly (three non-expert validators)

  8. [8]

    $30 extra bonus if both expert validators answer correctly, and at least 2/3 of the non-expert validators answer incorrectly. Expert Validation. Payment for expert validation is broken up into three sections

  9. [10]

    $10 bonus for each question answered correctly

  10. [11]

    Expert validators do not know whether they are the first or second non-expert validator when they are given the question

    $10 bonus for each question one validates where the 2nd expert validator answers correctly, and a $5 bonus for each question validated where the majority of non-expert validators answers incorrectly • Because it can’t be guaranteed that everyone will be assigned the same number of expert validations where their feedback can be incorporated (i.e., first expert validatio...

  11. [12]

    $10 base payment per question validated

  12. [13]

    We additionally gave detailed comments on several example questions for each part (not shown here), highlighting positive and negative aspects of the examples

    $30 bonus for each question answered correctly. A.5 Annotator Instructions. Here are the main instructions we gave to contractors for each of the four main parts of the project: question-writing, expert validation, question revision, and non-expert validation. We additionally gave detailed comments on several example questions for each part (not shown here)...

  13. [14]

    Clarity: Please write in complete sentences (e.g. use proper capitalization, etc.), and check for spelling and grammar mistakes – we encourage you to use Grammarly

  14. [15]

    Objectivity: Check that there is an objectively correct answer, which experts in your domain would agree upon

  15. [16]

    Difficulty: Consider whether the question will actually be challenging for non-experts to answer, especially considering that the non-experts will be experts in other fields, and will have access to any internet resources they can find

  16. [17]

    Search resistance: Pretend to be a non-expert: what would you search the web for to try and answer the question? Perform those web searches and see what you find; the question may be easier than you think.

  17. [18]

    Stand-alone questions: We want questions that can be answered correctly by other experts even if they aren’t shown the answer choices

  18. [19]

    Label-free: Do not label your answer choices with any extra information, e.g. adding letter or number labels like a) b) c) d), or 1. 2. 3. 4. This also applies to your explanations

  19. [20]

    Minimal calculation: Including some small amount of calculation/computation is okay, but that shouldn’t be the only reason for the question’s difficulty

  20. [21]

    Evaluation: Once you’ve written a question, remember to evaluate it to ensure it satisfies our expectations before submitting it

  21. [22]

    AI tools are not allowed: At no point in this project are you permitted to use AI assistants like ChatGPT, Bing Chat, Claude, Bard, etc. If we believe that you are using AI tools or relying on other individuals for help, we will not continue working with you. Suggestions and Strategies for Writing Questions

  22. [23]

    List out broad areas of knowledge/experience you have that are relevant for the specialty you were hired for

  23. [24]

    Think about topics in your domain that took you a long time to fully understand. Consider what aspects of the topic were difficult to grasp

  24. [25]

    Look back at your publications, theses, or other artifacts of your field you have produced. Are there aspects of your prior work you can draw on?

  25. [26]

    Explore niche or specialized areas within your domain that are not commonly addressed in general educational resources (there is likely to be less information easily available online in these areas)

  26. [27]

    Identify concepts or techniques that are often misunderstood or misused, even by professionals in your field

  27. [28]

    Look for topics that require extensive background knowledge or understanding of multiple concepts

  28. [29]

    Consider uncommon or complex applications of well-known principles in your domain

  29. [30]

    Was there some controversy in your domain of expertise over a question that is now resolved? If so, the opposing sides of this controversy could be good potential answer choices, as long as the controversy is truly settled

  30. [31]

    Is there some implicit practice or “way of doing things” that isn’t explicitly discussed in textbooks or classes, but it’s always there in the background?

  31. [32]

    Craft questions that appeal to many different concepts from your area of expertise, such that answering correctly requires synthesizing knowledge across a broad range of areas

  32. [33]

    Are there common analogies that people in your field use, that you could use to disguise the topic/motivation of the question? Consider making it difficult for non-experts to even know what domain the question is asking about

  33. [34]

    Don’t be afraid to use technical jargon; remembering and understanding terminology is an important aspect of expertise! However, make sure to be extremely precise in your language. In other words, don’t use a more technical but less accurate/correct term in place of a simpler but more precise word. Finally, once you’ve written a few questions, please review your qu...

  34. [35]

    Answer the question (a) Answer the question to the best of your ability. i. You are allowed to use Google to help you answer the question, as long as you indicate this in your answer explanation. (b) Provide an explanation for how you came to your decision. i. Your explanation should be detailed enough such that if you end up answering incorrectly, the que...

  35. [36]

    useful feedback

    Give Feedback and Suggest Revisions (a) Then, you will be shown the “correct” answer and explanations, and you will give feedback on the question. This feedback will be shown to the question writer, and they will use it to improve the question. i. If you answered correctly, can you make the question more difficult, for example by adding some extra irrelev...

  36. [37]

    Read the expert validator’s answer selection, explanation, and feedback, and consider the changes that they suggest

  37. [38]

    If they disagreed with your answer choice, consider why. Given their answer choice, explanation, and feedback, decide whether you agree with their feedback and wish to incorporate it

  38. [39]

    It may be the case that the expert validator answers incorrectly, but after seeing your explanation, believes they simply made a mistake, and that your answer is indeed correct, and the question is objective. If you don’t think the question requires any additional revision, you can copy/paste your original question without modification (assuming they didn...

  39. [40]

    Try to make your question more objective (especially if they answered incorrectly) and more difficult (regardless of whether they answered incorrectly). Even if you disagree with the expert validator on what the correct answer should be, try and find a way to rephrase your question so that you and they would agree (while retaining the question’s difficulty). It is okay ...

  40. [41]

    Revise Question

    If the expert validator does not suggest any revisions, and there aren’t any ways you can think of to improve the question, you still need to submit the “Revise Question” form - in this case you can simply copy/paste your original question into the form. Otherwise, our system won’t know that you at least considered revising the question (but decided not to)

  41. [42]

    Make sure your revised question still adheres to our expectations and guidelines listed in the Writing A New Question document. A.5.4 Non-Expert Validation. In non-expert validation, you’ll try to answer questions written by experts in different domains from you. You’re expected/encouraged to use Google, and any internet resources you can find to try and answe...

  42. [43]

    You are encouraged to use Google, and any websites you can think of or find that may help you answer the question. However, you are NOT allowed to use AI assistants like chatGPT, Bing Chat, Claude, Bard, etc., or ask people for help. If we detect that you are using AI tools or relying on other individuals for help, we will not continue working with you

  43. [44]

    We ask that you spend at least 15 minutes trying to answer each question before making your selection. If you haven’t settled on an answer choice in that time, we encourage you to spend as long as you need to be confident in your selection

  44. [45]

    These questions will be hard, and you will likely need to spend a while on each of them to make progress understanding the questions. Read relevant resources, take plenty of time, and give up and guess only if you’re pretty sure you have no realistic way of answering confidently

  45. [46]

    List out the most useful 3-5 websites you found while answering the question

  46. [47]

    You will also be given the opportunity to give feedback on the question. We’re especially interested in feedback about whether the question was too easy (if so, why was it too easy?), but please feel free to give feedback of any other form! Suggestions and Strategies for Non-Expert Validation

  47. [48]

    Look up definitions for all of the unfamiliar terms in the question and answer choices. Keep a list of those definitions handy so you can easily refer back to the definitions if you forget the jargon

  48. [49]

    Look for primary resources, like research papers and textbooks, as these can often contain clearer explanations than sources like Wikipedia (although Wikipedia can be useful in many cases as well)

  49. [50]

    Consider if there are online tools for answering correctly. For example, if you’re given a difficult question about what the best move in a certain chess position is, you could input the board into a chess engine like Stockfish, and see what the algorithm suggests. For questions involving mathematical computations, Wolfram Alpha and other online symbolic ...

  50. [51]

    Sometimes online forums or question-answering platforms like Stack Exchange can already have questions similar to the one you’re given - this simplifies the problem to transferring the answer from the online version to fit the specific details of the question you’re given.
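
The three bonus terms quoted in entries [6] to [8] ($20 per expert validator who answers correctly, $15 per non-expert validator who answers incorrectly, and a $30 extra bonus when both experts are correct and at least 2/3 of the non-experts are incorrect) can be combined into a single per-question figure. The sketch below is only an illustration of that arithmetic. It assumes the three terms apply to the same question and simply add up, and it leaves out any base payment, which these entries do not state. The function name `writer_bonus` and its parameters are hypothetical, not taken from the paper.

```python
def writer_bonus(experts_correct: int, non_experts_incorrect: int,
                 n_experts: int = 2, n_non_experts: int = 3) -> int:
    """Per-question bonus in USD from the three quoted bonus terms only.

    Assumption: the terms are additive and no base payment is included.
    The defaults (two experts, three non-experts) follow the quoted text.
    """
    if not 0 <= experts_correct <= n_experts:
        raise ValueError("experts_correct out of range")
    if not 0 <= non_experts_incorrect <= n_non_experts:
        raise ValueError("non_experts_incorrect out of range")

    # $20 for each expert validator that answers correctly.
    bonus = 20 * experts_correct
    # $15 for each non-expert validator that answers incorrectly.
    bonus += 15 * non_experts_incorrect
    # $30 extra if both experts are correct and at least 2/3 of the
    # non-experts are incorrect (integer comparison avoids float rounding).
    all_experts_correct = experts_correct == n_experts
    enough_non_experts_wrong = 3 * non_experts_incorrect >= 2 * n_non_experts
    if all_experts_correct and enough_non_experts_wrong:
        bonus += 30
    return bonus


if __name__ == "__main__":
    # Both experts correct, all three non-experts incorrect.
    print(writer_bonus(2, 3))
    # One expert correct, so the extra bonus does not apply.
    print(writer_bonus(1, 3))
```

Under these assumptions, the largest payout goes to questions that experts agree on and non-experts miss, which matches the objectivity and difficulty goals stated in the annotator instructions.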