hub

Frontierscience: Evaluating ai’s ability to perform expert-level scien- tific reasoning

Miles Wang, Robi Lin, Kat Hu, Joy Jiao, Neil Chowdhury, Ethan Chang, Tejal Patwardhan · 2026 · arXiv 2601.21165

12 Pith papers cite this work. Polarity classification is still indexing.

12 Pith papers citing it

read on arXiv browse 12 citing papers

hub tools

JSON dossier citing papers JSON arXiv source

citation-role summary

dataset 2 background 1

citation-polarity summary

use dataset 2 background 1

representative citing papers

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction

cs.LG · 2026-05-13 · unverdicted · novelty 7.0

Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.

CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency

stat.ML · 2026-05-07 · unverdicted · novelty 7.0

CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.

Argus: Evidence Assembly for Scalable Deep Research Agents

cs.CL · 2026-05-15 · unverdicted · novelty 6.0 · 2 refs

Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.

Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning

cs.CL · 2026-05-07 · unverdicted · novelty 6.0

A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.

LABBench2: An Improved Benchmark for AI Systems Performing Biology Research

cs.AI · 2026-02-04 · unverdicted · novelty 6.0

LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.

Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling

cs.AI · 2026-05-13 · unverdicted · novelty 5.0

A 30B model trained via reverse-perplexity SFT, two-stage RL, and test-time scaling achieves gold-medal-level results on IMO 2025 and IPhO 2024/2025.

EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale

cs.AI · 2026-04-19 · unverdicted · novelty 5.0

EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.

COMPOSITE-Stem

cs.AI · 2026-04-10 · conditional · novelty 5.0

COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.

AI for Auto-Research: Roadmap & User Guide

cs.AI · 2026-05-18 · unverdicted · novelty 4.0

The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.

AI and the Research-Education Environment of Physics

physics.ed-ph · 2026-05-04 · unverdicted · novelty 1.0

A summary of expert opinions on AI's impact on the research-education environment in physics from a KITP discussion session.

SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning

cs.AI · 2026-05-02

citing papers explorer

Showing 12 of 12 citing papers.

Forecasting Scientific Progress with Artificial Intelligence cs.AI · 2026-05-21 · unverdicted · none · ref 46
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction cs.LG · 2026-05-13 · unverdicted · none · ref 15
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency stat.ML · 2026-05-07 · unverdicted · none · ref 22
CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.
Argus: Evidence Assembly for Scalable Deep Research Agents cs.CL · 2026-05-15 · unverdicted · none · ref 41 · 2 links
Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.
Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning cs.CL · 2026-05-07 · unverdicted · none · ref 38
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
LABBench2: An Improved Benchmark for AI Systems Performing Biology Research cs.AI · 2026-02-04 · unverdicted · none · ref 29
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling cs.AI · 2026-05-13 · unverdicted · none · ref 1
A 30B model trained via reverse-perplexity SFT, two-stage RL, and test-time scaling achieves gold-medal-level results on IMO 2025 and IPhO 2024/2025.
EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale cs.AI · 2026-04-19 · unverdicted · none · ref 14
EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.
COMPOSITE-Stem cs.AI · 2026-04-10 · conditional · none · ref 9
COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.
AI for Auto-Research: Roadmap & User Guide cs.AI · 2026-05-18 · unverdicted · none · ref 209
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
AI and the Research-Education Environment of Physics physics.ed-ph · 2026-05-04 · unverdicted · none · ref 10
A summary of expert opinions on AI's impact on the research-education environment in physics from a KITP discussion session.
SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning cs.AI · 2026-05-02 · unreviewed · ref 40

Frontierscience: Evaluating ai’s ability to perform expert-level scien- tific reasoning

hub tools

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer