FrontierScience: Evaluating AI’s ability to perform expert-level scientific reasoning
12 Pith papers cite this work.
Citing papers by year: 2026 (12)
citing papers explorer
- Forecasting Scientific Progress with Artificial Intelligence
  Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
- Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
  Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
- CITE: Anytime-Valid Statistical Inference in LLM Self-Consistency
  CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.
- Argus: Evidence Assembly for Scalable Deep Research Agents
  Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.
- Teaching Thinking Models to Reason with Tools: A Full-Pipeline Recipe for Tool-Integrated Reasoning
  A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
- LABBench2: An Improved Benchmark for AI Systems Performing Biology Research
  LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
- Achieving Gold-Medal-Level Olympiad Reasoning via Simple and Unified Scaling
  A 30B model trained via reverse-perplexity SFT, two-stage RL, and test-time scaling achieves gold-medal-level results on IMO 2025 and IPhO 2024/2025.
- EvoMaster: A Foundational Evolving Agent Framework for Agentic Science at Scale
  EvoMaster is a self-evolving agent framework that achieves state-of-the-art results on scientific benchmarks by enabling iterative hypothesis refinement and knowledge accumulation across domains.
- COMPOSITE-STEM
  COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.
- AI for Auto-Research: Roadmap & User Guide
  The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
- AI and the Research-Education Environment of Physics
  A summary of expert opinions on AI's impact on the research-education environment in physics from a KITP discussion session.
- SciResearcher: Scaling Deep Research Agents for Frontier Scientific Reasoning