PhySciBench benchmark shows current AI models achieve at most 33.5% accuracy on physical science tasks; DelveAgent framework improves accuracy by up to 7.5 points and cuts costs to one-third.
hub
Frontierscience: Evaluating ai’s ability to perform expert-level scientific tasks
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
years
2026 14representative citing papers
Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
CITE certifies that a prespecified answer is the unique mode of an LLM response distribution with anytime-valid error control under arbitrary data-driven stopping and without prior knowledge of the answer set.
Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.
Argus coordinates a Navigator and multiple Searchers via an evidence graph for deep research, reporting average gains of 5.5 points with one Searcher and 12.7 points with eight parallel Searchers across eight benchmarks, reaching 86.2 on BrowseComp with 64 Searchers.
A training recipe for tool-integrated reasoning models achieves state-of-the-art open-source results on math benchmarks such as 96.7% and 99.2% on AIME 2025 at 4B and 30B scales by balancing tool-use trajectories and optimizing for pass@k during SFT before stable RLVR.
LABBench2 is a more challenging benchmark than LAB-Bench for assessing AI performance on biology research tasks, with frontier models showing accuracy drops of 26-46% across subtasks.
A 30B model trained via reverse-perplexity SFT, two-stage RL, and test-time scaling achieves gold-medal-level results on IMO 2025 and IPhO 2024/2025.
COMPOSITE-STEM is a new benchmark of 70 expert-curated STEM tasks where frontier AI agents score at most 21% using flexible exact-match and rubric-based grading.
The paper delivers a stage-by-stage roadmap for AI in research, showing reliable assistance in retrieval and tool tasks but fragility in novelty and judgment, advocating human-governed collaboration.
A summary of expert opinions on AI's impact on the research-education environment in physics from a KITP discussion session.
citing papers explorer
No citing papers match the current filters.