Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.
hub
Laurent, Alex Andonian, Benjamin Tenmann, Siddharth Narayanan, Geemi P
14 Pith papers cite this work. Polarity classification is still indexing.
hub tools
citation-role summary
citation-polarity summary
roles
background 2polarities
background 2representative citing papers
DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79.4% statement accuracy according to independent reviewers.
TxBench-PP benchmark shows leading AI agents achieve at most 59% success on tasks requiring recovery of preclinical pharmacology conclusions from assay data.
Introduces SpatialBench-Long benchmark with 24 evaluations on spatial biology datasets from PDAC, glioblastoma, lung adenocarcinoma and optic nerve systems, reporting top model performance at 8/72 runs (11.1%).
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
AtomisticSkills is a new harness framework with 100+ human-curated skills that lets general AI agents perform atomistic research tasks including simulations, screening, and analysis, shown on electrolyte design, CO2 capture, drug screening, and catalyst tasks.
BioResearcher is a new multi-agent system that leads baselines on single-step biomedical tests, BixBench, BaisBench, and a 30-query clinical discovery benchmark with 74.7% positive hit rate.
citing papers explorer
-
Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research
Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.
-
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking
DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.
-
BioXArena: Benchmarking LLM Agents on Multi-Modal Biomedical Machine Learning Tasks
BioXArena benchmarks LLM agents on generating end-to-end ML pipelines for 76 multi-modal biomedical tasks, with MLEvolve plus Gemini-3.1-Pro scoring highest at 0.666.
-
Collider-Bench: Benchmarking AI Agents with Particle Physics Analysis Reproduction
Collider-Bench is a new benchmark showing that current LLM agents cannot reliably reproduce LHC analyses at the level of a physicist-in-the-loop.
-
AssayBench: An Assay-Level Virtual Cell Benchmark for LLMs and Agents
AssayBench is a new gene-ranking benchmark for phenotypic CRISPR screens that shows zero-shot generalist LLMs outperform both biology-specific LLMs and trainable baselines on adjusted nDCG.
-
AI scientists produce results without reasoning scientifically
LLM agents execute scientific tasks but fail to follow core scientific reasoning norms such as evidence consideration and belief revision based on refutations.
-
Kosmos: An AI Scientist for Autonomous Discovery
Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79.4% statement accuracy according to independent reviewers.
-
TxBench-PP: Analyzing AI Agent Performance on Small-Molecule Preclinical Pharmacology
TxBench-PP benchmark shows leading AI agents achieve at most 59% success on tasks requiring recovery of preclinical pharmacology conclusions from assay data.
-
Verifiable Benchmarking of Long-Horizon Spatial Biology
Introduces SpatialBench-Long benchmark with 24 evaluations on spatial biology datasets from PDAC, glioblastoma, lung adenocarcinoma and optic nerve systems, reporting top model performance at 8/72 runs (11.1%).
-
LEAP: Trajectory-Level Evaluation of LLMs in Iterative Scientific Design
LEAPBench shows trajectory scoring changes best-model rankings on 53% of tasks, LLMs do not beat Bayesian optimization, and domain-aware prompting underperforms domain-agnostic on biology tasks aligned with published literature.
-
CFDLLMBench: A Benchmark Suite for Evaluating Large Language Models in Computational Fluid Dynamics
CFDLLMBench is a new benchmark suite with CFDQuery, CFDCodeBench, and FoamBench to evaluate LLMs on graduate-level CFD knowledge, numerical reasoning, and context-dependent code implementation.
-
DeepResearch Bench: A Comprehensive Benchmark for Deep Research Agents
DeepResearch Bench supplies 100 expert-crafted PhD-level tasks and two human-aligned evaluation frameworks to measure deep research agents on report quality and citation accuracy.
-
Harnessing AtomisticSkills for Agentic Atomistic Research
AtomisticSkills is a new harness framework with 100+ human-curated skills that lets general AI agents perform atomistic research tasks including simulations, screening, and analysis, shown on electrolyte design, CO2 capture, drug screening, and catalyst tasks.
-
BioResearcher: Scenario-Guided Multi-Agent for Translational Medicine
BioResearcher is a new multi-agent system that leads baselines on single-step biomedical tests, BixBench, BaisBench, and a 30-query clinical discovery benchmark with 74.7% positive hit rate.