
Towards an AI co-scientist

Alexander Daryin, Anil Palepu, Juraj Gottweis, Petar Sirkovic, Tao Tu, Wei-Hung Weng · 2025 · cs.AI · arXiv 2502.18864

Canonical reference. 88% of citing Pith papers cite this work as background.

100 Pith papers citing it

Background 88% of classified citations


abstract

Scientific discovery relies on scientists generating novel hypotheses that undergo rigorous experimental validation. To augment this process, we introduce an AI co-scientist, a multi-agent system built on Gemini 2.0. The AI co-scientist is intended to help uncover new, original knowledge and to formulate demonstrably novel research hypotheses and proposals, building upon prior evidence and aligned to scientist-provided research objectives and guidance. The system's design incorporates a generate, debate, and evolve approach to hypothesis generation, inspired by the scientific method and accelerated by scaling test-time compute. Key contributions include: (1) a multi-agent architecture with an asynchronous task execution framework for flexible compute scaling; (2) a tournament evolution process for self-improving hypotheses generation. Automated evaluations show continued benefits of test-time compute, improving hypothesis quality. While general purpose, we focus development and validation in three biomedical areas: drug repurposing, novel target discovery, and explaining mechanisms of bacterial evolution and anti-microbial resistance. For drug repurposing, the system proposes candidates with promising validation findings, including candidates for acute myeloid leukemia that show tumor inhibition in vitro at clinically applicable concentrations. For novel target discovery, the AI co-scientist proposed new epigenetic targets for liver fibrosis, validated by anti-fibrotic activity and liver cell regeneration in human hepatic organoids. Finally, the AI co-scientist recapitulated unpublished experimental results via a parallel in silico discovery of a novel gene transfer mechanism in bacterial evolution. These results, detailed in separate, co-timed reports, demonstrate the potential to augment biomedical and scientific discovery and usher an era of AI empowered scientists.


citation-role summary

background 22 · baseline 1 · other 1

citation-polarity summary

background 21 · unclear 2 · baseline 1


representative citing papers

Unlocking the Visual Record of Materials Science: A Large-Scale Multimodal Dataset from Scientific Literature

cs.CV · 2026-06-29 · accept · novelty 8.0

The MatMMExtract pipeline creates the MatSciFig dataset of 391k annotated materials science figure panels and the MaterialScope detection dataset, with high accuracy.

Learning to Hand Off: Provably Convergent Workflow Learning under Interface Constraints

cs.AI · 2026-05-18 · unverdicted · novelty 8.0

Formalizes interface-constrained semi-Markov decision processes and proves a finite-sample bound for neural IC-Q that decomposes into neural approximation error, interface gap, and mixing-time residual, with experiments showing parity to centralized oracles.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Evaluating Large Language Models in Scientific Discovery

cs.AI · 2025-12-17 · unverdicted · novelty 8.0

The SDE benchmark shows LLMs lag on scientific discovery tasks relative to general science tests, with diminishing scaling returns and shared weaknesses across models.

Why Do Multi-Agent LLM Systems Fail?

cs.AI · 2025-03-17 · unverdicted · novelty 8.0

The authors create the first large-scale dataset and taxonomy of failure modes in multi-agent LLM systems to explain their limited performance gains.

CollabSim: A CSCW-Grounded Methodology for Investigating Collaborative Competence of LLM Agents through Controlled Multi-Agent Experiments

cs.CL · 2026-06-04 · unverdicted · novelty 7.0

CollabSim is a new CSCW-grounded simulation framework that enables controlled multi-agent experiments to measure collaborative competence in LLM agents.

Matter to Mechanism: A Benchmark for AI Co-Scientists in Materials and Battery Research

cs.CE · 2026-06-01 · unverdicted · novelty 7.0

Introduces the Matter to Mechanism benchmark of 2,645 structured instances and a composite metric suite for evaluating AI co-scientists on problem-to-hypothesis reasoning in battery materials research.

Forecasting Scientific Progress with Artificial Intelligence

cs.AI · 2026-05-21 · unverdicted · novelty 7.0

Introduces the CUSP benchmark across 4760 events and finds frontier AI models can pick plausible directions but fail to predict whether or when scientific advances will occur, with performance varying by domain and insensitive to training cutoffs.

EngiAI: A Multi-Agent Framework and Benchmark Suite for LLM-Driven Engineering Design

cs.AI · 2026-05-19 · conditional · novelty 7.0 · 2 refs

EngiAI introduces a LangGraph-based multi-agent framework and a three-part benchmark suite for LLM-driven engineering design, reporting high task completion rates for proprietary models on Beams2D and Photonics2D problems.

An Experimental Method to Study Opinion Diffusion in Human-AI Hybrid Societies

cs.SI · 2026-05-09 · unverdicted · novelty 7.0

Hybrid human-AI networks in 5x5 grids reached lower final polarization than human-only networks after eight rounds of opinion revision on polarizing topics.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unverdicted · novelty 7.0 · 2 refs

MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.

AI co-mathematician: Accelerating mathematicians with agentic AI

cs.AI · 2026-05-07 · unverdicted · novelty 7.0

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

physics.flu-dyn · 2026-05-07 · conditional · novelty 7.0 · 3 refs

AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.

End-to-end autonomous scientific discovery on a real optical platform

cs.AI · 2026-04-29 · unverdicted · novelty 7.0

An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.

RosettaSearch: Multi-Objective Inference-Time Search for Protein Sequence Design

cs.LG · 2026-04-19 · unverdicted · novelty 7.0

RosettaSearch applies LLM-driven multi-objective search at inference time to improve backbone-conditioned protein sequences, recovering designs with 18-68% better structural fidelity and 2.5x higher success rates than single-pass models like LigandMPNN.

VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

cs.MA · 2026-04-13 · unverdicted · novelty 7.0

VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.

Kosmos: An AI Scientist for Autonomous Discovery

cs.AI · 2025-11-04 · unverdicted · novelty 7.0

Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79.4% statement accuracy according to independent reviewers.

AlphaEvolve: A coding agent for scientific and algorithmic discovery

cs.AI · 2025-06-16 · unverdicted · novelty 7.0

AlphaEvolve is an LLM-orchestrated evolutionary coding agent that discovered a 4x4 complex matrix multiplication algorithm using 48 scalar multiplications, the first improvement over Strassen's algorithm in 56 years, plus optimizations for Google data centers and hardware.

Code Researcher: Deep Research Agent for Large Systems Code and Commit History

cs.SE · 2025-05-27 · unverdicted · novelty 7.0

Code Researcher retrieves global context via multi-step reasoning on code semantics, patterns, and commit history to fix Linux kernel crashes, reaching 48% crash-resolution rate versus 31% for baselines.

Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

cs.HC · 2024-09-23 · unverdicted · novelty 7.0

Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity support than a baseline LLM in a user study with CS researchers.

Autonomous Scientific Discovery via Iterative Meta-Reflection

cs.CV · 2026-07-01 · unverdicted · novelty 6.0

DiscoPER uses code generation, statistical validation, and second-order meta-reflection on accumulated discoveries to recover 8 of 9 known ecological patterns on a new benchmark at 72.7% support rate.

Testing Frontier Large Language Models' Physics Literacy in Parallel Physical Worlds

cs.LG · 2026-06-30 · unverdicted · novelty 6.0

Introduces an auditable four-stage diagnostic for LLM physics reasoning in novel frameworks and applies it to three parallel worlds, yielding pass rates of 6/15, 6/15, and 0/15 on frontier models with noted qualitative-quantitative asymmetry.

AutoTrainess: Teaching Language Models to Improve Language Models Autonomously

cs.CL · 2026-06-30 · unverdicted · novelty 6.0

AutoTrainess exposes training operations via agent-computer interfaces and outperforms CLI-only baselines on PostTrainBench with scores of 26.94 vs 23.21 for GPT-5.4 and similar gains on other models.

