The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, David Ha, Jakob Foerster, Jeff Clune, Robert Tjarko Lange · 2024 · cs.AI · arXiv 2408.06292

Canonical reference. 77% of citing Pith papers cite this work as background.

159 Pith papers citing it

Background: 77% of classified citations

abstract

One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist

citation-role summary

background: 30 · baseline: 2 · dataset: 1 · method: 1 · other: 1

citation-polarity summary

background: 27 · unclear: 3 · baseline: 2 · support: 1 · use dataset: 1 · use method: 1
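
The 77% background figure quoted at the top of the page appears to follow from the polarity counts above: 27 of the 35 classified citations are labeled background. The short sketch below checks that arithmetic. It is an inference from the numbers shown on this page, not a documented Pith formula; the `shares` helper and the variable names are hypothetical.

```python
# Counts copied from the two summaries on this page.
role_counts = {"background": 30, "baseline": 2, "dataset": 1, "method": 1, "other": 1}
polarity_counts = {
    "background": 27,
    "unclear": 3,
    "baseline": 2,
    "support": 1,
    "use dataset": 1,
    "use method": 1,
}


def shares(counts):
    """Return each label's share of the total, as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}


role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

# Both summaries cover the same 35 classified citations.
print(sum(role_counts.values()), sum(polarity_counts.values()))

# Assumption: the headline percentage uses the polarity labels (27 / 35),
# since the role labels would give a different share (30 / 35).
print(polarity_shares["background"], role_shares["background"])
```

Note that the denominator here is the 35 classified citations, not the 159 citing papers listed for the hub.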

authors

Chris Lu, Cong Lu, David Ha, Jakob Foerster, Jeff Clune, Robert Tjarko Lange

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

The Last Human-Written Paper: Agent-Native Research Artifacts

cs.LG · 2026-04-27 · unverdicted · novelty 8.0

Introduces ARA as a four-layer machine-executable research package and reports benchmark gains in agent QA accuracy and reproduction success.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

AI Trading's Alpha Singularity: Emergent Market Reasoning through Agent-to-Agent Self-Evolution

cs.AI · 2026-06-28 · reject · novelty 7.0

Multi-agent LLM system Agora under Sealed Joint Search conditions produces +1.87 holdout Sharpe on CSI 1000 over a 91-day sealed period, exceeding the best baseline at +1.334 under favorable seed.

AICID: Unique Identifiers for AI Scientists

cs.DL · 2026-06-27 · unverdicted · novelty 7.0

The paper proposes AICID as a new identifier system to make the provenance of AI-generated scholarly work transparent and machine-readable.

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

MetaSyn benchmark shows LLM agents recover at most 52.7% of relevant studies in meta-analysis pipelines due to failures in PI/ECO-based screening despite strong retrieval.

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

stat.ML · 2026-05-25 · unverdicted · novelty 7.0

DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.

LLM-driven design of physics-constrained constitutive models: two agents are better than one

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

A Creator-Inspector multi-agent LLM pipeline for constitutive artificial neural networks increases the rate of models satisfying all nine physical constraints to 100% or 56% depending on the LLM backbone.

Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good

cs.CY · 2026-05-21 · unverdicted · novelty 7.0

Survey of 112 agentic AI for social good papers reveals moral-geographic asymmetry with 73% lacking geographic context (lowest for SDG 16) and only 25% reporting deployments.

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

cs.AI · 2026-05-21 · conditional · novelty 7.0

IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

cs.LG · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.

1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.

Test-Time Learning with an Evolving Library

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.

Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

A finite sheaf-theoretic framework ranks obstruction measures to identify when an AI agent's theory must deform within its language or extend to a new one, validated on a controlled transition benchmark.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

ASIA: an Autonomous System Identification Agent

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact density and completeness.

MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.

citing papers explorer

Showing 24 of 24 citing papers after filters.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · none · ref 10 · internal anchor

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · none · ref 23 · internal anchor

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · none · ref 26 · internal anchor

MetaSyn benchmark shows LLM agents recover at most 52.7% of relevant studies in meta-analysis pipelines due to failures in PI/ECO-based screening despite strong retrieval.

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

cs.CL · 2026-05-14 · unverdicted · none · ref 21 · internal anchor

GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.

ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

cs.CL · 2026-04-15 · unverdicted · none · ref 3 · internal anchor

ReviewGrounder decomposes review generation into rubric-guided drafting and tool-integrated grounding stages, outperforming larger baseline models on a new benchmark measuring alignment with human judgments and review quality.

From Passive Generation to Investigation: A Proactive Scientific Peer Review Agent

cs.CL · 2026-06-11 · unverdicted · none · ref 29 · internal anchor

ProReviewer is an MDP-formulated proactive peer review agent trained with SFT and RL on an 8B model that outperforms larger frontier LLMs on review quality metrics.

Toward Generalist Autonomous Research via Hypothesis-Tree Refinement

cs.CL · 2026-06-10 · unverdicted · none · ref 136 · internal anchor

Arbor combines a coordinator, executors, and a hypothesis tree to enable cumulative autonomous research, outperforming Codex and Claude Code by over 2.5x on six real tasks and reaching 86.36% Any Medal on MLE-Bench Lite.

Mem-$\pi$: Adaptive Memory through Learning When and What to Generate

cs.CL · 2026-05-20 · unverdicted · none · ref 26 · internal anchor

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

cs.CL · 2026-05-07 · unverdicted · none · ref 17 · 2 links · internal anchor

TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.

HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution

cs.CL · 2026-04-20 · unverdicted · none · ref 2 · internal anchor

HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.

Language Model Goal Selection Differs from Humans' in a Self-Directed Learning Task

cs.CL · 2026-02-06 · unverdicted · none · ref 10 · internal anchor

LLMs diverge from human goal selection in self-directed learning by exploiting single solutions with low variability across instances.

Scheming Ability in LLM-to-LLM Strategic Interactions

cs.CL · 2025-10-11 · conditional · none · ref 37 · internal anchor

Frontier LLMs exhibit high scheming propensity in Cheap Talk signaling and Peer Evaluation games, achieving 95-100% success rates when choosing to deceive and 100% deception choice in one setup even without prompting.

ShinkaEvolve: Towards Open-Ended And Sample-Efficient Program Evolution

cs.CL · 2025-09-17 · unverdicted · none · ref 237 · internal anchor

ShinkaEvolve improves sample efficiency in LLM-driven program evolution via parent sampling, code novelty rejection-sampling, and bandit LLM ensemble selection, achieving new SOTA circle packing with 150 samples and gains on math reasoning and competitive programming tasks.

InternBootcamp Technical Report: Boosting LLM Reasoning with Verifiable Task Scaling

cs.CL · 2025-08-12 · unverdicted · none · ref 29 · internal anchor

InternBootcamp supplies 1000+ verifiable, auto-generated task environments across domains that enable task scaling to improve LLM reasoning, producing a 32B model with state-of-the-art results on the new Bootcamp-EVAL benchmark.

RExBench: Can coding agents autonomously implement AI research extensions?

cs.CL · 2025-06-27 · unverdicted · none · ref 32 · internal anchor

RExBench is a new benchmark showing that LLM coding agents fail to autonomously implement most realistic research extensions to prior AI papers.

XtraGPT: Context-Aware and Controllable Academic Paper Revision via Human-AI Collaboration

cs.CL · 2025-05-16 · conditional · none · ref 24 · internal anchor

XtraGPT is a suite of 1.5B-14B parameter open-source LLMs fine-tuned on 140,000 revision pairs from 7,000 top-tier papers to support controllable, context-aware academic paper editing.

EvoGens: A Population-Based Heuristic Search Framework for Scientific Idea Generation

cs.CL · 2026-05-29 · unverdicted · none · ref 23 · internal anchor

EvoGens uses rank-based mutation, semantic-aware crossover, and lightweight evaluation to evolve populations of LLM-generated scientific ideas, boosting novelty and diversity metrics.

CP-Agent: A Calibrated Risk-Controlled Agent for Feedback-Driven Competitive Programming

cs.CL · 2026-05-23 · unverdicted · none · ref 33 · internal anchor

CP-Agent improves LLM competitive programming performance via calibrated feedback mechanisms that target false-admission risk, evidence against bad programs, and success hazard.

Code as Agent Harness

cs.CL · 2026-05-18 · accept · none · ref 63 · internal anchor

A survey that organizes existing work on LLM-based agents around code as the central harness, structured in three layers of interfaces, mechanisms, and multi-agent scaling, with applications across domains and listed open challenges.

StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

cs.CL · 2026-05-07 · unverdicted · none · ref 26 · internal anchor

StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

cs.CL · 2026-04-27 · unreviewed · ref 8 · 2 links · internal anchor

Toward Autonomous Long-Horizon Engineering for ML Research

cs.CL · 2026-04-14 · unreviewed · ref 12 · internal anchor

AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

cs.CL · 2026-04-07 · unreviewed · ref 20 · internal anchor

IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

cs.CL · 2025-07-21 · unreviewed · ref 25 · internal anchor
