The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, David Ha, Jakob Foerster, Jeff Clune, Robert Tjarko Lange · 2024 · cs.AI · arXiv 2408.06292

Canonical reference. 77% of citing Pith papers cite this work as background.

163 Pith papers citing it

abstract

One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist

citation-role summary

background: 30 · baseline: 2 · dataset: 1 · method: 1 · other: 1

citation-polarity summary

background: 27 · unclear: 3 · baseline: 2 · support: 1 · use dataset: 1 · use method: 1
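
As a quick arithmetic check on the two summaries, the sketch below recomputes each label's share of the classified citations. The counts are copied from this page; the `shares` helper and the variable names are illustrative only. That the 77% headline figure comes from the polarity counts rather than the role counts is an inference from the numbers, not something the page states.

```python
# Counts copied from the citation-role and citation-polarity summaries.
role_counts = {"background": 30, "baseline": 2, "dataset": 1, "method": 1, "other": 1}
polarity_counts = {
    "background": 27, "unclear": 3, "baseline": 2,
    "support": 1, "use dataset": 1, "use method": 1,
}

def shares(counts):
    """Return each label's share of the total as a whole-number percentage."""
    total = sum(counts.values())
    return {label: round(100 * n / total) for label, n in counts.items()}

role_shares = shares(role_counts)
polarity_shares = shares(polarity_counts)

# Both summaries cover the same number of classified citations.
print(sum(role_counts.values()), sum(polarity_counts.values()))  # 35 35
# Share labelled "background" under each scheme.
print(role_shares["background"], polarity_shares["background"])  # 86 77
```

Under these counts, 27 of 35 polarity-classified citations (about 77%) are background, while 30 of 35 (about 86%) carry the background role, so the headline percentage appears to match the polarity summary.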

authors

Chris Lu, Cong Lu, David Ha, Jakob Foerster, Jeff Clune, Robert Tjarko Lange

representative citing papers

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

cs.AI · 2026-04-15 · conditional · novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

cs.CL · 2026-05-12 · unverdicted · novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

cs.CL · 2026-05-11 · unverdicted · novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

cs.AI · 2026-04-28 · accept · novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

The Last Human-Written Paper: Agent-Native Research Artifacts

cs.LG · 2026-04-27 · unverdicted · novelty 8.0

Introduces ARA as a four-layer machine-executable research package and reports benchmark gains in agent QA accuracy and reproduction success.

FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

physics.chem-ph · 2026-04-03 · conditional · novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.

Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

quant-ph · 2025-10-23 · accept · novelty 8.0

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.

FARS: A Fully Automated Research System Deployed at Scale

cs.AI · 2026-06-30 · unverdicted · novelty 7.0

FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.

AI Trading's Alpha Singularity: Emergent Market Reasoning through Agent-to-Agent Self-Evolution

cs.AI · 2026-06-28 · reject · novelty 7.0

Multi-agent LLM system Agora under Sealed Joint Search conditions produces +1.87 holdout Sharpe on CSI 1000 over a 91-day sealed period, exceeding the best baseline at +1.334 under favorable seed.

AICID: Unique Identifiers for AI Scientists

cs.DL · 2026-06-27 · unverdicted · novelty 7.0

The paper proposes AICID as a new identifier system to make the provenance of AI-generated scholarly work transparent and machine-readable.

Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio

cs.CL · 2026-06-15 · unverdicted · novelty 7.0

MetaSyn benchmark shows LLM agents recover at most 52.7% of relevant studies in meta-analysis pipelines due to failures in PI/ECO-based screening despite strong retrieval.

DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking

stat.ML · 2026-05-25 · unverdicted · novelty 7.0

DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.

LLM-driven design of physics-constrained constitutive models: two agents are better than one

cs.LG · 2026-05-22 · unverdicted · novelty 7.0

A Creator-Inspector multi-agent LLM pipeline for constitutive artificial neural networks increases the rate of models satisfying all nine physical constraints to 100% or 56% depending on the LLM backbone.

Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good

cs.CY · 2026-05-21 · unverdicted · novelty 7.0

Survey of 112 agentic AI for social good papers reveals moral-geographic asymmetry with 73% lacking geographic context (lowest for SDG 16) and only 25% reporting deployments.

IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

cs.AI · 2026-05-21 · conditional · novelty 7.0

IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

cs.LG · 2026-05-17 · unverdicted · novelty 7.0 · 2 refs

FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.

1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

cs.LG · 2026-05-16 · unverdicted · novelty 7.0

Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.

The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

Builds a 2M-contribution graph from 230k papers with 12.5M prerequisite links and reports 0.48 MAP on temporal backtesting for predicting enabling technologies.

Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation

cs.CL · 2026-05-14 · unverdicted · novelty 7.0

GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.

Test-Time Learning with an Evolving Library

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.

Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

A finite sheaf-theoretic framework ranks obstruction measures to identify when an AI agent's theory must deform within its language or extend to a new one, validated on a controlled transition benchmark.

Harnessing Agentic Evolution

cs.AI · 2026-05-13 · unverdicted · novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

ASIA: an Autonomous System Identification Agent

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.

PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

cs.AI · 2026-05-11 · unverdicted · novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

citing papers explorer

Showing 50 of 163 citing papers.

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot cs.AI · 2026-04-15 · conditional · none · ref 8 · internal anchor
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty cs.CL · 2026-05-12 · unverdicted · none · ref 10 · internal anchor
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation cs.CL · 2026-05-11 · unverdicted · none · ref 23 · internal anchor
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery cs.AI · 2026-04-28 · accept · none · ref 1 · internal anchor
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
The Last Human-Written Paper: Agent-Native Research Artifacts cs.LG · 2026-04-27 · unverdicted · none · ref 3 · internal anchor
Introduces ARA as a four-layer machine-executable research package and reports benchmark gains in agent QA accuracy and reproduction success.
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations physics.chem-ph · 2026-04-03 · conditional · none · ref 12 · internal anchor
FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.
Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems quant-ph · 2025-10-23 · accept · full · ref 2 · internal anchor
A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructions and no-go proofs.
FARS: A Fully Automated Research System Deployed at Scale cs.AI · 2026-06-30 · unverdicted · none · ref 16 · internal anchor
FARS deployed at scale produced 166 AI/ML papers across 67 topics that received 282 structured human reviews indicating some review-worthy outputs alongside recurring failure modes.
AI Trading's Alpha Singularity: Emergent Market Reasoning through Agent-to-Agent Self-Evolution cs.AI · 2026-06-28 · reject · none · ref 21 · internal anchor
Multi-agent LLM system Agora under Sealed Joint Search conditions produces +1.87 holdout Sharpe on CSI 1000 over a 91-day sealed period, exceeding the best baseline at +1.334 under favorable seed.
AICID: Unique Identifiers for AI Scientists cs.DL · 2026-06-27 · unverdicted · none · ref 9 · internal anchor
The paper proposes AICID as a new identifier system to make the provenance of AI-generated scholarly work transparent and machine-readable.
Benchmarking LLM Agents on Meta-Analysis Articles from Nature Portfolio cs.CL · 2026-06-15 · unverdicted · none · ref 26 · internal anchor
MetaSyn benchmark shows LLM agents recover at most 52.7% of relevant studies in meta-analysis pipelines due to failures in PI/ECO-based screening despite strong retrieval.
DiscoverPhysics: Benchmarking LLMs for Out-of-the-Box Scientific Thinking stat.ML · 2026-05-25 · unverdicted · none · ref 4 · internal anchor
DiscoverPhysics is a new benchmark with 22 on-demand N-body simulated worlds where LLM agents design experiments to infer non-standard physics, evaluated via held-out trajectory MSE and LLM-judged explanation quality.
LLM-driven design of physics-constrained constitutive models: two agents are better than one cs.LG · 2026-05-22 · unverdicted · none · ref 110 · internal anchor
A Creator-Inspector multi-agent LLM pipeline for constitutive artificial neural networks increases the rate of models satisfying all nine physical constraints to 100% or 56% depending on the LLM backbone.
Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good cs.CY · 2026-05-21 · unverdicted · none · ref 51 · internal anchor
Survey of 112 agentic AI for social good papers reveals moral-geographic asymmetry with 73% lacking geographic context (lowest for SDG 16) and only 25% reporting deployments.
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents cs.AI · 2026-05-21 · conditional · none · ref 9 · internal anchor
IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics cs.LG · 2026-05-17 · unverdicted · none · ref 31 · 2 links · internal anchor
FML-Bench shows a simple greedy hill-climber nearly matches tree search on dense-opportunity tasks while an adaptive agent that broadens search on stagnation outperforms six baselines across 18 tasks.
1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job? cs.LG · 2026-05-16 · unverdicted · none · ref 31 · internal anchor
Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.
The Scientific Contribution Graph: Automated Literature-based Technological Roadmapping at Scale cs.CL · 2026-05-14 · unverdicted · none · ref 22 · internal anchor
Builds a 2M-contribution graph from 230k papers with 12.5M prerequisite links and reports 0.48 MAP on temporal backtesting for predicting enabling technologies.
Graphs of Research: Citation Evolution Graphs as Supervision for Research Idea Generation cs.CL · 2026-05-14 · unverdicted · none · ref 21 · internal anchor
GoR extracts citation DAGs using position, frequency, predecessor links and time, then fine-tunes Qwen2.5-7B on 498 seed papers to generate ideas, claiming SOTA over gpt-4o baselines via LLM judges.
Test-Time Learning with an Evolving Library cs.LG · 2026-05-14 · unverdicted · none · ref 30 · internal anchor
EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without parameter updates or supervision.
Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents cs.AI · 2026-05-13 · unverdicted · none · ref 25 · internal anchor
A finite sheaf-theoretic framework ranks obstruction measures to identify when an AI agent's theory must deform within its language or extend to a new one, validated on a controlled transition benchmark.
Harnessing Agentic Evolution cs.AI · 2026-05-13 · unverdicted · none · ref 18 · internal anchor
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
ASIA: an Autonomous System Identification Agent cs.AI · 2026-05-11 · unverdicted · none · ref 11 · internal anchor
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents cs.AI · 2026-05-11 · unverdicted · none · ref 136 · internal anchor
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery cs.AI · 2026-05-11 · unverdicted · none · ref 17 · internal anchor
HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact density and completeness.
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI cs.LG · 2026-05-09 · unverdicted · none · ref 57 · 2 links · internal anchor
MLS-Bench is a benchmark with 140 tasks that evaluates AI agents on inventing generalizable and scalable ML methods, finding they lag human performance especially in insight-driven invention rather than tuning.
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale cs.LG · 2026-05-07 · conditional · none · ref 37 · 3 links · internal anchor
Starling, a multi-agent LLM system, extracts ~6.3 million nuanced structured records from PubMed across six tasks with reported error rates of 0.6-7.7%, lower than several curated databases.
AI co-mathematician: Accelerating mathematicians with agentic AI cs.AI · 2026-05-07 · unverdicted · none · ref 11 · 2 links · internal anchor
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents physics.flu-dyn · 2026-05-07 · conditional · none · ref 20 · 2 links · internal anchor
AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures missed by solver checks.
Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation cs.MA · 2026-05-06 · unverdicted · none · ref 21 · internal anchor
EIG represents research ideas as evolving graphs with nodes for claims and edges for relations, using a learned controller for edits and commits to produce higher-quality scientific proposals than text-only multi-agent baselines.
Agentic-imodels: Evolving agentic interpretability tools via autoresearch cs.AI · 2026-05-05 · unverdicted · none · ref 5 · internal anchor
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves cs.SE · 2026-04-29 · unverdicted · none · ref 22 · internal anchor
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
End-to-end autonomous scientific discovery on a real optical platform cs.AI · 2026-04-29 · unverdicted · none · ref 5 · internal anchor
An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.
Knows: Agent-Native Structured Research Representations cs.AI · 2026-04-19 · conditional · none · ref 7 · internal anchor
Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via a public hub.
ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents cs.CL · 2026-04-15 · unverdicted · none · ref 3 · internal anchor
ReviewGrounder decomposes review generation into rubric-guided drafting and tool-integrated grounding stages, outperforming larger baseline models on a new benchmark measuring alignment with human judgments and review quality.
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems cs.MA · 2026-04-13 · unverdicted · none · ref 21 · internal anchor
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
Camyla: Scaling Autonomous Research in Medical Image Segmentation cs.AI · 2026-04-12 · unverdicted · none · ref 1 · internal anchor
Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery cs.HC · 2026-04-09 · unverdicted · none · ref 8 · internal anchor
LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture cs.MS · 2026-04-08 · accept · none · ref 30 · internal anchor
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification cs.AI · 2026-04-05 · conditional · none · ref 8 · internal anchor
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions cs.AI · 2026-03-01 · conditional · none · ref 6 · internal anchor
A framework decomposes LLM papers into idea atoms, trains coherence and availability models over the resulting vocabulary, and samples atom combinations that are coherent yet unlikely under existing author communities.
DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation cs.IR · 2026-02-28 · accept · none · ref 14 · internal anchor
DiagramBank is a large-scale curated dataset of 89,422 schematic diagrams from scientific papers with rich metadata to support multimodal retrieval and exemplar-driven figure generation.
Kosmos: An AI Scientist for Autonomous Discovery cs.AI · 2025-11-04 · unverdicted · none · ref 2 · internal anchor
Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79.4% statement accuracy according to independent reviewers.
Evalet: Evaluating Large Language Models through Functional Fragmentation cs.HC · 2025-09-14 · conditional · none · ref 54 · internal anchor
Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers cs.CV · 2025-07-03 · unverdicted · none · ref 29 · internal anchor
Introduces the SciGA-145k dataset with intra-paper and cross-paper graphical abstract recommendation tasks plus the CAR evaluation metric.
Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation cs.HC · 2024-09-23 · unverdicted · none · ref 49 · internal anchor
Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity support than a baseline LLM in a user study with CS researchers.
Automated Design of Agentic Systems cs.AI · 2024-08-15 · conditional · none · ref 178 · internal anchor
Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across domains and models.
AutoTrainess: Teaching Language Models to Improve Language Models Autonomously cs.CL · 2026-06-30 · unverdicted · none · ref 19 · internal anchor
AutoTrainess exposes training operations via agent-computer interfaces and outperforms CLI-only baselines on PostTrainBench with scores of 26.94 vs 23.21 for GPT-5.4 and similar gains on other models.
One Reflection Is Not Enough: Self-Correcting Autonomous Research via Multi-Hypothesis Failure Attribution cs.AI · 2026-06-30 · unverdicted · none · ref 99 · internal anchor
SAGE with MHFA improves failure recovery in autonomous research agents, raising metrics-bearing outputs from 42% to 92% on a 12-topic benchmark versus single-reflection baselines.
Evidence-Informed LLM Beliefs for Continual Scientific Discovery cs.AI · 2026-06-28 · unverdicted · none · ref 24 · internal anchor
Evidence-informed belief updates make Bayesian surprise non-stationary in LLM hypothesis search, with embedding-based RAG identifying 37.5% spurious static surprisals and modified search (filtering plus diversity) yielding 30.62% higher accumulated non-stationary surprisal across five domains.
