The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
71 Pith papers cite this work. Polarity classification is still indexing.
abstract
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
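The abstract describes a fixed loop: generate an idea, write code, run experiments, write the paper, then score it with an automated reviewer. A minimal Python sketch of that loop, purely illustrative: every function body is a placeholder for an LLM or execution step, and none of these names are taken from the actual SakanaAI/AI-Scientist codebase.

```python
# Illustrative sketch of the pipeline described in the abstract.
# All names (Paper, generate_idea, run_pipeline) are hypothetical
# stand-ins, not the repository's real API; the constants are placeholders.
from dataclasses import dataclass, field

@dataclass
class Paper:
    idea: str
    code: str = ""
    results: dict = field(default_factory=dict)
    manuscript: str = ""
    review_score: float = 0.0
    accepted: bool = False

def generate_idea(topic: str) -> str:
    # Stand-in for an LLM call that proposes a novel research idea.
    return f"Adaptive variance schedules for {topic}"

def run_pipeline(topic: str, accept_threshold: float = 6.0) -> Paper:
    paper = Paper(idea=generate_idea(topic))
    paper.code = f"# experiment code for: {paper.idea}"       # LLM-written code
    paper.results = {"metric": 0.9}                           # executed experiment
    paper.manuscript = f"{paper.idea}: metric={paper.results['metric']}"
    paper.review_score = 7.0                                  # automated reviewer
    paper.accepted = paper.review_score >= accept_threshold
    return paper
```

In the real system each placeholder is a frontier-model call or a sandboxed experiment run, and the reviewer's score gates whether the loop iterates on the idea.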
citing papers explorer
-
AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
AI reviews for all 22,977 AAAI-26 papers were preferred over human reviews by authors and PC members for accuracy and suggestions, and outperformed baselines at spotting weaknesses.
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62% of realistic long-horizon CLI tasks.
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
-
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations
FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in benchmarks and generating research-grade results on unpublished problems.
-
Harnessing Agentic Evolution
AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
ASIA: an Autonomous System Identification Agent
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery
HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact density and completeness.
-
Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
-
Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation
EIG represents research ideas as evolving graphs with nodes for claims and edges for relations, using a learned controller for edits and commits to produce higher-quality scientific proposals than text-only multi-agent baselines.
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
-
End-to-end autonomous scientific discovery on a real optical platform
An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels × laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced agentic modeling.
-
Knows: Agent-Native Structured Research Representations
Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via a public hub.
-
ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
ReviewGrounder decomposes review generation into rubric-guided drafting and tool-integrated grounding stages, outperforming larger baseline models on a new benchmark measuring alignment with human judgments and review quality.
-
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
-
Camyla: Scaling Autonomous Research in Medical Image Segmentation
Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
-
Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery
LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
-
FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
-
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research
OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.
-
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.
-
Unlocking LLM Creativity in Science through Analogical Reasoning
Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
-
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
-
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
The ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
-
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
-
Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration
NIAgent uses code-centric multi-agent collaboration and hierarchical verification to build adaptive neuroimaging pipelines that outperform static baselines on ADHD-200 and ADNI data.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yielding improved benchmark performance with auditable traces.
-
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
-
Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
-
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Agentic biological AI systems like Biomni and K-Dense assist with dual-use tasks blocked by safeguards and gain performance uplift on WMDP proxies; BioVeil MATRIX is introduced as a 10-category taxonomy with 22 techniques to categorize and red-team AI-enabled biosecurity risks.
-
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-driven research.
-
AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments
AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, novelty, and insight than generic LLMs.
-
OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms
The OMEGA framework generates novel ML classifiers via meta-prompts and executable code that outperform scikit-learn baselines on 20 benchmark datasets.
-
How Researchers Navigate Accountability, Transparency, and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study
A think-aloud study reveals that AI tools in early research misrepresent uncertainty, obscure provenance, and create fragile trust, leading researchers to develop compensatory strategies to preserve scholarly judgment.
-
Rethinking Publication: A Certification Framework for AI-Enabled Research
A two-layer certification framework decouples knowledge validity from human authorship to accommodate AI-enabled research in existing publication systems.
-
A Scientific Human-Agent Reproduction Pipeline
SHARP is a human-AI collaboration pipeline for reproducing scientific analyses, demonstrated by recreating a jet classification task from a particle physics paper.
-
HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.
-
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
-
Toward Autonomous Long-Horizon Engineering for ML Research
AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on benchmarks and large lifts in production-style tasks.
-
ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation
ResearchEVO automates the discover-then-explain cycle by evolving algorithms via fitness-driven LLM co-evolution and generating grounded, anti-hallucination research papers through sentence-level RAG.
-
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperforming baselines.