Recognition: no theorem link
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Pith reviewed 2026-05-11 04:36 UTC · model grok-4.3
The pith
Frontier large language models can autonomously conduct full scientific research cycles using the AI Scientist framework, producing papers that pass automated conference-level review.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The AI Scientist is the first comprehensive framework for fully automatic scientific discovery. It allows frontier large language models to generate novel research ideas, write code, execute experiments, visualize results, write full scientific papers, and run a simulated review process. This can be repeated iteratively in an open-ended way. Applied to diffusion modeling, transformer-based language modeling, and learning dynamics, it produces papers at less than $15 each. The automated reviewer achieves near-human performance, and the system generates papers that exceed the acceptance threshold at a top machine learning conference.
What carries the argument
The AI Scientist framework, which sequences LLM capabilities to cover the entire research pipeline from idea generation to self-assessment.
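The sequencing idea can be sketched as a simple staged loop. This is our minimal illustration with hypothetical stage names and a stub `llm` callable; the actual open-sourced pipeline differs in detail:

```python
# Hypothetical stage names for illustration; the real pipeline
# (github.com/SakanaAI/AI-Scientist) is more elaborate.
STAGES = ["generate_idea", "write_code", "run_experiments",
          "plot_results", "write_paper", "review_paper"]

def run_pipeline(llm, seed_topic):
    """Thread one research artifact through every stage in order;
    each stage sees the accumulated outputs of all earlier stages."""
    artifacts = {"topic": seed_topic}
    for stage in STAGES:
        artifacts[stage] = llm(stage, artifacts)  # one LLM call per stage
    return artifacts
```

In principle the final `review_paper` output can seed the next iteration, which is the open-ended loop the paper describes.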
If this is right
- Open-ended iteration of the process can mimic the human scientific community in developing ideas.
- Generated papers can meet or exceed acceptance thresholds for top machine learning conferences, per the automated reviewer.
- The framework is versatile across distinct subfields of machine learning, including diffusion modeling, language modeling, and learning dynamics.
- Full research papers can be produced at low cost, under fifteen dollars each.
Where Pith is reading between the lines
- This could enable much higher throughput in exploring new ideas within AI research if the quality holds up under human scrutiny.
- Similar systems might eventually be adapted for discovery in other scientific fields, though domain-specific tools would be needed.
- Long-term use might create feedback loops where AI builds upon its own prior discoveries without human input.
Load-bearing premise
The automated reviewer provides an accurate assessment of paper quality comparable to human experts at top conferences.
What would settle it
Submitting the AI-generated papers to a real top-tier machine learning conference and observing whether human reviewers accept or reject them.
read the original abstract
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces The AI Scientist, a framework enabling frontier LLMs to autonomously generate novel research ideas, implement code and run experiments, visualize results, write full scientific papers, and evaluate them through a simulated review process. Applied to diffusion modeling, transformer language modeling, and learning dynamics, it claims to produce papers at under $15 each, with some exceeding top-ML-conference acceptance thresholds as scored by an internally designed automated reviewer that achieves near-human performance. The process is presented as repeatable for open-ended discovery, with code open-sourced.
Significance. If the central claims hold after addressing evaluation gaps, this would be a notable step toward fully automated scientific discovery in machine learning, demonstrating a closed-loop system for idea-to-paper generation at low cost and highlighting potential for iterative research. The open-sourcing of code strengthens reproducibility and invites community extensions, though the current lack of external validation limits immediate impact on the broader scientific process.
major comments (3)
- [Automated Reviewer section] The paper's core claim—that generated papers exceed conference acceptance thresholds—rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.
- [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.
- [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer and papers exceeding acceptance thresholds provides no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.
minor comments (2)
- [Figures and cost analysis] The workflow diagram and cost breakdowns would benefit from clearer labels and step-by-step explanations to improve readability for readers unfamiliar with the pipeline.
- [Methods description] Some terms (e.g., specific LLM sampling parameters) are referenced without initial definition or explicit values in the methods description.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which highlight important areas for improving the clarity and rigor of our evaluation. We address each major comment point by point below, indicating planned revisions to the manuscript where appropriate. Our goal is to strengthen the presentation of the automated reviewer and experimental results without altering the core contributions of the AI Scientist framework.
read point-by-point responses
-
Referee: [Automated Reviewer section] The paper's core claim—that generated papers exceed conference acceptance thresholds—rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.
Authors: We agree that the manuscript would benefit from greater transparency on the automated reviewer. The current version describes its design and validation at a high level but omits specific quantitative details. In the revision, we will expand the Automated Reviewer section to include: the composition of the training corpus (human-written papers from prior NeurIPS/ICML/ICLR proceedings), calibration details against historical acceptance rates, Pearson/Spearman correlations with human reviewer scores, and performance metrics on a held-out blind test set. We will also explicitly note that the reviewer was trained exclusively on human papers to mitigate self-reference concerns. These additions will be supported by new tables and figures. revision: yes
-
Referee: [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.
Authors: We acknowledge the value of ablations and additional metrics for isolating contributions. The manuscript focuses on end-to-end feasibility rather than component-wise analysis, but we agree this limits interpretability. In revision, we will add: (1) basic ablation results comparing full pipeline performance against versions with simplified idea generation or execution modules; (2) quantitative novelty metrics such as n-gram overlap and citation similarity with existing literature; and (3) reported error rates for code execution failures and experimental soundness (e.g., percentage of runs that completed without runtime errors). Expert originality ratings remain resource-intensive and will be noted as a limitation with discussion of future work. These changes will appear in an expanded Section 5. revision: partial
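The n-gram overlap metric the rebuttal proposes could be sketched as follows. This is our minimal illustration under stated assumptions (word-level n-grams over lowercased text), not the authors' implementation; the function names are ours:

```python
def ngrams(text, n=3):
    """Set of lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(candidate, corpus, n=3):
    """Fraction of the candidate's n-grams that appear anywhere in the
    corpus. Lower overlap suggests higher novelty (a crude proxy only)."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    seen = set()
    for doc in corpus:
        seen |= ngrams(doc, n)
    return len(cand & seen) / len(cand)
```

In practice one would compare a generated abstract against retrieved related-work abstracts; citation similarity would need a separate retrieval step.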
-
Referee: [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer and papers exceeding acceptance thresholds provides no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.
Authors: We will revise both the abstract and the results summary to include concrete supporting statistics. Specifically, we will report: inter-rater agreement (e.g., Cohen's kappa or correlation values) between the automated reviewer and human reviewers, the precise acceptance threshold calibrated from past conference data (e.g., average scores of accepted papers), and direct comparisons to real acceptance rates. These numbers will be added to the abstract and highlighted in the results section with references to the expanded validation details. revision: yes
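For the inter-rater agreement figure, Cohen's kappa on categorical accept/reject decisions is the standard quantity. A stdlib-only sketch (our illustration, not the paper's code):

```python
def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters' categorical decisions,
    corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b) > 0
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters label identically.
    p_obs = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_exp = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                for l in labels)
    return 1.0 if p_exp == 1.0 else (p_obs - p_exp) / (1 - p_exp)
```

A kappa of 1 is perfect agreement and 0 is chance level; a 'near-human performance' claim would need this reported for the automated reviewer against human reviewers on a shared paper set.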
Circularity Check
Central claim of exceeding conference thresholds rests on authors' self-designed automated reviewer
specific steps
-
fitted input called prediction
[Abstract]
"To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer."
The headline success metric ('exceed the acceptance threshold') is not an external or pre-existing benchmark but is computed by the authors' own reviewer, which they designed, validated, and then used to judge their system's outputs. This reduces the 'prediction' of research success to performance on an internally constructed evaluator, matching the fitted-input-called-prediction pattern.
full rationale
The paper's primary result—that The AI Scientist generates papers exceeding top-ML-conference acceptance thresholds—is defined entirely by scores from an automated reviewer the authors explicitly state they 'design and validate.' This creates a load-bearing self-referential evaluation loop. While the abstract claims near-human performance, no independent external benchmark (e.g., correlation with actual conference decisions on mixed human/LLM papers) is exhibited in the provided text. Other components (idea generation, code execution, paper writing) do not reduce to this loop, so the circularity is partial and confined to the success metric. This warrants a moderate score rather than 8-10, as the framework itself is not definitionally tautological.
Axiom & Free-Parameter Ledger
free parameters (2)
- LLM sampling parameters and model choice
- Automated reviewer acceptance threshold
axioms (1)
- domain assumption: Frontier LLMs can reliably generate novel, implementable research ideas and produce correct experimental code without human intervention
invented entities (1)
-
Automated reviewer
no independent evidence
Forward citations
Cited by 60 Pith papers
-
AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
-
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations
FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
ASIA: an Autonomous System Identification Agent
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery
HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...
-
Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
-
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.
-
Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation
EIG represents research ideas as evolving graphs with nodes for claims and edges for relations, using a learned controller for edits and commits to produce higher-quality scientific proposals than text-only multi-agen...
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
-
End-to-end autonomous scientific discovery on a real optical platform
An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
-
Knows: Agent-Native Structured Research Representations
Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via ...
-
ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
ReviewGrounder decomposes review generation into rubric-guided drafting and tool-integrated grounding stages, outperforming larger baseline models on a new benchmark measuring alignment with human judgments and review...
-
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
-
Camyla: Scaling Autonomous Research in Medical Image Segmentation
Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
-
Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery
LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
-
FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
-
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.
-
Unlocking LLM Creativity in Science through Analogical Reasoning
Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
-
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
-
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
-
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
-
Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration
NIAgent uses code-centric multi-agent collaboration and hierarchical verification to build adaptive neuroimaging pipelines that outperform static baselines on ADHD-200 and ADNI data.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...
-
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
-
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
An integrated AI agent framework for CFD uses vision-based physics gates to autonomously discover a Spalart-Allmaras runtime correction that cuts lower-wall skin-friction error by 7.89% versus DNS on the periodic hill...
-
Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.
-
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Agentic biological AI systems like Biomni and K-Dense assist with dual-use tasks blocked by safeguards and gain performance uplift on WMDP proxies; BioVeil MATRIX is introduced as a 10-category taxonomy with 22 techni...
-
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-d...
-
AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments
AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, no...
-
OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms
OMEGA framework generates novel ML classifiers via meta-prompts and executable code that outperform scikit-learn baselines on 20 benchmark datasets.
-
TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment
TSAssistant is a human-in-the-loop multi-agent system that generates citable, evidence-grounded sections for target safety assessment reports by coordinating specialized subagents with interactive user refinement.
-
How Researchers Navigate Accountability, Transparency, and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study
A think-aloud study reveals that AI tools in early research misrepresent uncertainty, obscure provenance, and create fragile trust, leading researchers to develop compensatory strategies to preserve scholarly judgment.
-
Rethinking Publication: A Certification Framework for AI-Enabled Research
A two-layer certification framework decouples knowledge validity from human authorship to accommodate AI-enabled research in existing publication systems.
-
Rethinking Publication: A Certification Framework for AI-Enabled Research
The paper introduces a certification framework that grades AI research contributions into Categories A, B, and C based on pipeline reach at submission time and adds benchmark slots for fully automated work.
-
A Scientific Human-Agent Reproduction Pipeline
SHARP is a human-AI collaboration pipeline for reproducing scientific analyses, demonstrated by recreating a jet classification task from a particle physics paper.
-
HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.
-
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
-
Toward Autonomous Long-Horizon Engineering for ML Research
AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation
ResearchEVO automates the discover-then-explain cycle by evolving algorithms via fitness-driven LLM co-evolution and generating grounded, anti-hallucination research papers through sentence-level RAG.
-
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
Read, Grep, and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation
LLM research ideation benefits from exposure to diverse mechanisms across domains but does not yet exploit the specific semantic reasons for cross-domain seed retrieval.
-
Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI
AI lowers the cost of generating plausible scientific artifacts without lowering verification costs, so the paper proposes blueprints as typed graph components that decompose claims, evidence, and assumptions to enabl...
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
From Experimental Limits to Physical Insight: A Retrieval-Augmented Multi-Agent Framework for Interpreting Searches Beyond the Standard Model
HEP-CoPilot is a new multi-agent retrieval framework that retrieves, reconstructs, and compares experimental limits from HEP literature and HEPData to support interpretation of beyond-Standard-Model searches.
-
TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment
TSAssistant is a modular, human-in-the-loop multi-agent system that generates citable, section-specific drafts for target safety assessment reports by coordinating specialized sub-agents with biomedical data sources a...
-
pAI/MSc: ML Theory Research with Humans on the Loop
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...
Reference graph
Works this paper leans on
-
[1]
Meta-learning curiosity algorithms
Ferran Alet, Martin F Schneider, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Meta-learning curiosity algorithms. arXiv preprint arXiv:2003.05325, 2020
-
[2]
Artificial intelligence in scientific writing: a friend or a foe?
Signe Altmäe, Alberto Sola-Leyva, and Andres Salumets. Artificial intelligence in scientific writing: a friend or a foe? Reproductive BioMedicine Online, 47(1): 3-9, 2023
work page 2023
-
[3]
Model card and evaluations for claude models, 2023
Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf
work page 2023
-
[4]
The claude 3 model family: Opus, sonnet, haiku, 2024
Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
work page 2024
-
[5]
Cloud labs: where robots do the research
Carrie Arnold. Cloud labs: where robots do the research. Nature, 606(7914): 612-613, 2022
work page 2022
-
[6]
Researchagent: Iterative research idea generation over scientific literature with large language models
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models, 2024. URL https://arxiv.org/abs/2404.07738
-
[7]
Iclr2022-openreviewdata
Federico Berto. Iclr2022-openreviewdata, 2024. URL https://github.com/fedebotu/ICLR2022-OpenReviewData
work page 2024
-
[8]
The NeurIPS 2021 consistency experiment
Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan. The NeurIPS 2021 consistency experiment. Neural Information Processing Systems blog post, 2021. URL https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment
-
[9]
Quality-diversity through ai feedback
Herbie Bradley, Andrew Dai, Hannah Benita Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Gregory Schott, and Joel Lehman. Quality-diversity through ai feedback. In The Twelfth International Conference on Learning Representations, 2024
-
[10]
Minimal criterion coevolution: a new approach to open-ended search
Jonathan C Brant and Kenneth O Stanley. Minimal criterion coevolution: a new approach to open-ended search. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 67--74, 2017
-
[11]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
-
[12]
Dendral and meta-dendral: Their applications dimension
Bruce G Buchanan and Edward A Feigenbaum. Dendral and meta-dendral: Their applications dimension. In Readings in artificial intelligence, pages 313--322. Elsevier, 1981
-
[13]
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023. URL https://arxiv.org/abs/2312.09390
-
[14]
What is this thing called science?
Alan Chalmers. What is this thing called science? McGraw-Hill Education (UK), 2013
-
[15]
Evoprompting: Language models for code-level neural architecture search
Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36, 2024a
-
[16]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
-
[17]
Symbolic discovery of optimization algorithms
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36, 2024b
-
[18]
Jeff Clune. Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019
-
[19]
MARG: Multi-agent review generation for scientific papers
Mike D'Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers, 2024. URL https://arxiv.org/abs/2401.04259
-
[20]
J. Dewey. How We Think. D.C. Heath & Company, 1910. ISBN 9781519501868. URL https://books.google.co.uk/books?id=WF0AAAAAMAAJ
-
[21]
Quality diversity through human feedback: Towards open-ended diversity-driven optimization
Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, and Joel Lehman. Quality diversity through human feedback: Towards open-ended diversity-driven optimization. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=9zlZuAAb08
-
[22]
Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024
Marius-Constantin Dinu, Claudiu Leoveanu-Condrei, Markus Holzleitner, Werner Zellinger, and Sepp Hochreiter. Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024. URL https://arxiv.org/abs/2402.00854
-
[23]
Art and the science of generative ai
Ziv Epstein, Aaron Hertzmann, Investigators of Human Creativity, Memo Akten, Hany Farid, Jessica Fjeld, Morgan R Frank, Matthew Groh, Laura Herman, Neil Leach, et al. Art and the science of generative AI. Science, 380(6650): 1110--1111, 2023
-
[24]
Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024. URL https://arxiv.org/abs/2405.15568
-
[25]
Integrating quantitative and qualitative discovery: the abacus system
Brian C Falkenhainer and Ryszard S Michalski. Integrating quantitative and qualitative discovery: the abacus system. Machine Learning, 1: 367--401, 1986
-
[26]
Discovering faster matrix multiplication algorithms with reinforcement learning
Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930): 47--53, 2022
-
[27]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120): 1--39, 2022. URL http://jmlr.org/papers/v23/21-0998.html
-
[28]
Suzanne Fricke. Semantic Scholar. Journal of the Medical Library Association: JMLA, 106(1): 145, 2018
-
[29]
-
[30]
Probabilistic machine learning and artificial intelligence
Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553): 452--459, 2015
-
[31]
Ideas are dimes a dozen: Large language models for idea generation in innovation
Karan Girotra, Lennart Meincke, Christian Terwiesch, and Karl T Ulrich. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071, 2023
-
[32]
Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249--256. JMLR Workshop and Conference Proceedings, 2010
-
[33]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceeding...
-
[34]
Gemini: A family of highly capable multimodal models, 2023
Google DeepMind Gemini Team. Gemini: A family of highly capable multimodal models, 2023
-
[35]
DiffiT: Diffusion vision transformers for image generation
Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation, 2024. URL https://arxiv.org/abs/2312.02139
-
[36]
Simulating 500 million years of evolution with a language model
Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, 2024
-
[37]
AutoML: A survey of the state-of-the-art
Xin He, Kaiyong Zhao, and Xiaowen Chu. AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212: 106622, 2021
-
[38]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840--6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
-
[39]
Jia-Bin Huang. Deep paper gestalt. arXiv preprint arXiv:1812.08775, 2018
-
[40]
MLAgentBench: Evaluating language agents on machine learning experimentation
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024
-
[41]
Automated machine learning: methods, systems, challenges
Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019
-
[42]
Marcus Hutter. The Hutter Prize, 2006. URL http://prize.hutter1.net
-
[43]
Autonomous llm-driven research from data to human-verifiable research papers, 2024
Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404.17605
-
[44]
The principles of science: A treatise on logic and scientific method
William Stanley Jevons. The principles of science: A treatise on logic and scientific method. Macmillan and Company, 1877
-
[45]
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...
-
[46]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770
-
[47]
Highly accurate protein structure prediction with AlphaFold
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873): 583--589, 2021
-
[48]
The unreasonable effectiveness of recurrent neural networks, 2015
Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks, 2015. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/
-
[49]
Andrej Karpathy. NanoGPT, 2022. URL https://github.com/karpathy/nanoGPT
-
[50]
A survey of research on cloud robotics and automation
Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Transactions on Automation Science and Engineering, 12(2): 398--409, 2015
-
[51]
Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014
-
[52]
Improving generalization in meta reinforcement learning using learned objectives
Louis Kirsch, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. arXiv preprint arXiv:1910.04098, 2019
-
[53]
Discovering attention-based genetic algorithms via meta-black-box optimization
Robert Lange, Tom Schaul, Yutian Chen, Chris Lu, Tom Zahavy, Valentin Dalibard, and Sebastian Flennerhag. Discovering attention-based genetic algorithms via meta-black-box optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 929--937, 2023a
-
[54]
Discovering evolution strategies via meta-black-box optimization
Robert Lange, Tom Schaul, Yutian Chen, Tom Zahavy, Valentin Dalibard, Chris Lu, Satinder Singh, and Sebastian Flennerhag. Discovering evolution strategies via meta-black-box optimization. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pages 29--30, 2023b
-
[55]
Large language models as evolution strategies
Robert Tjarko Lange, Yingtao Tian, and Yujin Tang. Large language models as evolution strategies. arXiv preprint arXiv:2402.18381, 2024
-
[56]
Scientific discovery: Computational explorations of the creative processes
Pat Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987
-
[57]
Integrated systems for computational scientific discovery
Pat Langley. Integrated systems for computational scientific discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 22598--22606, 2024
-
[58]
Exploiting open-endedness to solve problems through the search for novelty
Joel Lehman, Kenneth O Stanley, et al. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pages 329--336, 2008
-
[59]
Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. Artificial Life, 26(2): 274--306, 2020
-
[60]
-
[61]
Evolution through large models
Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331--366. Springer, 2023
-
[62]
Automated theory formation in mathematics
Douglas B Lenat. Automated theory formation in mathematics. In IJCAI, volume 77, pages 833--842, 1977
-
[63]
Why AM and EURISKO appear to work
Douglas B Lenat and John Seely Brown. Why AM and EURISKO appear to work. Artificial Intelligence, 23(3): 269--294, 1984
-
[64]
Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, page AIoa2400196, 2024
-
[65]
Large language models as in-context ai generators for quality-diversity
Bryan Lim, Manon Flageat, and Antoine Cully. Large language models as in-context ai generators for quality-diversity. arXiv preprint arXiv:2404.15794, 2024
-
[66]
Llama Team. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783
-
[67]
Discovered policy optimisation
Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35: 16455--16468, 2022a
-
[68]
Discovering preference optimization algorithms with and for large language models
Chris Lu, Samuel Holt, Claudio Fanconi, Alex J Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering preference optimization algorithms with and for large language models. arXiv preprint arXiv:2406.08414, 2024a
-
[69]
Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, and Stephen J. Roberts. Revisiting design choices in offline model based reinforcement learning. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=zz9hXVhf40
-
[70]
Intelligent Go-Explore: Standing on the shoulders of giant foundation models
Cong Lu, Shengran Hu, and Jeff Clune. Intelligent Go-Explore: Standing on the shoulders of giant foundation models, 2024b. URL https://arxiv.org/abs/2405.15143
-
[71]
Eureka: Human-level reward design via coding large language models
Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023
-
[72]
Matt Mahoney. About the test data, 2011. URL http://mattmahoney.net/dc/textdata.html
-
[73]
Discoverybench: Towards data-driven discovery with large language models, 2024
Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models, 2024. URL https://arxiv.org/abs/2407.01725
-
[74]
Daniel May. grokking, 2022. URL https://github.com/danielmamay/grokking
-
[75]
Scaling deep learning for materials discovery
Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990): 80--85, 2023
-
[76]
VeLO: Training versatile learned optimizers by scaling up
Luke Metz, James Harrison, C Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, et al. VeLO: Training versatile learned optimizers by scaling up. arXiv preprint arXiv:2211.09760, 2022
-
[77]
A robust approach to numeric discovery
Bernd Nordhausen and Pat Langley. A robust approach to numeric discovery. In Machine learning proceedings 1990, pages 411--418. Elsevier, 1990
-
[78]
In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022
-
[79]
-
[80]
Tanel Pärnamaa. tiny-diffusion, 2023. URL https://github.com/tanelp/tiny-diffusion