The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu; Cong Lu; David Ha; Jakob Foerster; Jeff Clune; Robert Tjarko Lange

arxiv: 2408.06292 · v3 · submitted 2024-08-12 · 💻 cs.AI · cs.CL · cs.LG


Pith reviewed 2026-05-11 04:36 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG

keywords automated scientific discovery · large language models · AI research agents · machine learning · autonomous paper generation · self-review process · open-ended discovery


The pith

Frontier large language models can autonomously conduct full scientific research cycles using the AI Scientist framework, producing papers that pass automated conference-level review.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes The AI Scientist, a framework that enables large language models to independently manage the complete scientific process. The system generates research ideas, implements them through code and experiments, creates visualizations, writes full papers, and performs its own review evaluation. It is tested on three machine learning subfields with each paper costing less than fifteen dollars. The authors also create an automated reviewer that scores papers near human levels, and some AI-generated papers exceed the acceptance bar according to this reviewer. This represents a step toward AI agents driving open-ended discovery in machine learning research.

Core claim

The AI Scientist is the first comprehensive framework for fully automatic scientific discovery. It allows frontier large language models to generate novel research ideas, write code, execute experiments, visualize results, write full scientific papers, and run a simulated review process. This can be repeated iteratively in an open-ended way. Applied to diffusion modeling, transformer-based language modeling, and learning dynamics, it produces papers at less than $15 each. The automated reviewer achieves near-human performance, and the system generates papers that exceed the acceptance threshold at a top machine learning conference.

What carries the argument

The AI Scientist framework, which sequences LLM capabilities to cover the entire research pipeline from idea generation to self-assessment.
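
The sequencing described here can be pictured as a chain of stages that each read and extend a shared record. The following is only a minimal sketch under stated assumptions: the `Artifact` record, the stage functions, and `run_cycle` are hypothetical names invented for illustration, the stage bodies are stubs rather than LLM calls, and none of this is taken from the paper's released code.

```python
from dataclasses import dataclass, replace
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class Artifact:
    """Hypothetical record of what one research cycle has produced so far."""
    idea: str = ""
    results: str = ""
    figures: Tuple[str, ...] = ()
    paper: str = ""
    review_score: float = 0.0
    log: Tuple[str, ...] = ()  # names of the stages that have run, in order


Stage = Callable[[Artifact], Artifact]


# Each stage is a stub standing in for an LLM-driven step (assumption).
def generate_idea(a: Artifact) -> Artifact:
    return replace(a, idea="placeholder idea", log=a.log + ("idea",))


def run_experiments(a: Artifact) -> Artifact:
    return replace(a, results=f"results for: {a.idea}", log=a.log + ("experiments",))


def make_figures(a: Artifact) -> Artifact:
    return replace(a, figures=("fig1.png",), log=a.log + ("figures",))


def write_paper(a: Artifact) -> Artifact:
    text = f"Paper on {a.idea} with {len(a.figures)} figure(s)"
    return replace(a, paper=text, log=a.log + ("paper",))


def review(a: Artifact) -> Artifact:
    # A real reviewer would assign a score; the stub leaves it at 0.0.
    return replace(a, review_score=0.0, log=a.log + ("review",))


PIPELINE: List[Stage] = [generate_idea, run_experiments, make_figures, write_paper, review]


def run_cycle(stages: List[Stage], seed: Artifact = Artifact()) -> Artifact:
    """Apply the stages in order, feeding each stage the previous stage's output."""
    art = seed
    for stage in stages:
        art = stage(art)
    return art


if __name__ == "__main__":
    final = run_cycle(PIPELINE)
    print(final.log)
    print(final.paper)
```

Keeping the record immutable and passing it from stage to stage makes the order of operations explicit, which is the property the summary above attributes to the framework; an open-ended loop would call `run_cycle` repeatedly with earlier outputs as context.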

If this is right

Open-ended iteration of the process can mimic the human scientific community in developing ideas.
The generated papers can meet or exceed acceptance thresholds for top machine learning conferences per the automated reviewer.
The framework is versatile across distinct subfields of machine learning, including diffusion modeling, language modeling, and learning dynamics.
Full research papers can be produced for under fifteen dollars each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

This could enable much higher throughput in exploring new ideas within AI research if the quality holds up under human scrutiny.
Similar systems might eventually be adapted for discovery in other scientific fields, though domain-specific tools would be needed.
Long-term use might create feedback loops where AI builds upon its own prior discoveries without human input.

Load-bearing premise

The automated reviewer provides an accurate assessment of paper quality comparable to human experts at top conferences.

What would settle it

Having the AI-generated papers submitted to a real top-tier machine learning conference and observing whether they are accepted or rejected based on human reviews.

Original abstract

One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper closes an end-to-end LLM loop for idea-to-paper research in ML, but its main results depend on an unvalidated internal reviewer.


The core new piece is the full closed loop: an LLM agent that proposes ideas, writes and runs code, plots results, drafts a paper, and then subjects it to a simulated review, all repeatable in an open-ended way. They show this working across diffusion models, language modeling, and learning dynamics, with each paper costing under $15 and the code released publicly. That is a concrete step beyond single-task tools like code assistants or literature summarizers, and the open-sourcing lets others test the pipeline directly. The automated reviewer is presented as reaching near-human scoring, which is the part that lets them claim some outputs clear a top-conference bar. The soft spot is exactly there. The reviewer was designed and tuned inside the same project, with no reported blind tests against real conference decisions or checks for whether it favors LLM-style outputs. Without those, the acceptance-threshold claim stays circular and hard to interpret. There are also no ablations on idea novelty, failure modes, or how much the results depend on the base model choice. The work is aimed at researchers building agentic systems for science rather than at domain experts looking for new ML findings. It is coherent on its own terms and shows clear engineering effort, so it deserves a serious referee to pressure-test the evaluation setup and see whether the loop holds up under external scrutiny.

Referee Report

3 major / 2 minor

Summary. The paper introduces The AI Scientist, a framework enabling frontier LLMs to autonomously generate novel research ideas, implement code and run experiments, visualize results, write full scientific papers, and evaluate them through a simulated review process. Applied to diffusion modeling, transformer language modeling, and learning dynamics, it claims to produce papers at under $15 each, with some exceeding top-ML-conference acceptance thresholds as scored by an internally designed automated reviewer that achieves near-human performance. The process is presented as repeatable for open-ended discovery, with code open-sourced.

Significance. If the central claims hold after addressing evaluation gaps, this would be a notable step toward fully automated scientific discovery in machine learning, demonstrating a closed-loop system for idea-to-paper generation at low cost and highlighting potential for iterative research. The open-sourcing of code strengthens reproducibility and invites community extensions, though the current lack of external validation limits immediate impact on the broader scientific process.

major comments (3)

[Automated Reviewer section] The paper's core claim, that generated papers exceed conference acceptance thresholds, rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.
[Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.
[Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer and papers exceeding acceptance thresholds provides no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.
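
The "literature overlap" metric requested in the second major comment is not defined in the report. One simple proxy, sketched below, is the Jaccard overlap between the word n-gram sets of a generated idea and a prior abstract. The function names, the whitespace tokenization, and the choice of n = 3 are assumptions made for illustration; this is not a metric the paper is known to use.

```python
from typing import Set, Tuple


def word_ngrams(text: str, n: int = 3) -> Set[Tuple[str, ...]]:
    """Return the set of lowercase word n-grams in text (whitespace tokenization)."""
    tokens = text.lower().split()
    # For texts shorter than n tokens the range is empty, so the set is empty.
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def jaccard_overlap(a: str, b: str, n: int = 3) -> float:
    """Jaccard similarity of the two texts' n-gram sets, in [0, 1]."""
    grams_a, grams_b = word_ngrams(a, n), word_ngrams(b, n)
    union = grams_a | grams_b
    if not union:
        return 0.0  # neither text is long enough to form an n-gram
    return len(grams_a & grams_b) / len(union)


if __name__ == "__main__":
    # Toy strings invented for illustration, not taken from any paper.
    idea = "adaptive noise schedules for low dimensional diffusion models"
    prior = "learned noise schedules for low dimensional diffusion models"
    print(jaccard_overlap(idea, prior))
```

A high overlap flags near-duplicates of existing text, but a low overlap does not establish novelty, so a score like this would complement rather than replace the expert originality ratings the comment also mentions.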

minor comments (2)

[Figures and cost analysis] The workflow diagram and cost breakdowns would benefit from clearer labels and step-by-step explanations to improve readability for readers unfamiliar with the pipeline.
[Methods description] Some terms (e.g., specific LLM sampling parameters) are referenced without initial definition or explicit values in the methods description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important areas for improving the clarity and rigor of our evaluation. We address each major comment point by point below, indicating planned revisions to the manuscript where appropriate. Our goal is to strengthen the presentation of the automated reviewer and experimental results without altering the core contributions of the AI Scientist framework.


Referee: [Automated Reviewer section] The paper's core claim, that generated papers exceed conference acceptance thresholds, rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.

Authors: We agree that the manuscript would benefit from greater transparency on the automated reviewer. The current version describes its design and validation at a high level but omits specific quantitative details. In the revision, we will expand the Automated Reviewer section to include: the composition of the training corpus (human-written papers from prior NeurIPS/ICML/ICLR proceedings), calibration details against historical acceptance rates, Pearson/Spearman correlations with human reviewer scores, and performance metrics on a held-out blind test set. We will also explicitly note that the reviewer was trained exclusively on human papers to mitigate self-reference concerns. These additions will be supported by new tables and figures.
revision: yes

Referee: [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.

Authors: We acknowledge the value of ablations and additional metrics for isolating contributions. The manuscript focuses on end-to-end feasibility rather than component-wise analysis, but we agree this limits interpretability. In revision, we will add: (1) basic ablation results comparing full pipeline performance against versions with simplified idea generation or execution modules; (2) quantitative novelty metrics such as n-gram overlap and citation similarity with existing literature; and (3) reported error rates for code execution failures and experimental soundness (e.g., percentage of runs that completed without runtime errors). Expert originality ratings remain resource-intensive and will be noted as a limitation with discussion of future work. These changes will appear in an expanded Section 5.
revision: partial

Referee: [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer and papers exceeding acceptance thresholds provides no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.

Authors: We will revise both the abstract and the results summary to include concrete supporting statistics. Specifically, we will report: inter-rater agreement (e.g., Cohen's kappa or correlation values) between the automated reviewer and human reviewers, the precise acceptance threshold calibrated from past conference data (e.g., average scores of accepted papers), and direct comparisons to real acceptance rates. These numbers will be added to the abstract and highlighted in the results section with references to the expanded validation details.
revision: yes
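
The agreement statistics named in the responses above (Pearson and Spearman correlation, Cohen's kappa) are standard and can be computed with the standard library alone. The sketch below is illustrative only: the function names are hypothetical, the inputs are assumed to be paired score lists or paired accept/reject labels for the same papers, and it says nothing about what values the authors would actually obtain.

```python
from collections import Counter
from math import sqrt
from typing import Hashable, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; raises ZeroDivisionError if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the average ranks."""
    return pearson(average_ranks(x), average_ranks(y))


def cohens_kappa(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Chance-corrected agreement between two raters' categorical labels.

    Raises ZeroDivisionError when chance agreement is exactly 1.
    """
    n = len(a)
    observed = sum(1 for p, q in zip(a, b) if p == q) / n
    count_a, count_b = Counter(a), Counter(b)
    expected = sum(count_a[label] * count_b[label] for label in count_a) / (n * n)
    return (observed - expected) / (1 - expected)


if __name__ == "__main__":
    # Toy numbers invented for illustration, not results from the paper.
    human_scores = [3, 5, 6, 4, 7]
    auto_scores = [4, 5, 7, 3, 8]
    print(pearson(human_scores, auto_scores), spearman(human_scores, auto_scores))
    print(cohens_kappa(["accept", "reject", "accept", "reject"],
                       ["accept", "reject", "reject", "reject"]))
```

Correlation measures whether the two reviewers order papers similarly, while kappa measures agreement on the final accept or reject decision after discounting chance, so the two can diverge and both are worth reporting.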

Circularity Check

1 step flagged

Central claim of exceeding conference thresholds rests on authors' self-designed automated reviewer

specific steps

fitted input called prediction [Abstract]
"To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer."

The headline success metric ('exceed the acceptance threshold') is not an external or pre-existing benchmark but is computed by the authors' own reviewer, which they designed, validated, and then used to judge their system's outputs. This reduces the 'prediction' of research success to performance on an internally constructed evaluator, matching the fitted-input-called-prediction pattern.

full rationale

The paper's primary result—that The AI Scientist generates papers exceeding top-ML-conference acceptance thresholds—is defined entirely by scores from an automated reviewer the authors explicitly state they 'design and validate.' This creates a load-bearing self-referential evaluation loop. While the abstract claims near-human performance, no independent external benchmark (e.g., correlation with actual conference decisions on mixed human/LLM papers) is exhibited in the provided text. Other components (idea generation, code execution, paper writing) do not reduce to this loop, so the circularity is partial and confined to the success metric. This warrants a moderate score rather than 8-10, as the framework itself is not definitionally tautological.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework rests on the unproven assumption that current frontier LLMs possess sufficient capability for open-ended research tasks and introduces an internally validated reviewer whose independence from the generated content is not externally demonstrated.

free parameters (2)

LLM sampling parameters and model choice
Specific temperature, top-p, and model versions used for idea generation and code writing are not detailed in the abstract but are central to reproducibility.
Automated reviewer acceptance threshold
The numerical cutoff used to declare papers exceed top-conference standards is not specified.

axioms (1)

domain assumption: Frontier LLMs can reliably generate novel, implementable research ideas and produce correct experimental code without human intervention
Invoked throughout the description of the AI Scientist pipeline.

invented entities (1)

Automated reviewer (no independent evidence)
purpose: To score generated papers and determine acceptance without human input
New component introduced and validated by the authors themselves.



Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
cs.AI 2026-04 conditional novelty 9.0

AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
cs.CL 2026-05 unverdicted novelty 8.0

Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
cs.CL 2026-05 unverdicted novelty 8.0

A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
cs.AI 2026-04 accept novelty 8.0

AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
The Last Human-Written Paper: Agent-Native Research Artifacts
cs.LG 2026-04 unverdicted novelty 8.0

Introduces ARA as a four-layer machine-executable research package and reports benchmark gains in agent QA accuracy and reproduction success.
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations
physics.chem-ph 2026-04 conditional novelty 8.0

FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...
Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems
quant-ph 2025-10 accept novelty 8.0 full

A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructio...
LLM-driven design of physics-constrained constitutive models: two agents are better than one
cs.LG 2026-05 unverdicted novelty 7.0

A Creator-Inspector multi-agent LLM pipeline for constitutive artificial neural networks increases the rate of models satisfying all nine physical constraints to 100% or 56% depending on the LLM backbone.
Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good
cs.CY 2026-05 unverdicted novelty 7.0

Survey of 112 agentic AI for social good papers reveals moral-geographic asymmetry with 73% lacking geographic context (lowest for SDG 16) and only 25% reporting deployments.
IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents
cs.AI 2026-05 conditional novelty 7.0

IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.
1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?
cs.LG 2026-05 unverdicted novelty 7.0

Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.
Test-Time Learning with an Evolving Library
cs.LG 2026-05 unverdicted novelty 7.0

EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...
Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents
cs.AI 2026-05 unverdicted novelty 7.0

A finite sheaf-theoretic framework ranks obstruction measures to identify when an AI agent's theory must deform within its language or extend to a new one, validated on a controlled transition benchmark.
Harnessing Agentic Evolution
cs.AI 2026-05 unverdicted novelty 7.0

AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive
cs.AI 2026-05 unverdicted novelty 7.0

AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
ASIA: an Autonomous System Identification Agent
cs.AI 2026-05 unverdicted novelty 7.0

ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
cs.AI 2026-05 unverdicted novelty 7.0

PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery
cs.AI 2026-05 unverdicted novelty 7.0

HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...
Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
cs.LG 2026-05 conditional novelty 7.0

Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
AI co-mathematician: Accelerating mathematicians with agentic AI
cs.AI 2026-05 unverdicted novelty 7.0

An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
physics.flu-dyn 2026-05 conditional novelty 7.0

AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures...
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
physics.flu-dyn 2026-05 conditional novelty 7.0

AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.
Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation
cs.MA 2026-05 unverdicted novelty 7.0

EIG represents research ideas as evolving graphs with nodes for claims and edges for relations, using a learned controller for edits and commits to produce higher-quality scientific proposals than text-only multi-agen...
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
cs.SE 2026-04 unverdicted novelty 7.0

Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
End-to-end autonomous scientific discovery on a real optical platform
cs.AI 2026-04 unverdicted novelty 7.0

An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.
The Last Human-Written Paper: Agent-Native Research Artifacts
cs.LG 2026-04 conditional novelty 7.0

The authors introduce Agent-Native Research Artifacts (ARA) as executable research packages with four layers to reduce information loss in papers for AI agents, showing benchmark gains in question-answering and reproduction.
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
cs.AI 2026-04 unverdicted novelty 7.0

Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
Knows: Agent-Native Structured Research Representations
cs.AI 2026-04 conditional novelty 7.0

Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via ...
ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
cs.CL 2026-04 unverdicted novelty 7.0

ReviewGrounder decomposes review generation into rubric-guided drafting and tool-integrated grounding stages, outperforming larger baseline models on a new benchmark measuring alignment with human judgments and review...
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
cs.MA 2026-04 unverdicted novelty 7.0

VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
Camyla: Scaling Autonomous Research in Medical Image Segmentation
cs.AI 2026-04 unverdicted novelty 7.0

Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery
cs.HC 2026-04 unverdicted novelty 7.0

LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
cs.MS 2026-04 accept novelty 7.0

k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
cs.CL 2026-04 unverdicted novelty 7.0

AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
cs.AI 2026-04 conditional novelty 7.0

FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions
cs.AI 2026-03 conditional novelty 7.0

A framework decomposes LLM papers into idea atoms, trains coherence and availability models over the resulting vocabulary, and samples atom combinations that are coherent yet unlikely under existing author communities.
DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation
cs.IR 2026-02 accept novelty 7.0

DiagramBank is a large-scale curated dataset of 89,422 schematic diagrams from scientific papers with rich metadata to support multimodal retrieval and exemplar-driven figure generation.
Kosmos: An AI Scientist for Autonomous Discovery
cs.AI 2025-11 unverdicted novelty 7.0

Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79....
Evalet: Evaluating Large Language Models through Functional Fragmentation
cs.HC 2025-09 conditional novelty 7.0

Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.
IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research
cs.CL 2025-07 unverdicted novelty 7.0

IDRBench is presented as the first benchmark framework consisting of datasets and three evaluation tasks to measure LLMs' ability to perform interdisciplinary research.
SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers
cs.CV 2025-07 unverdicted novelty 7.0

Introduces the SciGA-145k dataset with intra-paper and cross-paper graphical abstract recommendation tasks plus the CAR evaluation metric.
Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation
cs.HC 2024-09 unverdicted novelty 7.0

Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity suppor...
Automated Design of Agentic Systems
cs.AI 2024-08 conditional novelty 7.0

Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...
Mem-$\pi$: Adaptive Memory through Learning When and What to Generate
cs.CL 2026-05 unverdicted novelty 6.0

Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.
LLM Agents Make Collective Belief Dynamics Programmable: Challenges and Research Directions
cs.MA 2026-05 unverdicted novelty 6.0

LLM agents make collective belief dynamics programmable, with simulations showing coordinated agents induce stable belief shifts, and four structural properties that complicate detection and defense.
How Far Are We From True Auto-Research?
cs.AI 2026-05 unverdicted novelty 6.0

ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.
STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery
cs.AI 2026-05 unverdicted novelty 6.0

STRIDE is a self-reflective agent framework that improves accuracy, OOD robustness, and structural recovery in LLM-based symbolic regression by integrating generation, evaluation, repair, and diversity-preserving memory.
FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics
cs.LG 2026-05 accept novelty 6.0

FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.
ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery
cs.LG 2026-05 unverdicted novelty 6.0

ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.
MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility
cs.LG 2026-05 conditional novelty 6.0

MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological fai...
OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research
cond-mat.mtrl-sci 2026-05 unverdicted novelty 6.0

OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
q-bio.NC 2026-05 unverdicted novelty 6.0

Natural language descriptions generated via a closed-loop pipeline with digital twins capture the selectivity of most neurons in macaque V1 and V4, with synthesized images driving 96% of V4 neurons into the top or bot...
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive
cs.AI 2026-05 unverdicted novelty 6.0

AutoLLMResearch trains agents in a multi-fidelity LLMConfig-Gym environment formulated as a long-horizon MDP to enable cross-fidelity extrapolation for automating high-cost LLM experiment configurations.
Unlocking LLM Creativity in Science through Analogical Reasoning
cs.AI 2026-05 conditional novelty 6.0

Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
cs.AI 2026-05 unverdicted novelty 6.0

NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
cs.AI 2026-05 unverdicted novelty 6.0

ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
cs.AI 2026-05 unverdicted novelty 6.0

TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · cited by 127 Pith papers · 9 internal anchors

[1]

Meta-learning curiosity algorithms

Ferran Alet, Martin F Schneider, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Meta-learning curiosity algorithms. arXiv preprint arXiv:2003.05325, 2020

work page arXiv 2003
[2]

Artificial intelligence in scientific writing: a friend or a foe?

Signe Altmäe, Alberto Sola-Leyva, and Andres Salumets. Artificial intelligence in scientific writing: a friend or a foe? Reproductive BioMedicine Online, 47(1):3--9, 2023

work page 2023
[3]

Model card and evaluations for claude models, 2023

Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf

work page 2023
[4]

The claude 3 model family: Opus, sonnet, haiku, 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

work page 2024
[5]

Cloud labs: where robots do the research

Carrie Arnold. Cloud labs: where robots do the research. Nature, 606(7914):612--613, 2022

work page 2022
[6]

Researchagent: Iterative research idea generation over scientific literature with large language models

Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models, 2024. URL https://arxiv.org/abs/2404.07738

work page arXiv 2024
[7]

Iclr2022-openreviewdata, 2024

Federico Berto. Iclr2022-openreviewdata, 2024. URL https://github.com/fedebotu/ICLR2022-OpenReviewData

work page 2024
[8]

The neurips 2021 consistency experiment

Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan. The neurips 2021 consistency experiment. Neural Information Processing Systems blog post, 2021. URL https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment

work page 2021
[9]

Quality-diversity through ai feedback

Herbie Bradley, Andrew Dai, Hannah Benita Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Gregory Schott, and Joel Lehman. Quality-diversity through ai feedback. In The Twelfth International Conference on Learning Representations, 2024

work page 2024
[10]

Minimal criterion coevolution: a new approach to open-ended search

Jonathan C Brant and Kenneth O Stanley. Minimal criterion coevolution: a new approach to open-ended search. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 67--74, 2017

work page 2017
[11]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page 2020
[12]

Dendral and meta-dendral: Their applications dimension

Bruce G Buchanan and Edward A Feigenbaum. Dendral and meta-dendral: Their applications dimension. In Readings in artificial intelligence, pages 313--322. Elsevier, 1981

work page 1981
[13]

Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023. URL https://arxiv.org/abs/2312.09390

work page arXiv 2023
[14]

What is this thing called science?

Alan Chalmers. What is this thing called science? McGraw-Hill Education (UK), 2013

work page 2013
[15]

Evoprompting: Language models for code-level neural architecture search

Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36, 2024a

work page 2024
[16]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021
[17]

Symbolic discovery of optimization algorithms

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36, 2024b

work page 2024
[18]

Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence

Jeff Clune. Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019

work page arXiv 1905
[19]

Marg: Multi-agent review generation for scientific papers

Mike D'Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers, 2024. URL https://arxiv.org/abs/2401.04259

work page arXiv 2024
[20]

J. Dewey. How We Think. D.C. Heath & Company, 1910. ISBN 9781519501868. URL https://books.google.co.uk/books?id=WF0AAAAAMAAJ

work page 1910
[21]

Quality diversity through human feedback: Towards open-ended diversity-driven optimization

Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, and Joel Lehman. Quality diversity through human feedback: Towards open-ended diversity-driven optimization. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=9zlZuAAb08

work page 2024
[22]

Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024

Marius-Constantin Dinu, Claudiu Leoveanu-Condrei, Markus Holzleitner, Werner Zellinger, and Sepp Hochreiter. Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024. URL https://arxiv.org/abs/2402.00854

work page arXiv 2024
[23]

Art and the science of generative ai

Ziv Epstein, Aaron Hertzmann, Investigators of Human Creativity, Memo Akten, Hany Farid, Jessica Fjeld, Morgan R Frank, Matthew Groh, Laura Herman, Neil Leach, et al. Art and the science of generative ai. Science, 380(6650):1110--1111, 2023

work page 2023
[24]

Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code

Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024. URL https://arxiv.org/abs/2405.15568

work page arXiv 2024
[25]

Integrating quantitative and qualitative discovery: the abacus system

Brian C Falkenhainer and Ryszard S Michalski. Integrating quantitative and qualitative discovery: the abacus system. Machine Learning, 1:367--401, 1986

work page 1986
[26]

Discovering faster matrix multiplication algorithms with reinforcement learning

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47--53, 2022

work page 2022
[27]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1--39, 2022. URL http://jmlr.org/papers/v23/21-0998.html

work page 2022
[28]

Semantic scholar

Suzanne Fricke. Semantic scholar. Journal of the Medical Library Association: JMLA, 106(1):145, 2018

work page 2018
[29]

aider, 2024

Paul Gauthier. aider, 2024. URL https://github.com/paul-gauthier/aider

work page 2024
[30]

Probabilistic machine learning and artificial intelligence

Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452--459, 2015

work page 2015
[31]

Ideas are dimes a dozen: Large language models for idea generation in innovation

Karan Girotra, Lennart Meincke, Christian Terwiesch, and Karl T Ulrich. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071, 2023

work page 2023
[32]

Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249--256. JMLR Workshop and Conference Proceedings, 2010

work page 2010
[33]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceeding...

work page 2014
[34]

Gemini: A family of highly capable multimodal models, 2023

Google DeepMind Gemini Team. Gemini: A family of highly capable multimodal models, 2023

work page 2023
[35]

DiffiT: Diffusion vision transformers for image generation

Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation, 2024. URL https://arxiv.org/abs/2312.02139

work page arXiv 2024
[36]

Simulating 500 million years of evolution with a language model

Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, pages 2024--07, 2024

work page 2024
[37]

Automl: A survey of the state-of-the-art

Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art. Knowledge-based systems, 212:106622, 2021

work page 2021
[38]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840--6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf

work page 2020
[39]

Deep Paper Gestalt

Jia-Bin Huang. Deep paper gestalt. arXiv preprint arXiv:1812.08775, 2018

work page Pith review arXiv 2018
[40]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024

work page 2024
[41]

Automated machine learning: methods, systems, challenges

Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019

work page 2019
[42]

The hutter prize, 2006

Marcus Hutter. The hutter prize, 2006. URL http://prize.hutter1.net

work page 2006
[43]

Autonomous llm-driven research from data to human-verifiable research papers, 2024

Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404.17605

work page arXiv 2024
[44]

The principles of science: A treatise on logic and scientific method

William Stanley Jevons. The principles of science: A treatise on logic and scientific method. Macmillan and Company, 1877

work page
[45]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[46]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2024
[47]

Highly accurate protein structure prediction with alphafold

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583--589, 2021

work page 2021
[48]

The unreasonable effectiveness of recurrent neural networks, 2015

Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks, 2015. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/

work page 2015
[49]

NanoGPT, 2022

Andrej Karpathy. NanoGPT, 2022. URL https://github.com/karpathy/nanoGPT

work page 2022
[50]

A survey of research on cloud robotics and automation

Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Transactions on automation science and engineering, 12(2):398--409, 2015

work page 2015
[51]

Auto-Encoding Variational Bayes

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014

work page 2014
[52]

Improving generalization in meta reinforcement learning using learned objectives

Louis Kirsch, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. arXiv preprint arXiv:1910.04098, 2019

work page arXiv 1910
[53]

Discovering attention-based genetic algorithms via meta-black-box optimization

Robert Lange, Tom Schaul, Yutian Chen, Chris Lu, Tom Zahavy, Valentin Dalibard, and Sebastian Flennerhag. Discovering attention-based genetic algorithms via meta-black-box optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 929--937, 2023a

work page 2023
[54]

Discovering evolution strategies via meta-black-box optimization

Robert Lange, Tom Schaul, Yutian Chen, Tom Zahavy, Valentin Dalibard, Chris Lu, Satinder Singh, and Sebastian Flennerhag. Discovering evolution strategies via meta-black-box optimization. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pages 29--30, 2023b

work page 2023
[55]

Large language models as evolution strategies

Robert Tjarko Lange, Yingtao Tian, and Yujin Tang. Large language models as evolution strategies. arXiv preprint arXiv:2402.18381, 2024

work page arXiv 2024
[56]

Scientific discovery: Computational explorations of the creative processes

Pat Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987

work page 1987
[57]

Integrated systems for computational scientific discovery

Pat Langley. Integrated systems for computational scientific discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 22598--22606, 2024

work page 2024
[58]

Exploiting open-endedness to solve problems through the search for novelty

Joel Lehman, Kenneth O Stanley, et al. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pages 329--336, 2008

work page 2008
[59]

The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities

Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. Artificial life, 26(2):274--306, 2020

work page 2020
[60]

Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models, 2022. URL https://arxiv.org/abs/2206.08896

work page arXiv 2022
[61]

Evolution through large models

Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331--366. Springer, 2023

work page 2023
[62]

Automated theory formation in mathematics

Douglas B Lenat. Automated theory formation in mathematics. In IJCAI, volume 77, pages 833--842, 1977

work page 1977
[63]

Why am and eurisko appear to work

Douglas B Lenat and John Seely Brown. Why am and eurisko appear to work. Artificial intelligence, 23(3):269--294, 1984

work page 1984
[64]

Can large language models provide useful feedback on research papers? a large-scale empirical analysis

Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, page AIoa2400196, 2024

work page 2024
[65]

Large language models as in-context ai generators for quality-diversity

Bryan Lim, Manon Flageat, and Antoine Cully. Large language models as in-context ai generators for quality-diversity. arXiv preprint arXiv:2404.15794, 2024

work page arXiv 2024
[66]

The Llama 3 Herd of Models

Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024
[67]

Discovered policy optimisation

Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35:16455--16468, 2022a

work page 2022
[68]

Discovering preference optimization algorithms with and for large language models

Chris Lu, Samuel Holt, Claudio Fanconi, Alex J Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering preference optimization algorithms with and for large language models. arXiv preprint arXiv:2406.08414, 2024a

work page arXiv 2024
[69]

Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, and Stephen J. Roberts. Revisiting design choices in offline model based reinforcement learning. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=zz9hXVhf40

work page 2022
[70]

Intelligent go-explore: Standing on the shoulders of giant foundation models

Cong Lu, Shengran Hu, and Jeff Clune. Intelligent go-explore: Standing on the shoulders of giant foundation models, 2024b. URL https://arxiv.org/abs/2405.15143

work page arXiv 2024
[71]

Eureka: Human-Level Reward Design via Coding Large Language Models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023

work page internal anchor Pith review arXiv 2023
[72]

About the test data, 2011

Matt Mahoney. About the test data, 2011. URL http://mattmahoney.net/dc/textdata.html

work page 2011
[73]

Discoverybench: Towards data-driven discovery with large language models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models, 2024. URL https://arxiv.org/abs/2407.01725

work page arXiv 2024
[74]

grokking, 2022

Daniel May. grokking, 2022. URL https://github.com/danielmamay/grokking

work page 2022
[75]

Scaling deep learning for materials discovery

Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990):80--85, 2023

work page 2023
[76]

Velo: Training versatile learned optimizers by scaling up

Luke Metz, James Harrison, C Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, et al. Velo: Training versatile learned optimizers by scaling up. arXiv preprint arXiv:2211.09760, 2022

work page arXiv 2022
[77]

A robust approach to numeric discovery

Bernd Nordhausen and Pat Langley. A robust approach to numeric discovery. In Machine learning proceedings 1990, pages 411--418. Elsevier, 1990

work page 1990
[78]

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[79]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023
[80]

tiny-diffusion, 2023

Tanel Pärnamaa. tiny-diffusion, 2023. URL https://github.com/tanelp/tiny-diffusion

work page 2023

Showing first 80 references.

[1] [1]

Meta-learning curiosity algorithms

Ferran Alet, Martin F Schneider, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Meta-learning curiosity algorithms. arXiv preprint arXiv:2003.05325, 2020

work page arXiv 2003

[2] [2]

Artificial intelligence in scientific writing: a friend or a foe? Reproductive BioMedicine Online, 47 0 (1): 0 3--9, 2023

Signe Altm \"a e, Alberto Sola-Leyva, and Andres Salumets. Artificial intelligence in scientific writing: a friend or a foe? Reproductive BioMedicine Online, 47 0 (1): 0 3--9, 2023

work page 2023

[3] [3]

Model card and evaluations for claude models, 2023

Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf

work page 2023

[4] [4]

The claude 3 model family: Opus, sonnet, haiku, 2024

Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

work page 2024

[5] [5]

Cloud labs: where robots do the research

Carrie Arnold. Cloud labs: where robots do the research. Nature, 606 0 (7914): 0 612--613, 2022

work page 2022

[6] [6]

K., Cucerzan, S

Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models, 2024. URL https://arxiv.org/abs/2404.07738

work page arXiv 2024

[7] [7]

Iclr2022-openreviewdata, 2024

Federico Berto. Iclr2022-openreviewdata, 2024. URL https://github.com/fedebotu/ICLR2022-OpenReviewData

work page 2024

[8] [8]

The neurips 2021 consistency experiment

Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan. The neurips 2021 consistency experiment. Neural Information Processing Systems blog post, 2021. URL https://blog. neurips. cc/2021/12/08/the-neurips-2021-consistency-experiment

work page 2021

[9] [9]

Quality-diversity through ai feedback

Herbie Bradley, Andrew Dai, Hannah Benita Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Gregory Schott, and Joel Lehman. Quality-diversity through ai feedback. In The Twelfth International Conference on Learning Representations, 2024

work page 2024

[10] [10]

Minimal criterion coevolution: a new approach to open-ended search

Jonathan C Brant and Kenneth O Stanley. Minimal criterion coevolution: a new approach to open-ended search. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 67--74, 2017

work page 2017

[11] [11]

Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

work page 2020

[12] [12]

Dendral and meta-dendral: Their applications dimension

Bruce G Buchanan and Edward A Feigenbaum. Dendral and meta-dendral: Their applications dimension. In Readings in artificial intelligence, pages 313--322. Elsevier, 1981

work page 1981

[13] [13]

Burns, P

Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023. URL https://arxiv.org/abs/2312.09390

work page arXiv 2023

[14] [14]

What is this thing called science? McGraw-Hill Education (UK), 2013

Alan Chalmers. What is this thing called science? McGraw-Hill Education (UK), 2013

work page 2013

[15] [15]

Evoprompting: Language models for code-level neural architecture search

Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36, 2024 a

work page 2024

[16] [16]

Evaluating Large Language Models Trained on Code

Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

work page internal anchor Pith review Pith/arXiv arXiv 2021

[17] [17]

Symbolic discovery of optimization algorithms

Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36, 2024 b

work page 2024

[18] [18]

Clune, Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence

Jeff Clune. Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019

work page arXiv 1905

[19] [19]

Marg: Multi-agent review generation for scientific papers

Mike D'Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers, 2024. URL https://arxiv.org/abs/2401.04259

work page arXiv 2024

[20] [20]

J. Dewey. How We Think. D.C. Heath & Company, 1910. ISBN 9781519501868. URL https://books.google.co.uk/books?id=WF0AAAAAMAAJ

work page 1910

[21] [21]

Quality diversity through human feedback: Towards open-ended diversity-driven optimization

Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, and Joel Lehman. Quality diversity through human feedback: Towards open-ended diversity-driven optimization. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=9zlZuAAb08

work page 2024

[22] [22]

Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024

Marius-Constantin Dinu, Claudiu Leoveanu-Condrei, Markus Holzleitner, Werner Zellinger, and Sepp Hochreiter. Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024. URL https://arxiv.org/abs/2402.00854

work page arXiv 2024

[23] [23]

Art and the science of generative ai

Ziv Epstein, Aaron Hertzmann, Investigators of Human Creativity, Memo Akten, Hany Farid, Jessica Fjeld, Morgan R Frank, Matthew Groh, Laura Herman, Neil Leach, et al. Art and the science of generative ai. Science, 380(6650):1110--1111, 2023

work page 2023

[24] [24]

Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code

Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024. URL https://arxiv.org/abs/2405.15568

work page arXiv 2024

[25] [25]

Integrating quantitative and qualitative discovery: the abacus system

Brian C Falkenhainer and Ryszard S Michalski. Integrating quantitative and qualitative discovery: the abacus system. Machine Learning, 1:367--401, 1986

work page 1986

[26] [26]

Discovering faster matrix multiplication algorithms with reinforcement learning

Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930):47--53, 2022

work page 2022

[27] [27]

Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120):1--39, 2022. URL http://jmlr.org/papers/v23/21-0998.html

work page 2022

[28] [28]

Semantic scholar

Suzanne Fricke. Semantic scholar. Journal of the Medical Library Association: JMLA, 106(1):145, 2018

work page 2018

[29] [29]

aider, 2024

Paul Gauthier. aider, 2024. URL https://github.com/paul-gauthier/aider

work page 2024

[30] [30]

Probabilistic machine learning and artificial intelligence

Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553):452--459, 2015

work page 2015

[31] [31]

Ideas are dimes a dozen: Large language models for idea generation in innovation

Karan Girotra, Lennart Meincke, Christian Terwiesch, and Karl T Ulrich. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071, 2023

work page 2023

[32] [32]

Understanding the difficulty of training deep feedforward neural networks

Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249--256. JMLR Workshop and Conference Proceedings, 2010

work page 2010

[33] [33]

Generative adversarial nets

Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceeding...

work page 2014

[34] [34]

Gemini: A family of highly capable multimodal models, 2023

Google DeepMind Gemini Team. Gemini: A family of highly capable multimodal models, 2023

work page 2023

[35] [35]

DiffiT: Diffusion vision transformers for image generation

Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation, 2024. URL https://arxiv.org/abs/2312.02139

work page arXiv 2024

[36] [36]

Simulating 500 million years of evolution with a language model

Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, pages 2024--07, 2024

work page 2024

[37] [37]

Automl: A survey of the state-of-the-art

Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art. Knowledge-based systems, 212:106622, 2021

work page 2021

[38] [38]

Denoising diffusion probabilistic models

Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840--6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf

work page 2020

[39] [39]

Deep Paper Gestalt

Jia-Bin Huang. Deep paper gestalt. arXiv preprint arXiv:1812.08775, 2018

work page Pith review arXiv 2018

[40] [40]

Mlagentbench: Evaluating language agents on machine learning experimentation

Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024

work page 2024

[41] [41]

Automated machine learning: methods, systems, challenges

Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019

work page 2019

[42] [42]

The hutter prize, 2006

Marcus Hutter. The hutter prize, 2006. URL http://prize.hutter1.net

work page 2006

[43] [43]

Autonomous llm-driven research from data to human-verifiable research papers, 2024

Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404.17605

work page arXiv 2024

[44] [44]

The principles of science: A treatise on logic and scientific method

William Stanley Jevons. The principles of science: A treatise on logic and scientific method. Macmillan and Company, 1877

work page

[45] [45]

Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

work page internal anchor Pith review Pith/arXiv arXiv 2024

[46] [46]

SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770

work page internal anchor Pith review Pith/arXiv arXiv 2024

[47] [47]

Highly accurate protein structure prediction with alphafold

John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873):583--589, 2021

work page 2021

[48] [48]

The unreasonable effectiveness of recurrent neural networks, 2015

Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks, 2015. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/

work page 2015

[49] [49]

NanoGPT, 2022

Andrej Karpathy. NanoGPT, 2022. URL https://github.com/karpathy/nanoGPT

work page 2022

[50] [50]

A survey of research on cloud robotics and automation

Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Transactions on automation science and engineering, 12(2):398--409, 2015

work page 2015

[51] [51]

Auto-Encoding Variational Bayes

Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014

work page 2014

[52] [52]

Improving generalization in meta reinforcement learning using learned objectives

Louis Kirsch, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. arXiv preprint arXiv:1910.04098, 2019

work page arXiv 1910

[53] [53]

Discovering attention-based genetic algorithms via meta-black-box optimization

Robert Lange, Tom Schaul, Yutian Chen, Chris Lu, Tom Zahavy, Valentin Dalibard, and Sebastian Flennerhag. Discovering attention-based genetic algorithms via meta-black-box optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 929--937, 2023a

work page 2023

[54] [54]

Discovering evolution strategies via meta-black-box optimization

Robert Lange, Tom Schaul, Yutian Chen, Tom Zahavy, Valentin Dalibard, Chris Lu, Satinder Singh, and Sebastian Flennerhag. Discovering evolution strategies via meta-black-box optimization. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pages 29--30, 2023b

work page 2023

[55] [55]

Large language models as evolution strategies

Robert Tjarko Lange, Yingtao Tian, and Yujin Tang. Large language models as evolution strategies. arXiv preprint arXiv:2402.18381, 2024

work page arXiv 2024

[56] [56]

Scientific discovery: Computational explorations of the creative processes

Pat Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987

work page 1987

[57] [57]

Integrated systems for computational scientific discovery

Pat Langley. Integrated systems for computational scientific discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 22598--22606, 2024

work page 2024

[58] [58]

Exploiting open-endedness to solve problems through the search for novelty

Joel Lehman, Kenneth O Stanley, et al. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pages 329--336, 2008

work page 2008

[59] [59]

The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities

Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. Artificial life, 26(2):274--306, 2020

work page 2020

[60] [60]

Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models, 2022. URL https://arxiv.org/abs/2206.08896

work page arXiv 2022

[61] [61]

Evolution through large models

Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331--366. Springer, 2023

work page 2023

[62] [62]

Automated theory formation in mathematics

Douglas B Lenat. Automated theory formation in mathematics. In IJCAI, volume 77, pages 833--842, 1977

work page 1977

[63] [63]

Why am and eurisko appear to work

Douglas B Lenat and John Seely Brown. Why am and eurisko appear to work. Artificial intelligence, 23(3):269--294, 1984

work page 1984

[64] [64]

Can large language models provide useful feedback on research papers? a large-scale empirical analysis

Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, page AIoa2400196, 2024

work page 2024

[65] [65]

Large language models as in-context ai generators for quality-diversity

Bryan Lim, Manon Flageat, and Antoine Cully. Large language models as in-context ai generators for quality-diversity. arXiv preprint arXiv:2404.15794, 2024

work page arXiv 2024

[66] [66]

The Llama 3 Herd of Models

Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

work page internal anchor Pith review Pith/arXiv arXiv 2024

[67] [67]

Discovered policy optimisation

Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35:16455--16468, 2022a

work page 2022

[68] [68]

Discovering preference optimization algorithms with and for large language models

Chris Lu, Samuel Holt, Claudio Fanconi, Alex J Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering preference optimization algorithms with and for large language models. arXiv preprint arXiv:2406.08414, 2024a

work page arXiv 2024

[69] [69]

Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, and Stephen J. Roberts. Revisiting design choices in offline model based reinforcement learning. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=zz9hXVhf40

work page 2022

[70] [70]

Intelligent go-explore: Standing on the shoulders of giant foundation models

Cong Lu, Shengran Hu, and Jeff Clune. Intelligent go-explore: Standing on the shoulders of giant foundation models, 2024b. URL https://arxiv.org/abs/2405.15143

work page arXiv 2024

[71] [71]

Eureka: Human-Level Reward Design via Coding Large Language Models

Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023

work page internal anchor Pith review arXiv 2023

[72] [72]

About the test data, 2011

Matt Mahoney. About the test data, 2011. URL http://mattmahoney.net/dc/textdata.html

work page 2011

[73] [73]

Discoverybench: Towards data-driven discovery with large language models

Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models, 2024. URL https://arxiv.org/abs/2407.01725

work page arXiv 2024

[74] [74]

grokking, 2022

Daniel May. grokking, 2022. URL https://github.com/danielmamay/grokking

work page 2022

[75] [75]

Scaling deep learning for materials discovery

Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990):80--85, 2023

work page 2023

[76] [76]

Velo: Training versatile learned optimizers by scaling up

Luke Metz, James Harrison, C Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, et al. Velo: Training versatile learned optimizers by scaling up. arXiv preprint arXiv:2211.09760, 2022

work page arXiv 2022

[77] [77]

A robust approach to numeric discovery

Bernd Nordhausen and Pat Langley. A robust approach to numeric discovery. In Machine learning proceedings 1990, pages 411--418. Elsevier, 1990

work page 1990

[78] [78]

In-context Learning and Induction Heads

Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[79] [79]

Gpt-4 technical report, 2023

OpenAI. Gpt-4 technical report, 2023

work page 2023

[80] [80]

tiny-diffusion, 2023

Tanel Pärnamaa. tiny-diffusion, 2023. URL https://github.com/tanelp/tiny-diffusion

work page 2023