pith. machine review for the scientific record.

arxiv: 2408.06292 · v3 · submitted 2024-08-12 · 💻 cs.AI · cs.CL · cs.LG

Recognition: no theorem link

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Chris Lu, Cong Lu, David Ha, Jakob Foerster, Jeff Clune, Robert Tjarko Lange

Pith reviewed 2026-05-11 04:36 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords automated scientific discovery · large language models · AI research agents · machine learning · autonomous paper generation · self-review process · open-ended discovery

The pith

Frontier large language models can autonomously conduct full scientific research cycles using the AI Scientist framework, producing papers that pass automated conference-level review.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes The AI Scientist, a framework that enables large language models to independently manage the complete scientific process. The system generates research ideas, implements them through code and experiments, creates visualizations, writes full papers, and evaluates the results with its own simulated review process. It is tested on three machine learning subfields, with each paper costing less than fifteen dollars. The authors also build an automated reviewer that scores papers at near-human levels, and some AI-generated papers exceed the acceptance bar according to this reviewer. This represents a step toward AI agents driving open-ended discovery in machine learning research.

Core claim

The AI Scientist is the first comprehensive framework for fully automatic scientific discovery. It allows frontier large language models to generate novel research ideas, write code, execute experiments, visualize results, write full scientific papers, and run a simulated review process. This can be repeated iteratively in an open-ended way. Applied to diffusion modeling, transformer-based language modeling, and learning dynamics, it produces papers at less than $15 each. The automated reviewer achieves near-human performance, and the system generates papers that exceed the acceptance threshold at a top machine learning conference.

What carries the argument

The AI Scientist framework, which sequences LLM capabilities to cover the entire research pipeline from idea generation to self-assessment.
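The staged pipeline the paper describes — idea, code and experiments, figures, paper, simulated review — can be sketched as a simple loop. All function names and the acceptance threshold below are illustrative placeholders, not the open-sourced AI Scientist implementation.

```python
# Hypothetical sketch of the idea-to-review loop described in the paper.
# Every callable here is a placeholder the caller supplies; the score
# threshold is an assumed calibration value, not the paper's.

def run_ai_scientist(generate_idea, implement, visualize, write_paper,
                     review, accept_threshold=6.0, max_iterations=3):
    """Run the full research loop; keep papers the simulated reviewer
    scores at or above the acceptance threshold."""
    accepted = []
    archive = []  # prior ideas, so later iterations can build on them
    for _ in range(max_iterations):
        idea = generate_idea(archive)
        results = implement(idea)          # code + experiments
        figures = visualize(results)
        paper = write_paper(idea, results, figures)
        score = review(paper)              # simulated review process
        archive.append(idea)
        if score >= accept_threshold:
            accepted.append((paper, score))
    return accepted
```

The archive argument is what makes the loop open-ended in principle: each iteration can condition idea generation on everything produced so far.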

If this is right

  • Open-ended iteration of the process can mimic the human scientific community in developing ideas.
  • The generated papers can meet or exceed acceptance thresholds for top machine learning conferences per the automated reviewer.
  • The system shows versatility across distinct machine learning subfields, including diffusion modeling, language modeling, and learning dynamics.
  • Full research papers can be produced at low cost, under fifteen dollars each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • This could enable much higher throughput in exploring new ideas within AI research if the quality holds up under human scrutiny.
  • Similar systems might eventually be adapted for discovery in other scientific fields, though domain-specific tools would be needed.
  • Long-term use might create feedback loops where AI builds upon its own prior discoveries without human input.

Load-bearing premise

The automated reviewer provides an accurate assessment of paper quality comparable to human experts at top conferences.

What would settle it

Submitting the AI-generated papers to a real top-tier machine learning conference and observing whether human reviewers accept or reject them.

read the original abstract

One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces The AI Scientist, a framework enabling frontier LLMs to autonomously generate novel research ideas, implement code and run experiments, visualize results, write full scientific papers, and evaluate them through a simulated review process. Applied to diffusion modeling, transformer language modeling, and learning dynamics, it claims to produce papers at under $15 each, with some exceeding top-ML-conference acceptance thresholds as scored by an internally designed automated reviewer that achieves near-human performance. The process is presented as repeatable for open-ended discovery, with code open-sourced.

Significance. If the central claims hold after addressing evaluation gaps, this would be a notable step toward fully automated scientific discovery in machine learning, demonstrating a closed-loop system for idea-to-paper generation at low cost and highlighting potential for iterative research. The open-sourcing of code strengthens reproducibility and invites community extensions, though the current lack of external validation limits immediate impact on the broader scientific process.

major comments (3)
  1. [Automated Reviewer section] The paper's core claim—that generated papers exceed conference acceptance thresholds—rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.
  2. [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.
  3. [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer and papers exceeding acceptance thresholds provides no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.
minor comments (2)
  1. [Figures and cost analysis] The workflow diagram and cost breakdowns would benefit from clearer labels and step-by-step explanations to improve readability for readers unfamiliar with the pipeline.
  2. [Methods description] Some terms (e.g., specific LLM sampling parameters) are referenced without initial definition or explicit values in the methods description.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important areas for improving the clarity and rigor of our evaluation. We address each major comment point by point below, indicating planned revisions to the manuscript where appropriate. Our goal is to strengthen the presentation of the automated reviewer and experimental results without altering the core contributions of the AI Scientist framework.

read point-by-point responses
  1. Referee: [Automated Reviewer section] The paper's core claim—that generated papers exceed conference acceptance thresholds—rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.

    Authors: We agree that the manuscript would benefit from greater transparency on the automated reviewer. The current version describes its design and validation at a high level but omits specific quantitative details. In the revision, we will expand the Automated Reviewer section to include: the composition of the training corpus (human-written papers from prior NeurIPS/ICML/ICLR proceedings), calibration details against historical acceptance rates, Pearson/Spearman correlations with human reviewer scores, and performance metrics on a held-out blind test set. We will also explicitly note that the reviewer was trained exclusively on human papers to mitigate self-reference concerns. These additions will be supported by new tables and figures. revision: yes
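The Pearson and Spearman correlations the rebuttal promises are simple to state precisely. A minimal stdlib sketch, assuming paired score lists for the same papers (real validation would use scipy.stats):

```python
# Agreement check between automated-reviewer scores and human scores
# on the same papers: Pearson r on raw scores, Spearman rho on ranks.
# Pure-stdlib sketch; ties get average ranks.

from statistics import mean

def pearson_r(x, y):
    mx, my = mean(x), mean(y)
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sum((a - mx) ** 2 for a in x) ** 0.5
    sy = sum((b - my) ** 2 for b in y) ** 0.5
    return cov / (sx * sy)

def ranks(x):
    # 1-based ranks, averaging over ties
    order = sorted(range(len(x)), key=lambda i: x[i])
    r = [0.0] * len(x)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and x[order[j + 1]] == x[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for k in range(i, j + 1):
            r[order[k]] = avg
        i = j + 1
    return r

def spearman_rho(x, y):
    return pearson_r(ranks(x), ranks(y))
```

Spearman is the more forgiving of the two here: it only asks that the automated reviewer order papers the way humans do, not that it reproduce their scale.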

  2. Referee: [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.

    Authors: We acknowledge the value of ablations and additional metrics for isolating contributions. The manuscript focuses on end-to-end feasibility rather than component-wise analysis, but we agree this limits interpretability. In revision, we will add: (1) basic ablation results comparing full pipeline performance against versions with simplified idea generation or execution modules; (2) quantitative novelty metrics such as n-gram overlap and citation similarity with existing literature; and (3) reported error rates for code execution failures and experimental soundness (e.g., percentage of runs that completed without runtime errors). Expert originality ratings remain resource-intensive and will be noted as a limitation with discussion of future work. These changes will appear in an expanded Section 5. revision: partial
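The n-gram overlap metric proposed in point (2) can be made concrete as Jaccard overlap of word n-grams between a generated abstract and prior-literature abstracts. This is an illustrative sketch, not the paper's method; preprocessing and the choice of n are assumptions.

```python
# Novelty proxy: highest Jaccard overlap of word n-grams between a
# candidate text and any document in a prior-literature corpus.
# High overlap suggests low novelty. Illustrative sketch only.

def ngrams(text, n=3):
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def max_overlap(candidate, corpus, n=3):
    """Return the candidate's highest n-gram Jaccard similarity
    against any single prior document."""
    cand = ngrams(candidate, n)
    best = 0.0
    for doc in corpus:
        other = ngrams(doc, n)
        if cand or other:
            best = max(best, len(cand & other) / len(cand | other))
    return best
```

A metric like this only flags verbatim reuse; it cannot detect an idea that is old but freshly worded, which is why the rebuttal pairs it with citation similarity.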

  3. Referee: [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer and papers exceeding acceptance thresholds provides no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.

    Authors: We will revise both the abstract and the results summary to include concrete supporting statistics. Specifically, we will report: inter-rater agreement (e.g., Cohen's kappa or correlation values) between the automated reviewer and human reviewers, the precise acceptance threshold calibrated from past conference data (e.g., average scores of accepted papers), and direct comparisons to real acceptance rates. These numbers will be added to the abstract and highlighted in the results section with references to the expanded validation details. revision: yes
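Cohen's kappa, the agreement statistic named in this response, corrects raw accept/reject agreement for the agreement two raters would reach by chance. A minimal sketch, assuming paired decision labels on the same papers:

```python
# Cohen's kappa between two raters' categorical decisions
# (e.g., automated reviewer vs. human: "accept"/"reject").
# kappa = (observed agreement - chance agreement) / (1 - chance agreement)

def cohens_kappa(rater_a, rater_b):
    assert len(rater_a) == len(rater_b) and rater_a
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    expected = sum(
        (rater_a.count(lab) / n) * (rater_b.count(lab) / n)
        for lab in labels
    )
    if expected == 1.0:  # both raters constant and identical
        return 1.0
    return (observed - expected) / (1 - expected)
```

A kappa near zero means the automated reviewer agrees with humans no more than chance would predict, even if raw agreement looks high.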

Circularity Check

1 step flagged

Central claim of exceeding conference thresholds rests on authors' self-designed automated reviewer

specific steps
  1. fitted input called prediction [Abstract]
    "To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer."

    The headline success metric ('exceed the acceptance threshold') is not an external or pre-existing benchmark but is computed by the authors' own reviewer, which they designed, validated, and then used to judge their system's outputs. This reduces the 'prediction' of research success to performance on an internally constructed evaluator, matching the fitted-input-called-prediction pattern.

full rationale

The paper's primary result—that The AI Scientist generates papers exceeding top-ML-conference acceptance thresholds—is defined entirely by scores from an automated reviewer the authors explicitly state they 'design and validate.' This creates a load-bearing self-referential evaluation loop. While the abstract claims near-human performance, no independent external benchmark (e.g., correlation with actual conference decisions on mixed human/LLM papers) is exhibited in the provided text. Other components (idea generation, code execution, paper writing) do not reduce to this loop, so the circularity is partial and confined to the success metric. This warrants a moderate score rather than 8-10, as the framework itself is not definitionally tautological.
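The external benchmark the audit finds missing would be straightforward to run: score a held-out set of human-written papers with known real conference decisions and check how often the reviewer's threshold reproduces them. A hypothetical sketch of that protocol (all names here are placeholders):

```python
# External validation the circularity audit calls for: compare the
# automated reviewer's thresholded decisions against real historical
# accept/reject outcomes on human-written papers it never scored before.

def external_validation(reviewer_score, papers, real_decisions, threshold):
    """papers: held-out paper texts; real_decisions: booleans
    (True = actually accepted). Returns (accuracy, predicted accept rate)."""
    predictions = [reviewer_score(p) >= threshold for p in papers]
    correct = sum(p == d for p, d in zip(predictions, real_decisions))
    return correct / len(papers), sum(predictions) / len(papers)
```

Reporting the predicted accept rate alongside accuracy matters: a reviewer calibrated on a venue with a 25% acceptance rate should accept roughly a quarter of a representative held-out set, not most of it.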

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework rests on the unproven assumption that current frontier LLMs possess sufficient capability for open-ended research tasks and introduces an internally validated reviewer whose independence from the generated content is not externally demonstrated.

free parameters (2)
  • LLM sampling parameters and model choice
    Specific temperature, top-p, and model versions used for idea generation and code writing are not detailed in the abstract but are central to reproducibility.
  • Automated reviewer acceptance threshold
    The numerical cutoff used to declare papers exceed top-conference standards is not specified.
axioms (1)
  • domain assumption: Frontier LLMs can reliably generate novel, implementable research ideas and produce correct experimental code without human intervention
    Invoked throughout the description of the AI Scientist pipeline.
invented entities (1)
  • Automated reviewer (no independent evidence)
    purpose: To score generated papers and determine acceptance without human input
    New component introduced and validated by the authors themselves.

pith-pipeline@v0.9.0 · 5617 in / 1596 out tokens · 67826 ms · 2026-05-11T04:36:48.483357+00:00 · methodology

discussion (0)


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

    cs.AI 2026-04 conditional novelty 9.0

    AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

  2. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  3. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  4. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

    cs.AI 2026-04 accept novelty 8.0

    AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

  5. FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

    physics.chem-ph 2026-04 conditional novelty 8.0

    FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...

  6. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.

  7. ASIA: an Autonomous System Identification Agent

    cs.AI 2026-05 unverdicted novelty 7.0

    ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.

  8. PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

    cs.AI 2026-05 unverdicted novelty 7.0

    PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

  9. Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery

    cs.AI 2026-05 unverdicted novelty 7.0

    HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...

  10. Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 conditional novelty 7.0

    Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

  11. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 conditional novelty 7.0

    AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.

  12. Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation

    cs.MA 2026-05 unverdicted novelty 7.0

    EIG represents research ideas as evolving graphs with nodes for claims and edges for relations, using a learned controller for edits and commits to produce higher-quality scientific proposals than text-only multi-agen...

  13. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  14. Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

    cs.SE 2026-04 unverdicted novelty 7.0

    Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

  15. End-to-end autonomous scientific discovery on a real optical platform

    cs.AI 2026-04 unverdicted novelty 7.0

    An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.

  16. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  17. Knows: Agent-Native Structured Research Representations

    cs.AI 2026-04 conditional novelty 7.0

    Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via ...

  18. ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    ReviewGrounder decomposes review generation into rubric-guided drafting and tool-integrated grounding stages, outperforming larger baseline models on a new benchmark measuring alignment with human judgments and review...

  19. VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

    cs.MA 2026-04 unverdicted novelty 7.0

    VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.

  20. Camyla: Scaling Autonomous Research in Medical Image Segmentation

    cs.AI 2026-04 unverdicted novelty 7.0

    Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.

  21. Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery

    cs.HC 2026-04 unverdicted novelty 7.0

    LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.

  22. $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

    cs.MS 2026-04 accept novelty 7.0

    k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.

  23. AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

    cs.CL 2026-04 unverdicted novelty 7.0

    AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.

  24. FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

    cs.AI 2026-04 conditional novelty 7.0

    FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.

  25. Letting the neural code speak: Automated characterization of monkey visual neurons through human language

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.

  26. Unlocking LLM Creativity in Science through Analogical Reasoning

    cs.AI 2026-05 conditional novelty 6.0

    Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.

  27. NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

    cs.AI 2026-05 unverdicted novelty 6.0

    NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.

  28. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

    cs.AI 2026-05 unverdicted novelty 6.0

    ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

  29. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

  30. Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration

    cs.AI 2026-05 unverdicted novelty 6.0

    NIAgent uses code-centric multi-agent collaboration and hierarchical verification to build adaptive neuroimaging pipelines that outperform static baselines on ADHD-200 and ADNI data.

  31. MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI

    cs.LG 2026-05 unverdicted novelty 6.0

    MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.

  32. CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models

    cs.LG 2026-05 unverdicted novelty 6.0

    CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...

  33. FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution

    cs.LG 2026-05 unverdicted novelty 6.0

    FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.

  34. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 6.0

    An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.

  35. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 unverdicted novelty 6.0

    An integrated AI agent framework for CFD uses vision-based physics gates to autonomously discover a Spalart-Allmaras runtime correction that cuts lower-wall skin-friction error by 7.89% versus DNS on the periodic hill...

  36. Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery

    cs.AI 2026-05 unverdicted novelty 6.0

    Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.

  37. One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

    cs.CL 2026-05 unverdicted novelty 6.0

    TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.

  38. One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue

    cs.CL 2026-05 unverdicted novelty 6.0

    TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.

  39. BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists

    q-bio.OT 2026-04 unverdicted novelty 6.0

    Agentic biological AI systems like Biomni and K-Dense assist with dual-use tasks blocked by safeguards and gain performance uplift on WMDP proxies; BioVeil MATRIX is introduced as a 10-category taxonomy with 22 techni...

  40. Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists

    cs.AI 2026-04 unverdicted novelty 6.0

    Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-d...

  41. AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments

    cs.HC 2026-04 unverdicted novelty 6.0

    AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, no...

  42. OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms

    cs.AI 2026-04 unverdicted novelty 6.0

    OMEGA framework generates novel ML classifiers via meta-prompts and executable code that outperform scikit-learn baselines on 20 benchmark datasets.

  43. TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

    cs.CL 2026-04 unverdicted novelty 6.0

    TSAssistant is a human-in-the-loop multi-agent system that generates citable, evidence-grounded sections for target safety assessment reports by coordinating specialized subagents with interactive user refinement.

  44. How Researchers Navigate Accountability, Transparency, and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study

    cs.CY 2026-04 unverdicted novelty 6.0

    A think-aloud study reveals that AI tools in early research misrepresent uncertainty, obscure provenance, and create fragile trust, leading researchers to develop compensatory strategies to preserve scholarly judgment.

  45. Rethinking Publication: A Certification Framework for AI-Enabled Research

    cs.AI 2026-04 unverdicted novelty 6.0

    A two-layer certification framework decouples knowledge validity from human authorship to accommodate AI-enabled research in existing publication systems.

  46. Rethinking Publication: A Certification Framework for AI-Enabled Research

    cs.AI 2026-04 conditional novelty 6.0

    The paper introduces a certification framework that grades AI research contributions into Categories A, B, and C based on pipeline reach at submission time and adds benchmark slots for fully automated work.

  47. A Scientific Human-Agent Reproduction Pipeline

    hep-ph 2026-04 unverdicted novelty 6.0

    SHARP is a human-AI collaboration pipeline for reproducing scientific analyses, demonstrated by recreating a jet classification task from a particle physics paper.

  48. HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution

    cs.CL 2026-04 unverdicted novelty 6.0

    HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.

  49. TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration

    cs.AI 2026-04 unverdicted novelty 6.0

    TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.

  50. Toward Autonomous Long-Horizon Engineering for ML Research

    cs.CL 2026-04 unverdicted novelty 6.0

    AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.

  51. Pioneer Agent: Continual Improvement of Small Language Models in Production

    cs.AI 2026-04 unverdicted novelty 6.0

    Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...

  52. ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation

    cs.AI 2026-04 unverdicted novelty 6.0

    ResearchEVO automates the discover-then-explain cycle by evolving algorithms via fitness-driven LLM co-evolution and generating grounded, anti-hallucination research papers through sentence-level RAG.

  53. Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations

    physics.comp-ph 2026-03 unverdicted novelty 6.0

    QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...

  54. Video models are zero-shot learners and reasoners

    cs.LG 2025-09 unverdicted novelty 6.0

    Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.

  55. Read, Grep, and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation

    cs.AI 2026-05 unverdicted novelty 5.0

    LLM research ideation benefits from exposure to diverse mechanisms across domains but does not yet exploit the specific semantic reasons for cross-domain seed retrieval.

  56. Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI

    cs.CY 2026-05 unverdicted novelty 5.0

    AI lowers the cost of generating plausible scientific artifacts without lowering verification costs, so the paper proposes blueprints as typed graph components that decompose claims, evidence, and assumptions to enabl...

  57. StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction

    cs.CL 2026-05 unverdicted novelty 5.0

    StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.

  58. From Experimental Limits to Physical Insight: A Retrieval-Augmented Multi-Agent Framework for Interpreting Searches Beyond the Standard Model

    hep-ex 2026-05 unverdicted novelty 5.0

    HEP-CoPilot is a new multi-agent retrieval framework that retrieves, reconstructs, and compares experimental limits from HEP literature and HEPData to support interpretation of beyond-Standard-Model searches.

  59. TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment

    cs.CL 2026-04 unverdicted novelty 5.0

    TSAssistant is a modular, human-in-the-loop multi-agent system that generates citable, section-specific drafts for target safety assessment reports by coordinating specialized sub-agents with biomedical data sources a...

  60. pAI/MSc: ML Theory Research with Humans on the Loop

    cs.AI 2026-04 unverdicted novelty 5.0

    pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · cited by 69 Pith papers · 7 internal anchors

  1. [1]

    Meta-learning curiosity algorithms

    Ferran Alet, Martin F Schneider, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Meta-learning curiosity algorithms. arXiv preprint arXiv:2003.05325, 2020

  2. [2]

    Artificial intelligence in scientific writing: a friend or a foe?

    Signe Altmäe, Alberto Sola-Leyva, and Andres Salumets. Artificial intelligence in scientific writing: a friend or a foe? Reproductive BioMedicine Online, 47(1): 3--9, 2023

  3. [3]

    Model card and evaluations for claude models, 2023

    Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf

  4. [4]

    The claude 3 model family: Opus, sonnet, haiku, 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  5. [5]

    Cloud labs: where robots do the research

    Carrie Arnold. Cloud labs: where robots do the research. Nature, 606(7914): 612--613, 2022

  6. [6]

    ResearchAgent: Iterative research idea generation over scientific literature with large language models

    Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models, 2024. URL https://arxiv.org/abs/2404.07738

  7. [7]

    ICLR2022-OpenReviewData, 2024

    Federico Berto. ICLR2022-OpenReviewData, 2024. URL https://github.com/fedebotu/ICLR2022-OpenReviewData

  8. [8]

    The neurips 2021 consistency experiment

    Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan. The neurips 2021 consistency experiment. Neural Information Processing Systems blog post, 2021. URL https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment

  9. [9]

    Quality-diversity through ai feedback

    Herbie Bradley, Andrew Dai, Hannah Benita Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Gregory Schott, and Joel Lehman. Quality-diversity through ai feedback. In The Twelfth International Conference on Learning Representations, 2024

  10. [10]

    Minimal criterion coevolution: a new approach to open-ended search

    Jonathan C Brant and Kenneth O Stanley. Minimal criterion coevolution: a new approach to open-ended search. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 67--74, 2017

  11. [11]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  12. [12]

    Dendral and meta-dendral: Their applications dimension

    Bruce G Buchanan and Edward A Feigenbaum. Dendral and meta-dendral: Their applications dimension. In Readings in artificial intelligence, pages 313--322. Elsevier, 1981

  13. [13]

    Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023. URL https://arxiv.org/abs/2312.09390

  14. [14]

    What is this thing called science? McGraw-Hill Education (UK), 2013

    Alan Chalmers. What is this thing called science? McGraw-Hill Education (UK), 2013

  15. [15]

    Evoprompting: Language models for code-level neural architecture search

    Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36, 2024a

  16. [16]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  17. [17]

    Symbolic discovery of optimization algorithms

    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36, 2024b

  18. [18]

    AI-GAs: AI-generating algorithms, an alternate paradigm for producing general artificial intelligence

    Jeff Clune. Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019

  19. [19]

    MARG: Multi-agent review generation for scientific papers

    Mike D'Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers, 2024. URL https://arxiv.org/abs/2401.04259

  20. [20]

    How We Think

    J. Dewey. How We Think. D.C. Heath & Company, 1910. ISBN 9781519501868. URL https://books.google.co.uk/books?id=WF0AAAAAMAAJ

  21. [21]

    Quality diversity through human feedback: Towards open-ended diversity-driven optimization

    Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, and Joel Lehman. Quality diversity through human feedback: Towards open-ended diversity-driven optimization. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=9zlZuAAb08

  22. [22]

    Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024

    Marius-Constantin Dinu, Claudiu Leoveanu-Condrei, Markus Holzleitner, Werner Zellinger, and Sepp Hochreiter. Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024. URL https://arxiv.org/abs/2402.00854

  23. [23]

    Art and the science of generative ai

    Ziv Epstein, Aaron Hertzmann, Investigators of Human Creativity, Memo Akten, Hany Farid, Jessica Fjeld, Morgan R Frank, Matthew Groh, Laura Herman, Neil Leach, et al. Art and the science of generative ai. Science, 380(6650): 1110--1111, 2023

  24. [24]

    Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code

    Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024. URL https://arxiv.org/abs/2405.15568

  25. [25]

    Integrating quantitative and qualitative discovery: the abacus system

    Brian C Falkenhainer and Ryszard S Michalski. Integrating quantitative and qualitative discovery: the abacus system. Machine Learning, 1: 367--401, 1986

  26. [26]

    Discovering faster matrix multiplication algorithms with reinforcement learning

    Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930): 47--53, 2022

  27. [27]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120): 1--39, 2022. URL http://jmlr.org/papers/v23/21-0998.html

  28. [28]

    Semantic scholar

    Suzanne Fricke. Semantic scholar. Journal of the Medical Library Association: JMLA, 106(1): 145, 2018

  29. [29]

    aider, 2024

    Paul Gauthier. aider, 2024. URL https://github.com/paul-gauthier/aider

  30. [30]

    Probabilistic machine learning and artificial intelligence

    Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553): 452--459, 2015

  31. [31]

    Ideas are dimes a dozen: Large language models for idea generation in innovation

    Karan Girotra, Lennart Meincke, Christian Terwiesch, and Karl T Ulrich. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071, 2023

  32. [32]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249--256. JMLR Workshop and Conference Proceedings, 2010

  33. [33]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceeding...

  34. [34]

    Gemini: A family of highly capable multimodal models, 2023

    Google DeepMind Gemini Team. Gemini: A family of highly capable multimodal models, 2023

  35. [35]

    Diffit: Diffusion vision transformers for image generation

    Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation, 2024. URL https://arxiv.org/abs/2312.02139

  36. [36]

    Simulating 500 million years of evolution with a language model

    Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, pages 2024--07, 2024

  37. [37]

    Automl: A survey of the state-of-the-art

    Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art. Knowledge-based systems, 212: 106622, 2021

  38. [38]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840--6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf

  39. [39]

    Deep paper gestalt

    Jia-Bin Huang. Deep paper gestalt. arXiv preprint arXiv:1812.08775, 2018

  40. [40]

    Mlagentbench: Evaluating language agents on machine learning experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024

  41. [41]

    Automated machine learning: methods, systems, challenges

    Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019

  42. [42]

    The hutter prize, 2006

    Marcus Hutter. The hutter prize, 2006. URL http://prize.hutter1.net

  43. [43]

    Autonomous llm-driven research from data to human-verifiable research papers, 2024

    Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404.17605

  44. [44]

    The principles of science: A treatise on logic and scientific method

    William Stanley Jevons. The principles of science: A treatise on logic and scientific method. Macmillan and Company, 1877

  45. [45]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  46. [46]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770

  47. [47]

    Highly accurate protein structure prediction with alphafold

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873): 583--589, 2021

  48. [48]

    The unreasonable effectiveness of recurrent neural networks, 2015

    Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks, 2015. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/

  49. [49]

    NanoGPT, 2022

    Andrej Karpathy. NanoGPT, 2022. URL https://github.com/karpathy/nanoGPT

  50. [50]

    A survey of research on cloud robotics and automation

    Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Transactions on automation science and engineering, 12(2): 398--409, 2015

  51. [51]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014

  52. [52]

    Improving generalization in meta reinforcement learning using learned objectives

    Louis Kirsch, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. arXiv preprint arXiv:1910.04098, 2019

  53. [53]

    Discovering attention-based genetic algorithms via meta-black-box optimization

    Robert Lange, Tom Schaul, Yutian Chen, Chris Lu, Tom Zahavy, Valentin Dalibard, and Sebastian Flennerhag. Discovering attention-based genetic algorithms via meta-black-box optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 929--937, 2023a

  54. [54]

    Discovering evolution strategies via meta-black-box optimization

    Robert Lange, Tom Schaul, Yutian Chen, Tom Zahavy, Valentin Dalibard, Chris Lu, Satinder Singh, and Sebastian Flennerhag. Discovering evolution strategies via meta-black-box optimization. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pages 29--30, 2023b

  55. [55]

    Large language models as evolution strategies

    Robert Tjarko Lange, Yingtao Tian, and Yujin Tang. Large language models as evolution strategies. arXiv preprint arXiv:2402.18381, 2024

  56. [56]

    Scientific discovery: Computational explorations of the creative processes

    Pat Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987

  57. [57]

    Integrated systems for computational scientific discovery

    Pat Langley. Integrated systems for computational scientific discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 22598--22606, 2024

  58. [58]

    Exploiting open-endedness to solve problems through the search for novelty

    Joel Lehman, Kenneth O Stanley, et al. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pages 329--336, 2008

  59. [59]

    The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities

    Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. Artificial life, 26(2): 274--306, 2020

  60. [60]

    Evolution through large models

    Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models, 2022. URL https://arxiv.org/abs/2206.08896

  61. [61]

    Evolution through large models

    Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331--366. Springer, 2023

  62. [62]

    Automated theory formation in mathematics

    Douglas B Lenat. Automated theory formation in mathematics. In IJCAI, volume 77, pages 833--842, 1977

  63. [63]

    Why am and eurisko appear to work

    Douglas B Lenat and John Seely Brown. Why am and eurisko appear to work. Artificial intelligence, 23(3): 269--294, 1984

  64. [64]

    Can large language models provide useful feedback on research papers? a large-scale empirical analysis

    Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, page AIoa2400196, 2024

  65. [65]

    Large language models as in-context ai generators for quality-diversity

    Bryan Lim, Manon Flageat, and Antoine Cully. Large language models as in-context ai generators for quality-diversity. arXiv preprint arXiv:2404.15794, 2024

  66. [66]

    The Llama 3 Herd of Models

    Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  67. [67]

    Discovered policy optimisation

    Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35: 16455--16468, 2022a

  68. [68]

    Discovering preference optimization algorithms with and for large language models

    Chris Lu, Samuel Holt, Claudio Fanconi, Alex J Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering preference optimization algorithms with and for large language models. arXiv preprint arXiv:2406.08414, 2024a

  69. [69]

    Revisiting design choices in offline model based reinforcement learning

    Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, and Stephen J. Roberts. Revisiting design choices in offline model based reinforcement learning. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=zz9hXVhf40

  70. [70]

    Intelligent go-explore: Standing on the shoulders of giant foundation models, 2024b

    Cong Lu, Shengran Hu, and Jeff Clune. Intelligent go-explore: Standing on the shoulders of giant foundation models, 2024b. URL https://arxiv.org/abs/2405.15143

  71. [71]

    Eureka: Human-level reward design via coding large language models

    Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023

  72. [72]

    About the test data, 2011

    Matt Mahoney. About the test data, 2011. URL http://mattmahoney.net/dc/textdata.html

  73. [73]

    Discoverybench: Towards data-driven discovery with large language models, 2024

    Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models, 2024. URL https://arxiv.org/abs/2407.01725

  74. [74]

    grokking, 2022

    Daniel May. grokking, 2022. URL https://github.com/danielmamay/grokking

  75. [75]

    Scaling deep learning for materials discovery

    Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990): 80--85, 2023

  76. [76]

    Velo: Training versatile learned optimizers by scaling up

    Luke Metz, James Harrison, C Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, et al. Velo: Training versatile learned optimizers by scaling up. arXiv preprint arXiv:2211.09760, 2022

  77. [77]

    A robust approach to numeric discovery

    Bernd Nordhausen and Pat Langley. A robust approach to numeric discovery. In Machine learning proceedings 1990, pages 411--418. Elsevier, 1990

  78. [78]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  79. [79]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  80. [80]

    tiny-diffusion, 2023

    Tanel Pärnamaa. tiny-diffusion, 2023. URL https://github.com/tanelp/tiny-diffusion

Showing first 80 references.