
arxiv: 2408.06292 · v3 · submitted 2024-08-12 · 💻 cs.AI · cs.CL · cs.LG

The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery

Pith reviewed 2026-05-11 04:36 UTC · model grok-4.3

classification 💻 cs.AI · cs.CL · cs.LG
keywords automated scientific discovery · large language models · AI research agents · machine learning · autonomous paper generation · self-review process · open-ended discovery

The pith

Frontier large language models can autonomously conduct full scientific research cycles using the AI Scientist framework, producing papers that pass automated conference-level review.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper proposes The AI Scientist, a framework that enables large language models to independently manage the complete scientific process. The system generates research ideas, implements them through code and experiments, creates visualizations, writes full papers, and performs its own review evaluation. It is tested on three machine learning subfields with each paper costing less than fifteen dollars. The authors also create an automated reviewer that scores papers near human levels, and some AI-generated papers exceed the acceptance bar according to this reviewer. This represents a step toward AI agents driving open-ended discovery in machine learning research.

Core claim

The AI Scientist is the first comprehensive framework for fully automatic scientific discovery. It allows frontier large language models to generate novel research ideas, write code, execute experiments, visualize results, write full scientific papers, and run a simulated review process. This can be repeated iteratively in an open-ended way. Applied to diffusion modeling, transformer-based language modeling, and learning dynamics, it produces papers at less than $15 each. The automated reviewer achieves near-human performance, and the system generates papers that exceed the acceptance threshold at a top machine learning conference.

What carries the argument

The AI Scientist framework, which sequences LLM capabilities to cover the entire research pipeline from idea generation to self-assessment.
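
The sequencing described above can be sketched as a simple loop. This is a minimal illustration, not the authors' implementation: the `Paper` record, the `run_pipeline` function, and the four stage callables are hypothetical names introduced here, and the toy lambdas stand in for LLM calls so that the sketch runs on its own.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Paper:
    """Hypothetical record of one pass through the pipeline."""
    idea: str
    results: dict = field(default_factory=dict)
    manuscript: str = ""
    review_score: float = 0.0


def run_pipeline(
    generate_idea: Callable[[List[Paper]], str],
    run_experiment: Callable[[str], dict],
    write_paper: Callable[[str, dict], str],
    review: Callable[[str], float],
    rounds: int,
) -> List[Paper]:
    """Run idea -> experiment -> write-up -> review for `rounds` iterations."""
    archive: List[Paper] = []
    for _ in range(rounds):
        idea = generate_idea(archive)  # may condition on earlier papers
        results = run_experiment(idea)
        manuscript = write_paper(idea, results)
        score = review(manuscript)
        archive.append(Paper(idea, results, manuscript, score))
    return archive


# Toy stand-ins (assumptions, no model calls) so the sketch is runnable.
archive = run_pipeline(
    generate_idea=lambda past: f"idea-{len(past)}",
    run_experiment=lambda idea: {"loss": 1.0 / (1 + len(idea))},
    write_paper=lambda idea, res: f"{idea}: loss={res['loss']:.3f}",
    review=lambda text: float(len(text) % 10),
    rounds=3,
)
print([p.idea for p in archive])
```

Passing the archive back into `generate_idea` is one way to express the open-ended iteration the paper describes; how the actual system conditions on earlier results is not specified in the text above.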

If this is right

  • Open-ended iteration of the process can mimic the human scientific community in developing ideas.
  • The generated papers can meet or exceed acceptance thresholds for top machine learning conferences per the automated reviewer.
  • The framework is versatile across distinct subfields of machine learning, including diffusion modeling, language modeling, and learning dynamics.
  • Full research papers can be produced for under fifteen dollars each.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • This could enable much higher throughput in exploring new ideas within AI research if the quality holds up under human scrutiny.
  • Similar systems might eventually be adapted for discovery in other scientific fields, though domain-specific tools would be needed.
  • Long-term use might create feedback loops where AI builds upon its own prior discoveries without human input.

Load-bearing premise

The automated reviewer provides an accurate assessment of paper quality comparable to human experts at top conferences.

What would settle it

Having the AI-generated papers submitted to a real top-tier machine learning conference and observing whether they are accepted or rejected based on human reviews.

Original abstract

One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces The AI Scientist, a framework enabling frontier LLMs to autonomously generate novel research ideas, implement code and run experiments, visualize results, write full scientific papers, and evaluate them through a simulated review process. Applied to diffusion modeling, transformer language modeling, and learning dynamics, it claims to produce papers at under $15 each, with some exceeding top-ML-conference acceptance thresholds as scored by an internally designed automated reviewer that achieves near-human performance. The process is presented as repeatable for open-ended discovery, with code open-sourced.

Significance. If the central claims hold after addressing evaluation gaps, this would be a notable step toward fully automated scientific discovery in machine learning, demonstrating a closed-loop system for idea-to-paper generation at low cost and highlighting potential for iterative research. The open-sourcing of code strengthens reproducibility and invites community extensions, though the current lack of external validation limits immediate impact on the broader scientific process.

major comments (3)
  1. [Automated Reviewer section] The paper's core claim, that generated papers exceed conference acceptance thresholds, rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.
  2. [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.
  3. [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer, and of papers exceeding acceptance thresholds, comes with no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.
minor comments (2)
  1. [Figures and cost analysis] The workflow diagram and cost breakdowns would benefit from clearer labels and step-by-step explanations to improve readability for readers unfamiliar with the pipeline.
  2. [Methods description] Some terms (e.g., specific LLM sampling parameters) are referenced without initial definition or explicit values in the methods description.
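
As an illustration of the "literature overlap" metric raised in major comment 2, the sketch below computes Jaccard similarity over word n-grams between a candidate idea and one prior abstract. It is one simple option among many, not a metric the paper reports; the function names and the two example strings are made up for this sketch.

```python
from typing import Set, Tuple


def ngrams(text: str, n: int) -> Set[Tuple[str, ...]]:
    """Set of lowercase, whitespace-tokenized word n-grams."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def ngram_overlap(candidate: str, reference: str, n: int = 2) -> float:
    """Jaccard similarity of the two n-gram sets; 0.0 if both are empty."""
    a, b = ngrams(candidate, n), ngrams(reference, n)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Made-up strings: 3 shared bigrams out of 7 distinct ones.
overlap = ngram_overlap(
    "adaptive learning rate for diffusion models",
    "adaptive learning rate for language models",
)
print(round(overlap, 3))  # 0.429
```

A high overlap flags close paraphrase of existing work, but a low overlap does not establish novelty, which is why the referee also lists expert originality ratings.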

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful and constructive comments, which highlight important areas for improving the clarity and rigor of our evaluation. We address each major comment point by point below, indicating planned revisions to the manuscript where appropriate. Our goal is to strengthen the presentation of the automated reviewer and experimental results without altering the core contributions of the AI Scientist framework.

Point-by-point responses
  1. Referee: [Automated Reviewer section] The paper's core claim, that generated papers exceed conference acceptance thresholds, rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.

    Authors: We agree that the manuscript would benefit from greater transparency on the automated reviewer. The current version describes its design and validation at a high level but omits specific quantitative details. In the revision, we will expand the Automated Reviewer section to include: the composition of the training corpus (human-written papers from prior NeurIPS/ICML/ICLR proceedings), calibration details against historical acceptance rates, Pearson/Spearman correlations with human reviewer scores, and performance metrics on a held-out blind test set. We will also explicitly note that the reviewer was trained exclusively on human papers to mitigate self-reference concerns. These additions will be supported by new tables and figures. revision: yes

  2. Referee: [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.

    Authors: We acknowledge the value of ablations and additional metrics for isolating contributions. The manuscript focuses on end-to-end feasibility rather than component-wise analysis, but we agree this limits interpretability. In revision, we will add: (1) basic ablation results comparing full pipeline performance against versions with simplified idea generation or execution modules; (2) quantitative novelty metrics such as n-gram overlap and citation similarity with existing literature; and (3) reported error rates for code execution failures and experimental soundness (e.g., percentage of runs that completed without runtime errors). Expert originality ratings remain resource-intensive and will be noted as a limitation with discussion of future work. These changes will appear in an expanded Section 5. revision: partial

  3. Referee: [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer, and of papers exceeding acceptance thresholds, comes with no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.

    Authors: We will revise both the abstract and the results summary to include concrete supporting statistics. Specifically, we will report: inter-rater agreement (e.g., Cohen's kappa or correlation values) between the automated reviewer and human reviewers, the precise acceptance threshold calibrated from past conference data (e.g., average scores of accepted papers), and direct comparisons to real acceptance rates. These numbers will be added to the abstract and highlighted in the results section with references to the expanded validation details. revision: yes
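
The agreement statistics named in the responses above (Cohen's kappa on accept/reject decisions, Spearman correlation on scores) can be computed as in the following sketch. It uses only the standard library, the helper names are introduced here, and all labels and scores are made-up illustrative data rather than numbers from the paper.

```python
from collections import Counter
from math import sqrt
from typing import List, Sequence


def cohens_kappa(a: Sequence[str], b: Sequence[str]) -> float:
    """Chance-corrected agreement between two raters' categorical labels."""
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    ca, cb = Counter(a), Counter(b)
    expected = sum(ca[k] * cb[k] for k in ca) / (n * n)
    if expected == 1.0:  # both raters always give the same single label
        return 1.0
    return (observed - expected) / (1 - expected)


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; assumes neither input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((p - mx) * (q - my) for p, q in zip(x, y))
    vx = sum((p - mx) ** 2 for p in x)
    vy = sum((q - my) ** 2 for q in y)
    return cov / sqrt(vx * vy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing their average rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        for k in range(i, j + 1):
            ranks[order[k]] = (i + j) / 2 + 1
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation: Pearson correlation of the rank vectors."""
    return pearson(average_ranks(x), average_ranks(y))


# Made-up decisions from a human reviewer and an automated reviewer.
human = ["accept", "reject", "reject", "accept", "reject", "reject"]
auto = ["accept", "reject", "accept", "accept", "reject", "reject"]
kappa = cohens_kappa(human, auto)
rho = spearman([1, 2, 3, 4], [1, 4, 9, 16])  # monotone, so rho is 1
print(round(kappa, 3), rho)
```

Kappa corrects raw agreement for the agreement expected by chance, which matters when most papers are rejected and two raters can agree often simply by rejecting.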

Circularity Check

1 step flagged

Central claim of exceeding conference thresholds rests on authors' self-designed automated reviewer

specific steps
  1. fitted input called prediction [Abstract]
    "To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer."

    The headline success metric ('exceed the acceptance threshold') is not an external or pre-existing benchmark but is computed by the authors' own reviewer, which they designed, validated, and then used to judge their system's outputs. This reduces the 'prediction' of research success to performance on an internally constructed evaluator, matching the fitted-input-called-prediction pattern.

full rationale

The paper's primary result—that The AI Scientist generates papers exceeding top-ML-conference acceptance thresholds—is defined entirely by scores from an automated reviewer the authors explicitly state they 'design and validate.' This creates a load-bearing self-referential evaluation loop. While the abstract claims near-human performance, no independent external benchmark (e.g., correlation with actual conference decisions on mixed human/LLM papers) is exhibited in the provided text. Other components (idea generation, code execution, paper writing) do not reduce to this loop, so the circularity is partial and confined to the success metric. This warrants a moderate score rather than 8-10, as the framework itself is not definitionally tautological.
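
One way to break the loop described above is to score papers whose real accept/reject outcomes are known and compare thresholded reviewer scores against those outcomes. The sketch below computes balanced accuracy for a given threshold. The function name, the scores, the outcomes, and the threshold of 6 are all illustrative assumptions, not values from the paper.

```python
from typing import Sequence


def balanced_accuracy(
    scores: Sequence[float], accepted: Sequence[bool], threshold: float
) -> float:
    """Mean of true-positive and true-negative rates for `score >= threshold`.

    Assumes `accepted` contains at least one True and one False.
    """
    tp = sum(1 for s, a in zip(scores, accepted) if a and s >= threshold)
    tn = sum(1 for s, a in zip(scores, accepted) if not a and s < threshold)
    pos = sum(1 for a in accepted if a)
    neg = len(accepted) - pos
    return 0.5 * (tp / pos + tn / neg)


# Made-up reviewer scores and externally known conference outcomes.
scores = [3, 4, 5, 6, 7, 8]
accepted = [False, False, True, False, True, True]
ba = balanced_accuracy(scores, accepted, threshold=6)
print(round(ba, 3))  # 0.667
```

Balanced accuracy is used here rather than plain accuracy because acceptance is the minority outcome at most venues, so always predicting "reject" would otherwise look deceptively good.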

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 1 invented entity

The framework rests on the unproven assumption that current frontier LLMs possess sufficient capability for open-ended research tasks and introduces an internally validated reviewer whose independence from the generated content is not externally demonstrated.

free parameters (2)
  • LLM sampling parameters and model choice
    Specific temperature, top-p, and model versions used for idea generation and code writing are not detailed in the abstract but are central to reproducibility.
  • Automated reviewer acceptance threshold
    The numerical cutoff used to declare papers exceed top-conference standards is not specified.
axioms (1)
  • domain assumption: Frontier LLMs can reliably generate novel, implementable research ideas and produce correct experimental code without human intervention
    Invoked throughout the description of the AI Scientist pipeline.
invented entities (1)
  • Automated reviewer (no independent evidence)
    purpose: To score generated papers and determine acceptance without human input
    New component introduced and validated by the authors themselves.
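
The free parameters in this ledger could be pinned down in a small run configuration that makes unreported values explicit. The sketch below is hypothetical: the `RunConfig` class and its field names are introduced here, every field defaults to `None` because the text above does not state the values, and the temperature of 0.7 is an arbitrary placeholder.

```python
from dataclasses import asdict, dataclass
from typing import List, Optional


@dataclass(frozen=True)
class RunConfig:
    """Hypothetical record of the settings a reproduction would need."""
    model: Optional[str] = None               # model name and version
    temperature: Optional[float] = None       # sampling temperature
    top_p: Optional[float] = None             # nucleus sampling cutoff
    accept_threshold: Optional[float] = None  # reviewer score cutoff

    def missing(self) -> List[str]:
        """Names of fields that have not been reported."""
        return [name for name, value in asdict(self).items() if value is None]


# Placeholder value only; the ledger says the real settings are unspecified.
cfg = RunConfig(temperature=0.7)
print(cfg.missing())  # ['model', 'top_p', 'accept_threshold']
```

Listing the unset fields mirrors what the ledger is doing by hand: it separates what is claimed from what a reader would still need in order to rerun the system.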

pith-pipeline@v0.9.0 · 5617 in / 1596 out tokens · 67826 ms · 2026-05-11T04:36:48.483357+00:00 · methodology


Forward citations

Cited by 60 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot

    cs.AI 2026-04 conditional novelty 9.0

    AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.

  2. Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty

    cs.CL 2026-05 unverdicted novelty 8.0

    Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

  3. WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation

    cs.CL 2026-05 unverdicted novelty 8.0

    A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.

  4. AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery

    cs.AI 2026-04 accept novelty 8.0

    AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.

  5. The Last Human-Written Paper: Agent-Native Research Artifacts

    cs.LG 2026-04 unverdicted novelty 8.0

    Introduces ARA as a four-layer machine-executable research package and reports benchmark gains in agent QA accuracy and reproduction success.

  6. FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations

    physics.chem-ph 2026-04 conditional novelty 8.0

    FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...

  7. Co-Designing Quantum Codes with Transversal Diagonal Gates via Multi-Agent Systems

    quant-ph 2025-10 accept novelty 8.0 full

    A Lean-verified multi-agent system produces a catalogue of 14,116 quantum codes with transversal diagonal gates for small parameters, extracts infinite families, and resolves specific distance-3 cases with constructio...

  8. LLM-driven design of physics-constrained constitutive models: two agents are better than one

    cs.LG 2026-05 unverdicted novelty 7.0

    A Creator-Inspector multi-agent LLM pipeline for constitutive artificial neural networks increases the rate of models satisfying all nine physical constraints to 100% or 56% depending on the LLM backbone.

  9. Whose Good, Whose Place? The Moral Geography of Agentic AI for Social Good

    cs.CY 2026-05 unverdicted novelty 7.0

    Survey of 112 agentic AI for social good papers reveals moral-geographic asymmetry with 73% lacking geographic context (lowest for SDG 16) and only 25% reporting deployments.

  10. IdleSpec: Exploiting Idle Time via Speculative Planning for LLM Agents

    cs.AI 2026-05 conditional novelty 7.0

    IdleSpec improves LLM agent accuracy by generating and aggregating speculative plans during idle time between tool calls and observations using complementary drafting strategies.

  11. 1GC-7RC: One Graphic Card -- Seven Research Challenges! How Good Are AI Agents at Doing Your Job?

    cs.LG 2026-05 unverdicted novelty 7.0

    Introduces the 1GC-7RC benchmark to evaluate AI coding agents on seven diverse ML tasks under single-GPU time and access constraints.

  12. Test-Time Learning with an Evolving Library

    cs.LG 2026-05 unverdicted novelty 7.0

    EvoLib enables LLMs to accumulate, reuse, and evolve knowledge abstractions from inference trajectories at test time, yielding substantial gains on math reasoning, code generation, and agentic benchmarks without param...

  13. Sheaf-Theoretic Transport and Obstruction for Detecting Scientific Theory Shift in AI Agents

    cs.AI 2026-05 unverdicted novelty 7.0

    A finite sheaf-theoretic framework ranks obstruction measures to identify when an AI agent's theory must deform within its language or extend to a new one, validated on a controlled transition benchmark.

  14. Harnessing Agentic Evolution

    cs.AI 2026-05 unverdicted novelty 7.0

    AEvo introduces a meta-agent that edits the evolution procedure or agent context based on accumulated state, outperforming baselines by 26% relative improvement on agentic benchmarks and achieving SOTA on open-ended tasks.

  15. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 7.0

    AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.

  16. ASIA: an Autonomous System Identification Agent

    cs.AI 2026-05 unverdicted novelty 7.0

    ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.

  17. PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents

    cs.AI 2026-05 unverdicted novelty 7.0

    PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.

  18. Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery

    cs.AI 2026-05 unverdicted novelty 7.0

    HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...

  19. Self-Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale

    cs.LG 2026-05 conditional novelty 7.0

    Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.

  20. AI co-mathematician: Accelerating mathematicians with agentic AI

    cs.AI 2026-05 unverdicted novelty 7.0

    An interactive AI workbench for mathematicians achieves 48% on FrontierMath Tier 4 and helped solve open problems in early tests.

  21. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 conditional novelty 7.0

    AI CFD Scientist autonomously discovers a Spalart-Allmaras runtime correction reducing lower-wall Cf RMSE by 7.89% on the periodic hill at Reh=5600 while using a vision-language gate to detect 14 of 16 silent failures...

  22. AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents

    physics.flu-dyn 2026-05 conditional novelty 7.0

    AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.

  23. Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation

    cs.MA 2026-05 unverdicted novelty 7.0

    EIG represents research ideas as evolving graphs with nodes for claims and edges for relations, using a learned controller for edits and commits to produce higher-quality scientific proposals than text-only multi-agen...

  24. Agentic-imodels: Evolving agentic interpretability tools via autoresearch

    cs.AI 2026-05 unverdicted novelty 7.0

    Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

  25. Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves

    cs.SE 2026-04 unverdicted novelty 7.0

    Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.

  26. End-to-end autonomous scientific discovery on a real optical platform

    cs.AI 2026-04 unverdicted novelty 7.0

    An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.

  27. The Last Human-Written Paper: Agent-Native Research Artifacts

    cs.LG 2026-04 conditional novelty 7.0

    The authors introduce Agent-Native Research Artifacts (ARA) as executable research packages with four layers to reduce information loss in papers for AI agents, showing benchmark gains in question-answering and reproduction.

  28. Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond

    cs.AI 2026-04 unverdicted novelty 7.0

    Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...

  29. Knows: Agent-Native Structured Research Representations

    cs.AI 2026-04 conditional novelty 7.0

    Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via ...

  30. ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    ReviewGrounder decomposes review generation into rubric-guided drafting and tool-integrated grounding stages, outperforming larger baseline models on a new benchmark measuring alignment with human judgments and review...

  31. VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems

    cs.MA 2026-04 unverdicted novelty 7.0

    VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.

  32. Camyla: Scaling Autonomous Research in Medical Image Segmentation

    cs.AI 2026-04 unverdicted novelty 7.0

    Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.

  33. Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery

    cs.HC 2026-04 unverdicted novelty 7.0

    LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.

  34. $k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture

    cs.MS 2026-04 accept novelty 7.0

    k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.

  35. AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery

    cs.CL 2026-04 unverdicted novelty 7.0

    AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.

  36. FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification

    cs.AI 2026-04 conditional novelty 7.0

    FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.

  37. The Alien Space of Science: Sampling Coherent but Cognitively Unavailable Research Directions

    cs.AI 2026-03 conditional novelty 7.0

    A framework decomposes LLM papers into idea atoms, trains coherence and availability models over the resulting vocabulary, and samples atom combinations that are coherent yet unlikely under existing author communities.

  38. DiagramBank: A Large-scale Dataset of Diagram Design Exemplars with Paper Metadata for Retrieval-Augmented Generation

    cs.IR 2026-02 accept novelty 7.0

    DiagramBank is a large-scale curated dataset of 89,422 schematic diagrams from scientific papers with rich metadata to support multimodal retrieval and exemplar-driven figure generation.

  39. Kosmos: An AI Scientist for Autonomous Discovery

    cs.AI 2025-11 unverdicted novelty 7.0

    Kosmos is an AI scientist that maintains coherence over hundreds of agent steps via a shared world model, executes thousands of code lines and reads thousands of papers per run, and produces traceable reports with 79....

  40. Evalet: Evaluating Large Language Models through Functional Fragmentation

    cs.HC 2025-09 conditional novelty 7.0

    Evalet applies functional fragmentation to deliver fragment-level qualitative analysis of LLM evaluations, with a user study showing 48% more misalignment detections than holistic scoring.

  41. IDRBench: Understanding the Capability of Large Language Models on Interdisciplinary Research

    cs.CL 2025-07 unverdicted novelty 7.0

    IDRBench is presented as the first benchmark framework consisting of datasets and three evaluation tasks to measure LLMs' ability to perform interdisciplinary research.

  42. SciGA: A Comprehensive Dataset for Designing Graphical Abstracts in Academic Papers

    cs.CV 2025-07 unverdicted novelty 7.0

    Introduces the SciGA-145k dataset with intra-paper and cross-paper graphical abstract recommendation tasks plus the CAR evaluation metric.

  43. Human-LLM Compound System for Scientific Ideation through Facet Recombination and Novelty Evaluation

    cs.HC 2024-09 unverdicted novelty 7.0

    Scideator enables facet-based scientific ideation through LLM-driven extraction, human-guided recombination, analogous retrieval, and facet-grounded novelty verification, showing significantly higher creativity suppor...

  44. Automated Design of Agentic Systems

    cs.AI 2024-08 conditional novelty 7.0

    Meta Agent Search uses a meta-agent to iteratively program novel agentic systems in code, producing agents that outperform state-of-the-art hand-designed ones across coding, science, and math while transferring across...

  45. Mem-$\pi$: Adaptive Memory through Learning When and What to Generate

    cs.CL 2026-05 unverdicted novelty 6.0

    Mem-π is a framework using a dedicated model and decision-content decoupled RL to generate context-specific guidance on demand for LLM agents, outperforming retrieval baselines by over 30% on web navigation.

  46. LLM Agents Make Collective Belief Dynamics Programmable: Challenges and Research Directions

    cs.MA 2026-05 unverdicted novelty 6.0

    LLM agents make collective belief dynamics programmable, with simulations showing coordinated agents induce stable belief shifts, and four structural properties that complicate detection and defense.

  47. How Far Are We From True Auto-Research?

    cs.AI 2026-05 unverdicted novelty 6.0

    ResearchArena shows that agent-generated papers fail top-tier acceptance standards primarily due to fabricated results, underpowered experiments, and plan-execution mismatches that vary sharply by agent.

  48. STRIDE: A Self-Reflective Agent Framework for Reliable Automatic Equation Discovery

    cs.AI 2026-05 unverdicted novelty 6.0

    STRIDE is a self-reflective agent framework that improves accuracy, OOD robustness, and structural recovery in LLM-based symbolic regression by integrating generation, evaluation, repair, and diversity-preserving memory.

  49. FML-bench: A Controlled Study of AI Research Agent Strategies from the Perspective of Search Dynamics

    cs.LG 2026-05 accept novelty 6.0

    FML-Bench shows that a simple greedy hill-climber performs nearly as well as complex tree-search agents on ML research tasks, with an adaptive strategy that switches exploration modes outperforming all tested agents.

  50. ArtifactLinker: Linking Scientific Artifacts for Automatic State-of-the-Art Discovery

    cs.LG 2026-05 unverdicted novelty 6.0

    ArtifactLinker frames SOTA discovery as missing-link prediction on an artifact graph of models and datasets, with a two-stage ranking-plus-verification pipeline and a new benchmark of 14k artifacts.

  51. MLReplicate: Benchmarking Autonomous Research Systems for Machine Learning Reproducibility

    cs.LG 2026-05 conditional novelty 6.0

    MLReplicate benchmark evaluates six autonomous systems on 45 manuscripts from ICML 2025 papers, finding that automated reviews accept flawed outputs with fabricated claims while human review exposes methodological fai...

  52. OpenAaaS: An Open Agent-as-a-Service Framework for Distributed Materials-Informatics Research

    cond-mat.mtrl-sci 2026-05 unverdicted novelty 6.0

    OpenAaaS is a hierarchical agent-as-a-service system that enables secure multi-agent collaboration for materials informatics by moving code to data rather than data to code.

  53. Letting the neural code speak: Automated characterization of monkey visual neurons through human language

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Natural language descriptions generated via a closed-loop pipeline with digital twins capture the selectivity of most neurons in macaque V1 and V4, with synthesized images driving 96% of V4 neurons into the top or bot...

  54. Letting the neural code speak: Automated characterization of monkey visual neurons through human language

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.

  55. AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration - Learning from Cheap, Optimizing Expensive

    cs.AI 2026-05 unverdicted novelty 6.0

    AutoLLMResearch trains agents in a multi-fidelity LLMConfig-Gym environment formulated as a long-horizon MDP to enable cross-fidelity extrapolation for automating high-cost LLM experiment configurations.

  56. Unlocking LLM Creativity in Science through Analogical Reasoning

    cs.AI 2026-05 conditional novelty 6.0

    Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.

  57. NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation

    cs.AI 2026-05 unverdicted novelty 6.0

    NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.

  58. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

    cs.AI 2026-05 unverdicted novelty 6.0

    ComplexMCP benchmark shows top LLM agents achieve under 60% success on dynamic interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

  59. ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox

    cs.AI 2026-05 unverdicted novelty 6.0

    ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.

  60. TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning

    cs.AI 2026-05 unverdicted novelty 6.0

    TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.

Reference graph

Works this paper leans on

116 extracted references · 116 canonical work pages · cited by 127 Pith papers · 9 internal anchors

  1. [1]

    Meta-learning curiosity algorithms

    Ferran Alet, Martin F Schneider, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Meta-learning curiosity algorithms. arXiv preprint arXiv:2003.05325, 2020

  2. [2]

    Artificial intelligence in scientific writing: a friend or a foe?

    Signe Altmäe, Alberto Sola-Leyva, and Andres Salumets. Artificial intelligence in scientific writing: a friend or a foe? Reproductive BioMedicine Online, 47(1): 3--9, 2023

  3. [3]

    Model card and evaluations for claude models, 2023

    Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf

  4. [4]

    The claude 3 model family: Opus, sonnet, haiku, 2024

    Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf

  5. [5]

    Cloud labs: where robots do the research

    Carrie Arnold. Cloud labs: where robots do the research. Nature, 606(7914): 612--613, 2022

  6. [6]

    Researchagent: Iterative research idea generation over scientific literature with large language models

    Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models, 2024. URL https://arxiv.org/abs/2404.07738

  7. [7]

    Iclr2022-openreviewdata, 2024

    Federico Berto. Iclr2022-openreviewdata, 2024. URL https://github.com/fedebotu/ICLR2022-OpenReviewData

  8. [8]

    The neurips 2021 consistency experiment

    Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan. The neurips 2021 consistency experiment. Neural Information Processing Systems blog post, 2021. URL https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment

  9. [9]

    Quality-diversity through ai feedback

    Herbie Bradley, Andrew Dai, Hannah Benita Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Gregory Schott, and Joel Lehman. Quality-diversity through ai feedback. In The Twelfth International Conference on Learning Representations, 2024

  10. [10]

    Minimal criterion coevolution: a new approach to open-ended search

    Jonathan C Brant and Kenneth O Stanley. Minimal criterion coevolution: a new approach to open-ended search. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 67--74, 2017

  11. [11]

    Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...

  12. [12]

    Dendral and meta-dendral: Their applications dimension

    Bruce G Buchanan and Edward A Feigenbaum. Dendral and meta-dendral: Their applications dimension. In Readings in artificial intelligence, pages 313--322. Elsevier, 1981

  13. [13]

    Weak-to-strong generalization: Eliciting strong capabilities with weak supervision

    Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023. URL https://arxiv.org/abs/2312.09390

  14. [14]

    What is this thing called science? McGraw-Hill Education (UK), 2013

    Alan Chalmers. What is this thing called science? McGraw-Hill Education (UK), 2013

  15. [15]

    Evoprompting: Language models for code-level neural architecture search

    Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36, 2024a

  16. [16]

    Evaluating Large Language Models Trained on Code

    Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021

  17. [17]

    Symbolic discovery of optimization algorithms

    Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36, 2024b

  18. [18]

    Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence

    Jeff Clune. Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019

  19. [19]

    Marg: Multi-agent review generation for scientific papers

    Mike D'Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers, 2024. URL https://arxiv.org/abs/2401.04259

  20. [20]

    J. Dewey. How We Think. D.C. Heath & Company, 1910. ISBN 9781519501868. URL https://books.google.co.uk/books?id=WF0AAAAAMAAJ

  21. [21]

    Quality diversity through human feedback: Towards open-ended diversity-driven optimization

    Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, and Joel Lehman. Quality diversity through human feedback: Towards open-ended diversity-driven optimization. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=9zlZuAAb08

  22. [22]

    Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024

    Marius-Constantin Dinu, Claudiu Leoveanu-Condrei, Markus Holzleitner, Werner Zellinger, and Sepp Hochreiter. Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024. URL https://arxiv.org/abs/2402.00854

  23. [23]

    Art and the science of generative ai

    Ziv Epstein, Aaron Hertzmann, Investigators of Human Creativity, Memo Akten, Hany Farid, Jessica Fjeld, Morgan R Frank, Matthew Groh, Laura Herman, Neil Leach, et al. Art and the science of generative ai. Science, 380(6650): 1110--1111, 2023

  24. [24]

    Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code

    Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024. URL https://arxiv.org/abs/2405.15568

  25. [25]

    Integrating quantitative and qualitative discovery: the abacus system

    Brian C Falkenhainer and Ryszard S Michalski. Integrating quantitative and qualitative discovery: the abacus system. Machine Learning, 1: 367--401, 1986

  26. [26]

    Discovering faster matrix multiplication algorithms with reinforcement learning

    Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930): 47--53, 2022

  27. [27]

    Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity

    William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120): 1--39, 2022. URL http://jmlr.org/papers/v23/21-0998.html

  28. [28]

    Semantic scholar

    Suzanne Fricke. Semantic scholar. Journal of the Medical Library Association: JMLA, 106(1): 145, 2018

  29. [29]

    aider, 2024

    Paul Gauthier. aider, 2024. URL https://github.com/paul-gauthier/aider

  30. [30]

    Probabilistic machine learning and artificial intelligence

    Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553): 452--459, 2015

  31. [31]

    Ideas are dimes a dozen: Large language models for idea generation in innovation

    Karan Girotra, Lennart Meincke, Christian Terwiesch, and Karl T Ulrich. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071, 2023

  32. [32]

    Understanding the difficulty of training deep feedforward neural networks

    Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249--256. JMLR Workshop and Conference Proceedings, 2010

  33. [33]

    Generative adversarial nets

    Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceeding...

  34. [34]

    Gemini: A family of highly capable multimodal models, 2023

    Google DeepMind Gemini Team. Gemini: A family of highly capable multimodal models, 2023

  35. [35]

    DiffiT: Diffusion vision transformers for image generation

    Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation, 2024. URL https://arxiv.org/abs/2312.02139

  36. [36]

    Simulating 500 million years of evolution with a language model

    Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, pages 2024--07, 2024

  37. [37]

    Automl: A survey of the state-of-the-art

    Xin He, Kaiyong Zhao, and Xiaowen Chu. Automl: A survey of the state-of-the-art. Knowledge-based systems, 212: 106622, 2021

  38. [38]

    Denoising diffusion probabilistic models

    Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840--6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf

  39. [39]

    Deep Paper Gestalt

    Jia-Bin Huang. Deep paper gestalt. arXiv preprint arXiv:1812.08775, 2018

  40. [40]

    Mlagentbench: Evaluating language agents on machine learning experimentation

    Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. Mlagentbench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024

  41. [41]

    Automated machine learning: methods, systems, challenges

    Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019

  42. [42]

    The hutter prize, 2006

    Marcus Hutter. The hutter prize, 2006. URL http://prize.hutter1.net

  43. [43]

    Autonomous llm-driven research from data to human-verifiable research papers, 2024

    Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404.17605

  44. [44]

    The principles of science: A treatise on logic and scientific method

    William Stanley Jevons. The principles of science: A treatise on logic and scientific method. Macmillan and Company, 1877

  45. [45]

    Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...

  46. [46]

    SWE-bench: Can Language Models Resolve Real-World GitHub Issues?

    Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770

  47. [47]

    Highly accurate protein structure prediction with alphafold

    John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with alphafold. Nature, 596(7873): 583--589, 2021

  48. [48]

    The unreasonable effectiveness of recurrent neural networks, 2015

    Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks, 2015. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/

  49. [49]

    NanoGPT, 2022

    Andrej Karpathy. NanoGPT, 2022. URL https://github.com/karpathy/nanoGPT

  50. [50]

    A survey of research on cloud robotics and automation

    Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Transactions on automation science and engineering, 12(2): 398--409, 2015

  51. [51]

    Auto-Encoding Variational Bayes

    Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014

  52. [52]

    Improving generalization in meta reinforcement learning using learned objectives

    Louis Kirsch, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. arXiv preprint arXiv:1910.04098, 2019

  53. [53]

    Discovering attention-based genetic algorithms via meta-black-box optimization

    Robert Lange, Tom Schaul, Yutian Chen, Chris Lu, Tom Zahavy, Valentin Dalibard, and Sebastian Flennerhag. Discovering attention-based genetic algorithms via meta-black-box optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 929--937, 2023a

  54. [54]

    Discovering evolution strategies via meta-black-box optimization

    Robert Lange, Tom Schaul, Yutian Chen, Tom Zahavy, Valentin Dalibard, Chris Lu, Satinder Singh, and Sebastian Flennerhag. Discovering evolution strategies via meta-black-box optimization. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pages 29--30, 2023b

  55. [55]

    Large language models as evolution strategies

    Robert Tjarko Lange, Yingtao Tian, and Yujin Tang. Large language models as evolution strategies. arXiv preprint arXiv:2402.18381, 2024

  56. [56]

    Scientific discovery: Computational explorations of the creative processes

    Pat Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987

  57. [57]

    Integrated systems for computational scientific discovery

    Pat Langley. Integrated systems for computational scientific discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 22598--22606, 2024

  58. [58]

    Exploiting open-endedness to solve problems through the search for novelty

    Joel Lehman, Kenneth O Stanley, et al. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pages 329--336, 2008

  59. [59]

    The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities

    Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. Artificial life, 26(2): 274--306, 2020

  60. [60]

    Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O. Stanley. Evolution through large models, 2022. URL https://arxiv.org/abs/2206.08896

  61. [61]

    Evolution through large models

    Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331--366. Springer, 2023

  62. [62]

    Automated theory formation in mathematics

    Douglas B Lenat. Automated theory formation in mathematics. In IJCAI, volume 77, pages 833--842, 1977

  63. [63]

    Why am and eurisko appear to work

    Douglas B Lenat and John Seely Brown. Why am and eurisko appear to work. Artificial intelligence, 23(3): 269--294, 1984

  64. [64]

    Can large language models provide useful feedback on research papers? a large-scale empirical analysis

    Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, page AIoa2400196, 2024

  65. [65]

    Large language models as in-context ai generators for quality-diversity

    Bryan Lim, Manon Flageat, and Antoine Cully. Large language models as in-context ai generators for quality-diversity. arXiv preprint arXiv:2404.15794, 2024

  66. [66]

    The Llama 3 Herd of Models

    Llama Team. The llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783

  67. [67]

    Discovered policy optimisation

    Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35: 16455--16468, 2022a

  68. [68]

    Discovering preference optimization algorithms with and for large language models

    Chris Lu, Samuel Holt, Claudio Fanconi, Alex J Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering preference optimization algorithms with and for large language models. arXiv preprint arXiv:2406.08414, 2024a

  69. [69]

    Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, and Stephen J. Roberts. Revisiting design choices in offline model based reinforcement learning. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=zz9hXVhf40

  70. [70]

    Intelligent go-explore: Standing on the shoulders of giant foundation models

    Cong Lu, Shengran Hu, and Jeff Clune. Intelligent go-explore: Standing on the shoulders of giant foundation models, 2024b. URL https://arxiv.org/abs/2405.15143

  71. [71]

    Eureka: Human-Level Reward Design via Coding Large Language Models

    Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023

  72. [72]

    About the test data, 2011

    Matt Mahoney. About the test data, 2011. URL http://mattmahoney.net/dc/textdata.html

  73. [73]

    Discoverybench: Towards data-driven discovery with large language models

    Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models, 2024. URL https://arxiv.org/abs/2407.01725

  74. [74]

    grokking, 2022

    Daniel May. grokking, 2022. URL https://github.com/danielmamay/grokking

  75. [75]

    Scaling deep learning for materials discovery

    Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990): 80--85, 2023

  76. [76]

    Velo: Training versatile learned optimizers by scaling up

    Luke Metz, James Harrison, C Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, et al. Velo: Training versatile learned optimizers by scaling up. arXiv preprint arXiv:2211.09760, 2022

  77. [77]

    A robust approach to numeric discovery

    Bernd Nordhausen and Pat Langley. A robust approach to numeric discovery. In Machine learning proceedings 1990, pages 411--418. Elsevier, 1990

  78. [78]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  79. [79]

    Gpt-4 technical report, 2023

    OpenAI. Gpt-4 technical report, 2023

  80. [80]

    tiny-diffusion, 2023

    Tanel Pärnamaa. tiny-diffusion, 2023. URL https://github.com/tanelp/tiny-diffusion

Showing first 80 references.