Recognition: no theorem link
The AI Scientist: Towards Fully Automated Open-Ended Scientific Discovery
Pith reviewed 2026-05-11 04:36 UTC · model grok-4.3
The pith
Frontier large language models can autonomously conduct full scientific research cycles using the AI Scientist framework, producing papers that pass automated conference-level review.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The AI Scientist is the first comprehensive framework for fully automatic scientific discovery. It allows frontier large language models to generate novel research ideas, write code, execute experiments, visualize results, write full scientific papers, and run a simulated review process. This can be repeated iteratively in an open-ended way. Applied to diffusion modeling, transformer-based language modeling, and learning dynamics, it produces papers at less than $15 each. The automated reviewer achieves near-human performance, and the system generates papers that exceed the acceptance threshold at a top machine learning conference.
What carries the argument
The AI Scientist framework, which sequences LLM capabilities to cover the entire research pipeline from idea generation to self-assessment.
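The sequencing idea can be sketched as a simple staged loop. This is our minimal illustration with hypothetical stage names and a stub `llm` callable; the actual open-sourced pipeline differs in detail:

```python
# Hypothetical stage names for illustration; the real pipeline
# (github.com/SakanaAI/AI-Scientist) is more elaborate.
STAGES = ["generate_idea", "write_code", "run_experiments",
          "plot_results", "write_paper", "review_paper"]

def run_pipeline(llm, seed_topic):
    """Thread one research artifact through every stage in order;
    each stage sees the accumulated outputs of all earlier stages."""
    artifacts = {"topic": seed_topic}
    for stage in STAGES:
        artifacts[stage] = llm(stage, artifacts)  # one LLM call per stage
    return artifacts
```

In principle the final `review_paper` output can seed the next iteration, which is the open-ended loop the paper describes.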
If this is right
- Open-ended iteration of the process can mimic the human scientific community in developing ideas.
- Generated papers can meet or exceed acceptance thresholds for top machine learning conferences, per the automated reviewer.
- The framework is versatile across distinct subfields of machine learning, including diffusion modeling, language modeling, and learning dynamics.
- Full research papers can be produced at low cost, under fifteen dollars each.
Where Pith is reading between the lines
- This could enable much higher throughput in exploring new ideas within AI research if the quality holds up under human scrutiny.
- Similar systems might eventually be adapted for discovery in other scientific fields, though domain-specific tools would be needed.
- Long-term use might create feedback loops where AI builds upon its own prior discoveries without human input.
Load-bearing premise
The automated reviewer provides an accurate assessment of paper quality comparable to human experts at top conferences.
What would settle it
Submitting the AI-generated papers to a real top-tier machine learning conference and observing whether human reviewers accept or reject them.
read the original abstract
One of the grand challenges of artificial general intelligence is developing agents capable of conducting scientific research and discovering new knowledge. While frontier models have already been used as aides to human scientists, e.g. for brainstorming ideas, writing code, or prediction tasks, they still conduct only a small part of the scientific process. This paper presents the first comprehensive framework for fully automatic scientific discovery, enabling frontier large language models to perform research independently and communicate their findings. We introduce The AI Scientist, which generates novel research ideas, writes code, executes experiments, visualizes results, describes its findings by writing a full scientific paper, and then runs a simulated review process for evaluation. In principle, this process can be repeated to iteratively develop ideas in an open-ended fashion, acting like the human scientific community. We demonstrate its versatility by applying it to three distinct subfields of machine learning: diffusion modeling, transformer-based language modeling, and learning dynamics. Each idea is implemented and developed into a full paper at a cost of less than $15 per paper. To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer. This approach signifies the beginning of a new era in scientific discovery in machine learning: bringing the transformative benefits of AI agents to the entire research process of AI itself, and taking us closer to a world where endless affordable creativity and innovation can be unleashed on the world's most challenging problems. Our code is open-sourced at https://github.com/SakanaAI/AI-Scientist
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces The AI Scientist, a framework enabling frontier LLMs to autonomously generate novel research ideas, implement code and run experiments, visualize results, write full scientific papers, and evaluate them through a simulated review process. Applied to diffusion modeling, transformer language modeling, and learning dynamics, it claims to produce papers at under $15 each, with some exceeding top-ML-conference acceptance thresholds as scored by an internally designed automated reviewer that achieves near-human performance. The process is presented as repeatable for open-ended discovery, with code open-sourced.
Significance. If the central claims hold after addressing evaluation gaps, this would be a notable step toward fully automated scientific discovery in machine learning, demonstrating a closed-loop system for idea-to-paper generation at low cost and highlighting potential for iterative research. The open-sourcing of code strengthens reproducibility and invites community extensions, though the current lack of external validation limits immediate impact on the broader scientific process.
major comments (3)
- [Automated Reviewer section] The paper's core claim—that generated papers exceed conference acceptance thresholds—rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.
- [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.
- [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer and papers exceeding acceptance thresholds provides no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.
minor comments (2)
- [Figures and cost analysis] The workflow diagram and cost breakdowns would benefit from clearer labels and step-by-step explanations to improve readability for readers unfamiliar with the pipeline.
- [Methods description] Some terms (e.g., specific LLM sampling parameters) are referenced without initial definition or explicit values in the methods description.
Simulated Author's Rebuttal
We thank the referee for their thoughtful and constructive comments, which highlight important areas for improving the clarity and rigor of our evaluation. We address each major comment point by point below, indicating planned revisions to the manuscript where appropriate. Our goal is to strengthen the presentation of the automated reviewer and experimental results without altering the core contributions of the AI Scientist framework.
read point-by-point responses
-
Referee: [Automated Reviewer section] The paper's core claim—that generated papers exceed conference acceptance thresholds—rests entirely on scores from the authors' internally designed and validated automated reviewer. No quantitative details are provided on its training corpus, calibration against real conference decisions, correlation with human reviewers, or performance on a blind test set separating LLM-generated from human papers. This self-referential loop undermines the acceptance-threshold result.
Authors: We agree that the manuscript would benefit from greater transparency on the automated reviewer. The current version describes its design and validation at a high level but omits specific quantitative details. In the revision, we will expand the Automated Reviewer section to include: the composition of the training corpus (human-written papers from prior NeurIPS/ICML/ICLR proceedings), calibration details against historical acceptance rates, Pearson/Spearman correlations with human reviewer scores, and performance metrics on a held-out blind test set. We will also explicitly note that the reviewer was trained exclusively on human papers to mitigate self-reference concerns. These additions will be supported by new tables and figures. revision: yes
-
Referee: [Experimental Results (Section 5)] The reported successes in three subfields lack ablation studies on key components (e.g., idea generation vs. experiment execution), quantitative metrics on idea novelty (such as literature overlap or expert originality ratings), and error rates for code validity or experimental soundness. These omissions make it impossible to determine what drives any apparent success or whether outputs represent genuine advances.
Authors: We acknowledge the value of ablations and additional metrics for isolating contributions. The manuscript focuses on end-to-end feasibility rather than component-wise analysis, but we agree this limits interpretability. In revision, we will add: (1) basic ablation results comparing full pipeline performance against versions with simplified idea generation or execution modules; (2) quantitative novelty metrics such as n-gram overlap and citation similarity with existing literature; and (3) reported error rates for code execution failures and experimental soundness (e.g., percentage of runs that completed without runtime errors). Expert originality ratings remain resource-intensive and will be noted as a limitation with discussion of future work. These changes will appear in an expanded Section 5. revision: partial
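The n-gram overlap metric the rebuttal proposes could be sketched as follows. This is our minimal illustration under stated assumptions (word-level n-grams over lowercased text), not the authors' implementation; the function names are ours:

```python
def ngrams(text, n=3):
    """Set of lowercased word n-grams of a text."""
    words = text.lower().split()
    return {tuple(words[i:i + n]) for i in range(len(words) - n + 1)}

def ngram_overlap(candidate, corpus, n=3):
    """Fraction of the candidate's n-grams that appear anywhere in the
    corpus. Lower overlap suggests higher novelty (a crude proxy only)."""
    cand = ngrams(candidate, n)
    if not cand:
        return 0.0
    seen = set()
    for doc in corpus:
        seen |= ngrams(doc, n)
    return len(cand & seen) / len(cand)
```

In practice one would compare a generated abstract against retrieved related-work abstracts; citation similarity would need a separate retrieval step.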
-
Referee: [Abstract and Results summary] The assertion of 'near-human performance' for the automated reviewer and papers exceeding acceptance thresholds provides no supporting numbers (e.g., inter-rater agreement, threshold calibration details, or comparison to actual conference acceptance rates), leaving the central evaluation unsupported.
Authors: We will revise both the abstract and the results summary to include concrete supporting statistics. Specifically, we will report: inter-rater agreement (e.g., Cohen's kappa or correlation values) between the automated reviewer and human reviewers, the precise acceptance threshold calibrated from past conference data (e.g., average scores of accepted papers), and direct comparisons to real acceptance rates. These numbers will be added to the abstract and highlighted in the results section with references to the expanded validation details. revision: yes
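For the inter-rater agreement figure, Cohen's kappa on categorical accept/reject decisions is the standard quantity. A stdlib-only sketch (our illustration, not the paper's code):

```python
def cohens_kappa(rater_a, rater_b):
    """Agreement between two raters' categorical decisions,
    corrected for the agreement expected by chance."""
    assert len(rater_a) == len(rater_b) > 0
    n = len(rater_a)
    labels = set(rater_a) | set(rater_b)
    # Observed agreement: fraction of items both raters label identically.
    p_obs = sum(x == y for x, y in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    p_exp = sum((rater_a.count(l) / n) * (rater_b.count(l) / n)
                for l in labels)
    return 1.0 if p_exp == 1.0 else (p_obs - p_exp) / (1 - p_exp)
```

A kappa of 1 is perfect agreement and 0 is chance level; a 'near-human performance' claim would need this reported for the automated reviewer against human reviewers on a shared paper set.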
Circularity Check
Central claim of exceeding conference thresholds rests on authors' self-designed automated reviewer
specific steps
-
fitted input called prediction
[Abstract]
"To evaluate the generated papers, we design and validate an automated reviewer, which we show achieves near-human performance in evaluating paper scores. The AI Scientist can produce papers that exceed the acceptance threshold at a top machine learning conference as judged by our automated reviewer."
The headline success metric ('exceed the acceptance threshold') is not an external or pre-existing benchmark but is computed by the authors' own reviewer, which they designed, validated, and then used to judge their system's outputs. This reduces the 'prediction' of research success to performance on an internally constructed evaluator, matching the fitted-input-called-prediction pattern.
full rationale
The paper's primary result—that The AI Scientist generates papers exceeding top-ML-conference acceptance thresholds—is defined entirely by scores from an automated reviewer the authors explicitly state they 'design and validate.' This creates a load-bearing self-referential evaluation loop. While the abstract claims near-human performance, no independent external benchmark (e.g., correlation with actual conference decisions on mixed human/LLM papers) is exhibited in the provided text. Other components (idea generation, code execution, paper writing) do not reduce to this loop, so the circularity is partial and confined to the success metric. This warrants a moderate score rather than 8-10, as the framework itself is not definitionally tautological.
Axiom & Free-Parameter Ledger
free parameters (2)
- LLM sampling parameters and model choice
- Automated reviewer acceptance threshold
axioms (1)
- domain assumption: Frontier LLMs can reliably generate novel, implementable research ideas and produce correct experimental code without human intervention
invented entities (1)
-
Automated reviewer
no independent evidence
Forward citations
Cited by 60 Pith papers
-
AI-Assisted Peer Review at Scale: The AAAI-26 AI Review Pilot
AI reviews for all 22,977 AAAI-26 papers were preferred by authors and PC members over human reviews on accuracy and suggestions and outperformed baselines at spotting weaknesses.
-
Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
-
WildClawBench: A Benchmark for Real-World, Long-Horizon Agent Evaluation
A new native-runtime benchmark reveals that current frontier AI agents succeed on at most 62 percent of realistic long-horizon CLI tasks.
-
AutoResearchBench: Benchmarking AI Agents on Complex Scientific Literature Discovery
AutoResearchBench is a new benchmark showing top AI agents achieve under 10% success on complex scientific literature discovery tasks that demand deep comprehension and open-ended search.
-
FermiLink: A Unified Agent Framework for Multidomain Autonomous Scientific Simulations
FermiLink is a unified AI agent framework that automates multidomain scientific simulations via separated package knowledge bases and a four-layer progressive disclosure mechanism, reproducing 56% of target figures in...
-
AutoLLMResearch: Training Research Agents for Automating LLM Experiment Configuration -- Learning from Cheap, Optimizing Expensive
AutoLLMResearch trains agents via a multi-fidelity environment and MDP pipeline to extrapolate configuration principles from inexpensive to costly LLM experiments.
-
ASIA: an Autonomous System Identification Agent
ASIA uses an LLM-based coding agent to autonomously perform system identification, tested empirically on two benchmarks while noting limitations in transparency and reproducibility.
-
PaperFit: Vision-in-the-Loop Typesetting Optimization for Scientific Documents
PaperFit uses rendered page images in a closed loop to diagnose and repair typesetting defects in LaTeX documents, outperforming baselines on a new benchmark of 200 papers.
-
Hypothesis-Driven Deep Research with Large Language Models: A Structured Methodology for Automated Knowledge Discovery
HDRI is a six-principle eight-stage framework for hypothesis-organized LLM research featuring gap-driven iteration, traceable fact reasoning, and subject locking, realized in INFOMINER with reported gains in fact dens...
-
Self Driving Datasets: From 20 Million Papers to Nuanced Biomedical Knowledge at Scale
Starling uses LLMs and agents to turn 22.5M PubMed papers into 6.3M nuanced structured records across six tasks with 0.6-7.7% frontier-model rejection rates, lower than error rates on existing curated databases.
-
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
AI CFD Scientist autonomously finds a Spalart-Allmaras turbulence correction that lowers wall-friction error by 7.89% versus DNS on the periodic hill case using vision-language physics verification.
-
Evolving Idea Graphs with Learnable Edits-and-Commits for Multi-Agent Scientific Ideation
EIG represents research ideas as evolving graphs with nodes for claims and edges for relations, using a learned controller for edits and commits to produce higher-quality scientific proposals than text-only multi-agen...
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
Theory Under Construction: Orchestrating Language Models for Research Software Where the Specification Evolves
Comet-H orchestrates LLMs via deficit-scoring prompt selection and half-life task tracking to co-evolve research software components, demonstrated by a static analysis tool reaching F1=0.768 versus a 0.364 baseline.
-
End-to-end autonomous scientific discovery on a real optical platform
An LLM agent autonomously identifies and experimentally validates a previously unreported optical bilinear interaction on a physical platform.
-
Agentic World Modeling: Foundations, Capabilities, Laws, and Beyond
Proposes a levels x laws taxonomy for world models in AI agents, defining L1-L3 capabilities across physical, digital, social, and scientific regimes while reviewing over 400 works to outline a roadmap for advanced ag...
-
Knows: Agent-Native Structured Research Representations
Knows uses a YAML sidecar specification to provide structured, agent-consumable representations of research papers, yielding large accuracy gains for small LLMs on comprehension tasks and rapid community adoption via ...
-
ReviewGrounder: Improving Review Substantiveness with Rubric-Guided, Tool-Integrated Agents
ReviewGrounder decomposes review generation into rubric-guided drafting and tool-integrated grounding stages, outperforming larger baseline models on a new benchmark measuring alignment with human judgments and review...
-
VERITAS: Verifiable Epistemic Reasoning for Image-Derived Hypothesis Testing via Agentic Systems
VERITAS is a multi-agent system for verifiable hypothesis testing on multimodal clinical MRI datasets that achieves 81.4% verdict accuracy with frontier models and introduces an epistemic evidence labeling framework.
-
Camyla: Scaling Autonomous Research in Medical Image Segmentation
Camyla autonomously generates research proposals, experiments, and manuscripts in medical image segmentation, outperforming baselines on 24 of 31 recent datasets while producing 40 human-reviewed papers.
-
Figures as Interfaces: Toward LLM-Native Artifacts for Scientific Discovery
LLM-native figures embed provenance and enable direct LLM interaction with scientific visualizations to accelerate discovery and improve reproducibility.
-
$k$-server-bench: Automating Potential Discovery for the $k$-Server Conjecture
k-server-bench formulates potential-function discovery for the k-server conjecture as a code-based inequality-satisfaction task; current agents fully solve the resolved k=3 case and reduce violations on the open k=4 case.
-
AutoSOTA: An End-to-End Automated Research System for State-of-the-Art AI Model Discovery
AutoSOTA uses eight specialized agents to replicate and optimize models from recent AI papers, producing 105 new SOTA results in about five hours per paper on average.
-
FactReview: Evidence-Grounded Reviews with Literature Positioning and Execution-Based Claim Verification
FactReview extracts claims from ML papers, positions them via literature retrieval, and verifies them through code execution, labeling each as Supported, Partially supported, or In conflict, as shown in a CompGCN case study.
-
Letting the neural code speak: Automated characterization of monkey visual neurons through human language
Natural-language descriptions generated and verified through generative models and digital twins capture the selectivity of most neurons in macaque V1 and V4.
-
Unlocking LLM Creativity in Science through Analogical Reasoning
Analogical reasoning increases LLM solution diversity by 90-173% and novelty rate to over 50%, delivering up to 13-fold gains on biomedical tasks including perturbation prediction and cell communication.
-
NanoResearch: Co-Evolving Skills, Memory, and Policy for Personalized Research Automation
NanoResearch introduces a tri-level co-evolving framework of skills, memory, and policy to personalize LLM-powered research automation across projects and users.
-
ComplexMCP: Evaluation of LLM Agents in Dynamic, Interdependent, and Large-Scale Tool Sandbox
ComplexMCP benchmark shows current LLM agents achieve at most 60% success on interdependent tool tasks versus 90% for humans, due to tool retrieval saturation, over-confidence, and strategic defeatism.
-
TIDE-Bench: Task-Aware and Diagnostic Evaluation of Tool-Integrated Reasoning
TIDE-Bench is a new benchmark for tool-integrated reasoning that combines diverse tasks, multi-aspect metrics covering answer quality, process reliability, efficiency and cost, plus filtered challenging test sets.
-
Towards a Virtual Neuroscientist: Autonomous Neuroimaging Analysis via Multi-Agent Collaboration
NIAgent uses code-centric multi-agent collaboration and hierarchical verification to build adaptive neuroimaging pipelines that outperform static baselines on ADHD-200 and ADNI data.
-
MLS-Bench: A Holistic and Rigorous Assessment of AI Systems on Building Better AI
MLS-Bench shows that current AI agents fall short of reliably inventing generalizable ML methods, with engineering tuning easier than genuine invention.
-
CellScientist: Dual-Space Hierarchical Orchestration for Closed-Loop Refinement of Virtual Cell Models
CellScientist introduces a dual-space hierarchical orchestration system that enables closed-loop refinement of virtual cell models by routing execution discrepancies back to hypothesis or implementation updates, yield...
-
FAME: Forecasting Academic Impact via Continuous-Time Manifold Evolution
FAME models scientific topic trajectories in continuous time to forecast paper impact more accurately than LLMs by aligning manuscripts with field momentum in a dynamic latent space.
-
AI co-mathematician: Accelerating mathematicians with agentic AI
An interactive AI workbench called the AI co-mathematician supports open-ended mathematical research and achieves a new high score of 48% on FrontierMath Tier 4.
-
AI CFD Scientist: Toward Open-Ended Computational Fluid Dynamics Discovery with Physics-Aware AI Agents
An integrated AI agent framework for CFD uses vision-based physics gates to autonomously discover a Spalart-Allmaras runtime correction that cuts lower-wall skin-friction error by 7.89% versus DNS on the periodic hill...
-
Intentmaking and Sensemaking: Human Interaction with AI-Guided Mathematical Discovery
Expert mathematicians using an AI coding agent for discovery engage in repeated cycles of intentmaking to define goals and sensemaking to interpret outputs.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate identifies the critical turn in multi-turn dialogues where a response would complete hidden malicious intent, outperforming baselines on the new MTID dataset while keeping over-refusal low.
-
One Turn Too Late: Response-Aware Defense Against Hidden Malicious Intent in Multi-Turn Dialogue
TurnGate uses a new multi-turn intent dataset to detect the harm-enabling closure point in dialogues, outperforming baselines with low over-refusal and generalizing across domains.
-
BioVeil MATRIX: Uncovering and categorizing vulnerabilities of agentic biological AI scientists
Agentic biological AI systems like Biomni and K-Dense assist with dual-use tasks blocked by safeguards and gain performance uplift on WMDP proxies; BioVeil MATRIX is introduced as a 10-category taxonomy with 22 techni...
-
Intern-Atlas: A Methodological Evolution Graph as Research Infrastructure for AI Scientists
Intern-Atlas constructs a methodological evolution graph with 9.4 million edges from 1.03 million AI papers to capture how methods emerge, adapt, and transition, enabling better idea evaluation and generation for AI-d...
-
AgentEconomist: An End-to-end Agentic System Translating Economic Intuitions into Executable Computational Experiments
AgentEconomist is an end-to-end agentic system with idea development, experimental design, and execution stages that uses a large economics paper database to produce research ideas with better literature grounding, no...
-
OMEGA: Optimizing Machine Learning by Evaluating Generated Algorithms
OMEGA framework generates novel ML classifiers via meta-prompts and executable code that outperform scikit-learn baselines on 20 benchmark datasets.
-
TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment
TSAssistant is a human-in-the-loop multi-agent system that generates citable, evidence-grounded sections for target safety assessment reports by coordinating specialized subagents with interactive user refinement.
-
How Researchers Navigate Accountability, Transparency, and Trust When Using AI Tools in Early-Stage Research: A Think-Aloud Study
A think-aloud study reveals that AI tools in early research misrepresent uncertainty, obscure provenance, and create fragile trust, leading researchers to develop compensatory strategies to preserve scholarly judgment.
-
Rethinking Publication: A Certification Framework for AI-Enabled Research
A two-layer certification framework decouples knowledge validity from human authorship to accommodate AI-enabled research in existing publication systems.
-
Rethinking Publication: A Certification Framework for AI-Enabled Research
The paper introduces a certification framework that grades AI research contributions into Categories A, B, and C based on pipeline reach at submission time and adds benchmark slots for fully automated work.
-
A Scientific Human-Agent Reproduction Pipeline
SHARP is a human-AI collaboration pipeline for reproducing scientific analyses, demonstrated by recreating a jet classification task from a particle physics paper.
-
HiRAS: A Hierarchical Multi-Agent Framework for Paper-to-Code Generation and Execution
HiRAS introduces hierarchical multi-agent coordination for paper-to-code generation and experiment reproduction, claiming over 10% relative gains over prior state-of-the-art on a refined benchmark with reduced hallucination.
-
TREX: Automating LLM Fine-tuning via Agent-Driven Tree-based Exploration
TREX automates the LLM training lifecycle via collaborative agents and tree-based exploration, delivering consistent performance gains across 10 real-world fine-tuning tasks in FT-Bench.
-
Toward Autonomous Long-Horizon Engineering for ML Research
AiScientist improves ML research benchmarks by 10.54 points on PaperBench and reaches 81.82% Any Medal on MLE-Bench Lite through hierarchical control plus durable file-based state instead of conversational handoffs.
-
Pioneer Agent: Continual Improvement of Small Language Models in Production
Pioneer Agent automates the full lifecycle of adapting and continually improving small language models via diagnosis-driven data synthesis and regression-constrained retraining, delivering gains of 1.6-83.8 points on ...
-
ResearchEVO: An End-to-End Framework for Automated Scientific Discovery and Documentation
ResearchEVO automates the discover-then-explain cycle by evolving algorithms via fitness-driven LLM co-evolution and generating grounded, anti-hallucination research papers through sentence-level RAG.
-
Towards Verifiable and Self-Correcting AI Physicists for Quantum Many-Body Simulations
QMP-Bench supplies a realistic test set for AI on quantum many-body problems while PhysVEC uses integrated verifiers to turn unreliable LLM generations into code that passes both syntax and physics checks, outperformi...
-
Video models are zero-shot learners and reasoners
Generative video models exhibit emergent zero-shot capabilities across perception, manipulation, and basic reasoning tasks.
-
Read, Grep, and Synthesize: Diagnosing Cross-Domain Seed Exposure for LLM Research Ideation
LLM research ideation benefits from exposure to diverse mechanisms across domains but does not yet exploit the specific semantic reasons for cross-domain seed retrieval.
-
Toward an Engineering of Science: Rebalancing Generation and Verification in the Age of AI
AI lowers the cost of generating plausible scientific artifacts without lowering verification costs, so the paper proposes blueprints as typed graph components that decompose claims, evidence, and assumptions to enabl...
-
StraTA: Incentivizing Agentic Reinforcement Learning with Strategic Trajectory Abstraction
StraTA improves LLM agent success rates to 93.1% on ALFWorld and 84.2% on WebShop by sampling a compact initial strategy and training it jointly with action execution via hierarchical GRPO-style rollouts.
-
From Experimental Limits to Physical Insight: A Retrieval-Augmented Multi-Agent Framework for Interpreting Searches Beyond the Standard Model
HEP-CoPilot is a new multi-agent retrieval framework that retrieves, reconstructs, and compares experimental limits from HEP literature and HEPData to support interpretation of beyond-Standard-Model searches.
-
TSAssistant: A Human-in-the-Loop Agentic Framework for Automated Target Safety Assessment
TSAssistant is a modular, human-in-the-loop multi-agent system that generates citable, section-specific drafts for target safety assessment reports by coordinating specialized sub-agents with biomedical data sources a...
-
pAI/MSc: ML Theory Research with Humans on the Loop
pAI/MSc is a customizable multi-agent system that reduces human steering by orders of magnitude when turning a hypothesis into a literature-grounded, mathematically established, experimentally supported manuscript dra...
Reference graph
Works this paper leans on
-
[1]
Meta-learning curiosity algorithms
Ferran Alet, Martin F Schneider, Tomas Lozano-Perez, and Leslie Pack Kaelbling. Meta-learning curiosity algorithms. arXiv preprint arXiv:2003.05325, 2020
-
[2]
Artificial intelligence in scientific writing: a friend or a foe?
Signe Altmäe, Alberto Sola-Leyva, and Andres Salumets. Artificial intelligence in scientific writing: a friend or a foe? Reproductive BioMedicine Online, 47(1): 3-9, 2023
work page 2023
-
[3]
Model card and evaluations for claude models, 2023
Anthropic. Model card and evaluations for claude models, 2023. URL https://www-files.anthropic.com/production/images/Model-Card-Claude-2.pdf
work page 2023
-
[4]
The claude 3 model family: Opus, sonnet, haiku, 2024
Anthropic. The claude 3 model family: Opus, sonnet, haiku, 2024. URL https://www-cdn.anthropic.com/de8ba9b01c9ab7cbabf5c33b80b7bbc618857627/Model_Card_Claude_3.pdf
work page 2024
-
[5]
Cloud labs: where robots do the research
Carrie Arnold. Cloud labs: where robots do the research. Nature, 606(7914): 612-613, 2022
work page 2022
-
[6]
Researchagent: Iterative research idea generation over scientific literature with large language models
Jinheon Baek, Sujay Kumar Jauhar, Silviu Cucerzan, and Sung Ju Hwang. Researchagent: Iterative research idea generation over scientific literature with large language models, 2024. URL https://arxiv.org/abs/2404.07738
-
[7]
Iclr2022-openreviewdata
Federico Berto. Iclr2022-openreviewdata, 2024. URL https://github.com/fedebotu/ICLR2022-OpenReviewData
work page 2024
-
[8]
The NeurIPS 2021 consistency experiment
Alina Beygelzimer, Yann Dauphin, Percy Liang, and Jennifer Wortman Vaughan. The NeurIPS 2021 consistency experiment. Neural Information Processing Systems blog post, 2021. URL https://blog.neurips.cc/2021/12/08/the-neurips-2021-consistency-experiment
-
[9]
Quality-diversity through ai feedback
Herbie Bradley, Andrew Dai, Hannah Benita Teufel, Jenny Zhang, Koen Oostermeijer, Marco Bellagente, Jeff Clune, Kenneth Stanley, Gregory Schott, and Joel Lehman. Quality-diversity through ai feedback. In The Twelfth International Conference on Learning Representations, 2024
-
[10]
Minimal criterion coevolution: a new approach to open-ended search
Jonathan C Brant and Kenneth O Stanley. Minimal criterion coevolution: a new approach to open-ended search. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 67--74, 2017
-
[11]
Tom B. Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, Sandhini Agarwal, Ariel Herbert-Voss, Gretchen Krueger, Tom Henighan, Rewon Child, Aditya Ramesh, Daniel M. Ziegler, Jeffrey Wu, Clemens Winter, Christopher Hesse, Mark Chen, Eric Sigler, Mateusz Litwin...
-
[12]
Dendral and meta-dendral: Their applications dimension
Bruce G Buchanan and Edward A Feigenbaum. Dendral and meta-dendral: Their applications dimension. In Readings in artificial intelligence, pages 313--322. Elsevier, 1981
-
[13]
Collin Burns, Pavel Izmailov, Jan Hendrik Kirchner, Bowen Baker, Leo Gao, Leopold Aschenbrenner, Yining Chen, Adrien Ecoffet, Manas Joglekar, Jan Leike, Ilya Sutskever, and Jeff Wu. Weak-to-strong generalization: Eliciting strong capabilities with weak supervision, 2023. URL https://arxiv.org/abs/2312.09390
-
[14]
What is this thing called science?
Alan Chalmers. What is this thing called science? McGraw-Hill Education (UK), 2013
-
[15]
Evoprompting: Language models for code-level neural architecture search
Angelica Chen, David Dohan, and David So. Evoprompting: Language models for code-level neural architecture search. Advances in Neural Information Processing Systems, 36, 2024a
-
[16]
Evaluating Large Language Models Trained on Code
Mark Chen, Jerry Tworek, Heewoo Jun, Qiming Yuan, Henrique Ponde De Oliveira Pinto, Jared Kaplan, Harri Edwards, Yuri Burda, Nicholas Joseph, Greg Brockman, et al. Evaluating large language models trained on code. arXiv preprint arXiv:2107.03374, 2021
-
[17]
Symbolic discovery of optimization algorithms
Xiangning Chen, Chen Liang, Da Huang, Esteban Real, Kaiyuan Wang, Hieu Pham, Xuanyi Dong, Thang Luong, Cho-Jui Hsieh, Yifeng Lu, et al. Symbolic discovery of optimization algorithms. Advances in Neural Information Processing Systems, 36, 2024b
-
[18]
Jeff Clune. Ai-gas: Ai-generating algorithms, an alternate paradigm for producing general artificial intelligence. arXiv preprint arXiv:1905.10985, 2019
-
[19]
MARG: Multi-agent review generation for scientific papers
Mike D'Arcy, Tom Hope, Larry Birnbaum, and Doug Downey. Marg: Multi-agent review generation for scientific papers, 2024. URL https://arxiv.org/abs/2401.04259
-
[20]
J. Dewey. How We Think. D.C. Heath & Company, 1910. ISBN 9781519501868. URL https://books.google.co.uk/books?id=WF0AAAAAMAAJ
-
[21]
Quality diversity through human feedback: Towards open-ended diversity-driven optimization
Li Ding, Jenny Zhang, Jeff Clune, Lee Spector, and Joel Lehman. Quality diversity through human feedback: Towards open-ended diversity-driven optimization. In Forty-first International Conference on Machine Learning, 2024. URL https://openreview.net/forum?id=9zlZuAAb08
-
[22]
Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024
Marius-Constantin Dinu, Claudiu Leoveanu-Condrei, Markus Holzleitner, Werner Zellinger, and Sepp Hochreiter. Symbolicai: A framework for logic-based approaches combining generative models and solvers, 2024. URL https://arxiv.org/abs/2402.00854
-
[23]
Art and the science of generative ai
Ziv Epstein, Aaron Hertzmann, Investigators of Human Creativity, Memo Akten, Hany Farid, Jessica Fjeld, Morgan R Frank, Matthew Groh, Laura Herman, Neil Leach, et al. Art and the science of generative AI. Science, 380(6650): 1110--1111, 2023
-
[24]
Maxence Faldor, Jenny Zhang, Antoine Cully, and Jeff Clune. Omni-epic: Open-endedness via models of human notions of interestingness with environments programmed in code, 2024. URL https://arxiv.org/abs/2405.15568
-
[25]
Integrating quantitative and qualitative discovery: the abacus system
Brian C Falkenhainer and Ryszard S Michalski. Integrating quantitative and qualitative discovery: the abacus system. Machine Learning, 1: 367--401, 1986
-
[26]
Discovering faster matrix multiplication algorithms with reinforcement learning
Alhussein Fawzi, Matej Balog, Aja Huang, Thomas Hubert, Bernardino Romera-Paredes, Mohammadamin Barekatain, Alexander Novikov, Francisco J R Ruiz, Julian Schrittwieser, Grzegorz Swirszcz, et al. Discovering faster matrix multiplication algorithms with reinforcement learning. Nature, 610(7930): 47--53, 2022
-
[27]
Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity
William Fedus, Barret Zoph, and Noam Shazeer. Switch transformers: Scaling to trillion parameter models with simple and efficient sparsity. Journal of Machine Learning Research, 23(120): 1--39, 2022. URL http://jmlr.org/papers/v23/21-0998.html
-
[28]
Suzanne Fricke. Semantic Scholar. Journal of the Medical Library Association: JMLA, 106(1): 145, 2018
-
[29]
-
[30]
Probabilistic machine learning and artificial intelligence
Zoubin Ghahramani. Probabilistic machine learning and artificial intelligence. Nature, 521(7553): 452--459, 2015
-
[31]
Ideas are dimes a dozen: Large language models for idea generation in innovation
Karan Girotra, Lennart Meincke, Christian Terwiesch, and Karl T Ulrich. Ideas are dimes a dozen: Large language models for idea generation in innovation. Available at SSRN 4526071, 2023
-
[32]
Understanding the difficulty of training deep feedforward neural networks
Xavier Glorot and Yoshua Bengio. Understanding the difficulty of training deep feedforward neural networks. In Proceedings of the thirteenth international conference on artificial intelligence and statistics, pages 249--256. JMLR Workshop and Conference Proceedings, 2010
-
[33]
Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. Lawrence, and K.Q. Weinberger, editors, Advances in Neural Information Processing Systems, volume 27. Curran Associates, Inc., 2014. URL https://proceeding...
-
[34]
Gemini: A family of highly capable multimodal models, 2023
Google DeepMind Gemini Team. Gemini: A family of highly capable multimodal models, 2023
-
[35]
DiffiT: Diffusion vision transformers for image generation
Ali Hatamizadeh, Jiaming Song, Guilin Liu, Jan Kautz, and Arash Vahdat. Diffit: Diffusion vision transformers for image generation, 2024. URL https://arxiv.org/abs/2312.02139
-
[36]
Simulating 500 million years of evolution with a language model
Tomas Hayes, Roshan Rao, Halil Akin, Nicholas J Sofroniew, Deniz Oktay, Zeming Lin, Robert Verkuil, Vincent Q Tran, Jonathan Deaton, Marius Wiggert, et al. Simulating 500 million years of evolution with a language model. bioRxiv, 2024
-
[37]
AutoML: A survey of the state-of-the-art
Xin He, Kaiyong Zhao, and Xiaowen Chu. AutoML: A survey of the state-of-the-art. Knowledge-Based Systems, 212: 106622, 2021
-
[38]
Denoising diffusion probabilistic models
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. In H. Larochelle, M. Ranzato, R. Hadsell, M.F. Balcan, and H. Lin, editors, Advances in Neural Information Processing Systems, volume 33, pages 6840--6851. Curran Associates, Inc., 2020. URL https://proceedings.neurips.cc/paper/2020/file/4c5bcfec8584af0d967f1ab10179ca4b-Paper.pdf
-
[39]
Jia-Bin Huang. Deep paper gestalt. arXiv preprint arXiv:1812.08775, 2018
-
[40]
MLAgentBench: Evaluating language agents on machine learning experimentation
Qian Huang, Jian Vora, Percy Liang, and Jure Leskovec. MLAgentBench: Evaluating language agents on machine learning experimentation. In Forty-first International Conference on Machine Learning, 2024
-
[41]
Automated machine learning: methods, systems, challenges
Frank Hutter, Lars Kotthoff, and Joaquin Vanschoren. Automated machine learning: methods, systems, challenges. Springer Nature, 2019
-
[42]
Marcus Hutter. The Hutter Prize, 2006. URL http://prize.hutter1.net
-
[43]
Autonomous llm-driven research from data to human-verifiable research papers, 2024
Tal Ifargan, Lukas Hafner, Maor Kern, Ori Alcalay, and Roy Kishony. Autonomous llm-driven research from data to human-verifiable research papers, 2024. URL https://arxiv.org/abs/2404.17605
-
[44]
The principles of science: A treatise on logic and scientific method
William Stanley Jevons. The principles of science: A treatise on logic and scientific method. Macmillan and Company, 1877
-
[45]
Albert Q. Jiang, Alexandre Sablayrolles, Antoine Roux, Arthur Mensch, Blanche Savary, Chris Bamford, Devendra Singh Chaplot, Diego de las Casas, Emma Bou Hanna, Florian Bressand, Gianna Lengyel, Guillaume Bour, Guillaume Lample, Lélio Renard Lavaud, Lucile Saulnier, Marie-Anne Lachaux, Pierre Stock, Sandeep Subramanian, Sophia Yang, Szymon Antoniak, Teven...
-
[46]
SWE-bench: Can Language Models Resolve Real-World GitHub Issues?
Carlos E. Jimenez, John Yang, Alexander Wettig, Shunyu Yao, Kexin Pei, Ofir Press, and Karthik Narasimhan. Swe-bench: Can language models resolve real-world github issues?, 2024. URL https://arxiv.org/abs/2310.06770
-
[47]
Highly accurate protein structure prediction with AlphaFold
John Jumper, Richard Evans, Alexander Pritzel, Tim Green, Michael Figurnov, Olaf Ronneberger, Kathryn Tunyasuvunakool, Russ Bates, Augustin Žídek, Anna Potapenko, et al. Highly accurate protein structure prediction with AlphaFold. Nature, 596(7873): 583--589, 2021
-
[48]
The unreasonable effectiveness of recurrent neural networks, 2015
Andrej Karpathy. The unreasonable effectiveness of recurrent neural networks, 2015. URL https://karpathy.github.io/2015/05/21/rnn-effectiveness/
-
[49]
Andrej Karpathy. NanoGPT, 2022. URL https://github.com/karpathy/nanoGPT
-
[50]
A survey of research on cloud robotics and automation
Ben Kehoe, Sachin Patil, Pieter Abbeel, and Ken Goldberg. A survey of research on cloud robotics and automation. IEEE Transactions on Automation Science and Engineering, 12(2): 398--409, 2015
-
[51]
Diederik P. Kingma and Max Welling. Auto-Encoding Variational Bayes. In 2nd International Conference on Learning Representations, ICLR 2014, Banff, AB, Canada, April 14-16, 2014, Conference Track Proceedings, 2014
-
[52]
Improving generalization in meta reinforcement learning using learned objectives
Louis Kirsch, Sjoerd van Steenkiste, and Jürgen Schmidhuber. Improving generalization in meta reinforcement learning using learned objectives. arXiv preprint arXiv:1910.04098, 2019
-
[53]
Discovering attention-based genetic algorithms via meta-black-box optimization
Robert Lange, Tom Schaul, Yutian Chen, Chris Lu, Tom Zahavy, Valentin Dalibard, and Sebastian Flennerhag. Discovering attention-based genetic algorithms via meta-black-box optimization. In Proceedings of the Genetic and Evolutionary Computation Conference, pages 929--937, 2023a
-
[54]
Discovering evolution strategies via meta-black-box optimization
Robert Lange, Tom Schaul, Yutian Chen, Tom Zahavy, Valentin Dalibard, Chris Lu, Satinder Singh, and Sebastian Flennerhag. Discovering evolution strategies via meta-black-box optimization. In Proceedings of the Companion Conference on Genetic and Evolutionary Computation, pages 29--30, 2023b
-
[55]
Large language models as evolution strategies
Robert Tjarko Lange, Yingtao Tian, and Yujin Tang. Large language models as evolution strategies. arXiv preprint arXiv:2402.18381, 2024
-
[56]
Scientific discovery: Computational explorations of the creative processes
Pat Langley. Scientific discovery: Computational explorations of the creative processes. MIT press, 1987
-
[57]
Integrated systems for computational scientific discovery
Pat Langley. Integrated systems for computational scientific discovery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 22598--22606, 2024
-
[58]
Exploiting open-endedness to solve problems through the search for novelty
Joel Lehman, Kenneth O Stanley, et al. Exploiting open-endedness to solve problems through the search for novelty. In ALIFE, pages 329--336, 2008
-
[59]
Joel Lehman, Jeff Clune, Dusan Misevic, Christoph Adami, Lee Altenberg, Julie Beaulieu, Peter J Bentley, Samuel Bernard, Guillaume Beslon, David M Bryson, et al. The surprising creativity of digital evolution: A collection of anecdotes from the evolutionary computation and artificial life research communities. Artificial Life, 26(2): 274--306, 2020
-
[60]
-
[61]
Evolution through large models
Joel Lehman, Jonathan Gordon, Shawn Jain, Kamal Ndousse, Cathy Yeh, and Kenneth O Stanley. Evolution through large models. In Handbook of Evolutionary Machine Learning, pages 331--366. Springer, 2023
-
[62]
Automated theory formation in mathematics
Douglas B Lenat. Automated theory formation in mathematics. In IJCAI, volume 77, pages 833--842, 1977
-
[63]
Why AM and EURISKO appear to work
Douglas B Lenat and John Seely Brown. Why AM and EURISKO appear to work. Artificial Intelligence, 23(3): 269--294, 1984
-
[64]
Weixin Liang, Yuhui Zhang, Hancheng Cao, Binglu Wang, Daisy Yi Ding, Xinyu Yang, Kailas Vodrahalli, Siyu He, Daniel Scott Smith, Yian Yin, et al. Can large language models provide useful feedback on research papers? a large-scale empirical analysis. NEJM AI, page AIoa2400196, 2024
-
[65]
Large language models as in-context ai generators for quality-diversity
Bryan Lim, Manon Flageat, and Antoine Cully. Large language models as in-context ai generators for quality-diversity. arXiv preprint arXiv:2404.15794, 2024
-
[66]
Llama Team. The Llama 3 herd of models, 2024. URL https://arxiv.org/abs/2407.21783
-
[67]
Discovered policy optimisation
Chris Lu, Jakub Kuba, Alistair Letcher, Luke Metz, Christian Schroeder de Witt, and Jakob Foerster. Discovered policy optimisation. Advances in Neural Information Processing Systems, 35: 16455--16468, 2022a
-
[68]
Discovering preference optimization algorithms with and for large language models
Chris Lu, Samuel Holt, Claudio Fanconi, Alex J Chan, Jakob Foerster, Mihaela van der Schaar, and Robert Tjarko Lange. Discovering preference optimization algorithms with and for large language models. arXiv preprint arXiv:2406.08414, 2024a
-
[69]
Cong Lu, Philip Ball, Jack Parker-Holder, Michael Osborne, and Stephen J. Roberts. Revisiting design choices in offline model based reinforcement learning. In International Conference on Learning Representations, 2022b. URL https://openreview.net/forum?id=zz9hXVhf40
-
[70]
Intelligent Go-Explore: Standing on the shoulders of giant foundation models
Cong Lu, Shengran Hu, and Jeff Clune. Intelligent Go-Explore: Standing on the shoulders of giant foundation models, 2024b. URL https://arxiv.org/abs/2405.15143
-
[71]
Eureka: Human-level reward design via coding large language models
Yecheng Jason Ma, William Liang, Guanzhi Wang, De-An Huang, Osbert Bastani, Dinesh Jayaraman, Yuke Zhu, Linxi Fan, and Anima Anandkumar. Eureka: Human-level reward design via coding large language models. arXiv preprint arXiv:2310.12931, 2023
-
[72]
Matt Mahoney. About the test data, 2011. URL http://mattmahoney.net/dc/textdata.html
-
[73]
Discoverybench: Towards data-driven discovery with large language models, 2024
Bodhisattwa Prasad Majumder, Harshit Surana, Dhruv Agarwal, Bhavana Dalvi Mishra, Abhijeetsingh Meena, Aryan Prakhar, Tirth Vora, Tushar Khot, Ashish Sabharwal, and Peter Clark. Discoverybench: Towards data-driven discovery with large language models, 2024. URL https://arxiv.org/abs/2407.01725
-
[74]
Daniel May. grokking, 2022. URL https://github.com/danielmamay/grokking
-
[75]
Scaling deep learning for materials discovery
Amil Merchant, Simon Batzner, Samuel S Schoenholz, Muratahan Aykol, Gowoon Cheon, and Ekin Dogus Cubuk. Scaling deep learning for materials discovery. Nature, 624(7990): 80--85, 2023
-
[76]
VeLO: Training versatile learned optimizers by scaling up
Luke Metz, James Harrison, C Daniel Freeman, Amil Merchant, Lucas Beyer, James Bradbury, Naman Agrawal, Ben Poole, Igor Mordatch, Adam Roberts, et al. VeLO: Training versatile learned optimizers by scaling up. arXiv preprint arXiv:2211.09760, 2022
-
[77]
A robust approach to numeric discovery
Bernd Nordhausen and Pat Langley. A robust approach to numeric discovery. In Machine learning proceedings 1990, pages 411--418. Elsevier, 1990
-
[78]
In-context Learning and Induction Heads
Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022
-
[79]
-
[80]
Tanel Pärnamaa. tiny-diffusion, 2023. URL https://github.com/tanelp/tiny-diffusion