Beyond AI as Assistants: Toward Autonomous Discovery in Cosmology

Licong Xu; Thomas Borrett

arXiv: 2605.14791 · v2 · pith:Y7FRX3WOnew · submitted 2026-05-14 · 🌌 astro-ph.IM · astro-ph.CO · cs.AI

Pith reviewed 2026-06-30 20:21 UTC · model grok-4.3

classification 🌌 astro-ph.IM · astro-ph.CO · cs.AI

keywords cosmology · AI agents · autonomous discovery · weak lensing · CMB data analysis · code evolution · multi-agent systems · ACT DR6

The pith

Cosmology supplies both quantitative benchmarks and open research problems that can advance AI systems toward autonomous discovery.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces two agentic AI systems built for cosmological work. CMBEvolve applies LLM-guided code evolution and tree search to tasks that have explicit numerical targets, such as spotting out-of-distribution features in weak-lensing maps. CosmoEvolve runs a virtual multi-agent laboratory for open-ended analysis, shown on ACT DR6 data where the agents surface non-trivial pair- and scale-dependent signals and generate usable diagnostics. Together the examples position cosmology as a source of both tightly controlled tests and realistic, high-stakes problems for training AI scientists.

Core claim

CMBEvolve and CosmoEvolve demonstrate that LLM-driven code evolution combined with multi-agent collaboration can produce scientifically usable outputs on cosmological data without direct human coding at each step.

What carries the argument

CMBEvolve, which evolves code iteratively via LLM prompts and tree search to optimize explicit benchmark scores, paired with CosmoEvolve, a multi-agent virtual laboratory that manages open-ended research workflows.
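As a rough illustration of what code evolution with tree search against an explicit benchmark score can look like, the Python sketch below runs a best-first search over candidate solutions. It is not the CMBEvolve implementation, whose details are not given on this page. Both `propose_children` (a stand-in for the LLM call, which here perturbs numbers instead of rewriting code) and `benchmark_score` (a toy objective) are hypothetical names introduced only for this example.

```python
import heapq
import random
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

Candidate = Tuple[float, ...]


@dataclass(order=True)
class Node:
    # heapq is a min-heap, so the negated score is stored to pop the best node first.
    neg_score: float
    candidate: Candidate = field(compare=False)
    depth: int = field(default=0, compare=False)


def benchmark_score(candidate: Candidate) -> float:
    """Toy benchmark (assumption): higher is better, maximum 0 at (1.0, -2.0)."""
    x, y = candidate
    return -((x - 1.0) ** 2 + (y + 2.0) ** 2)


def propose_children(parent: Candidate, rng: random.Random,
                     n_children: int = 4, step: float = 0.5) -> List[Candidate]:
    """Stand-in for the LLM proposal step: return perturbed copies of the parent."""
    return [tuple(v + rng.gauss(0.0, step) for v in parent)
            for _ in range(n_children)]


def evolve(root: Candidate, score_fn: Callable[[Candidate], float],
           budget: int = 50, seed: int = 0) -> Tuple[Node, List[float]]:
    """Best-first tree search: always expand the highest-scoring unexpanded node."""
    rng = random.Random(seed)
    root_node = Node(-score_fn(root), root, 0)
    frontier = [root_node]
    best = root_node
    history = [-best.neg_score]  # best score seen after each expansion
    for _ in range(budget):
        if not frontier:
            break
        parent = heapq.heappop(frontier)
        for child in propose_children(parent.candidate, rng):
            node = Node(-score_fn(child), child, parent.depth + 1)
            heapq.heappush(frontier, node)
            if node.neg_score < best.neg_score:
                best = node
        history.append(-best.neg_score)
    return best, history


best, history = evolve((0.0, 0.0), benchmark_score)
print(f"root score {history[0]:.3f}, best score {history[-1]:.3f}, depth {best.depth}")
```

The search only ever sees the benchmark score, which is why this style of system suits tasks with explicit numerical targets: the recorded `history` can never decrease, but nothing in the loop checks whether an improvement is scientifically meaningful.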

If this is right

AI agents can raise performance on controlled out-of-distribution detection tasks in weak-lensing maps through repeated code refinement.
Multi-agent laboratories can extract non-trivial scale- and pair-dependent signals from ACT DR6 data and produce analysis-ready diagnostics.
Cosmology datasets supply both closed-ended quantitative benchmarks and realistic open-ended workflows for testing AI scientist prototypes.
Iterative LLM-guided evolution offers a route to reduce direct human intervention in routine cosmological data processing.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same agent architectures could be applied to other data-rich fields such as particle physics or genomics where quantitative benchmarks coexist with open discovery questions.
If the autonomy assumption holds, these systems could eventually run on live survey data streams and flag anomalies faster than human-led pipelines.
Success would shift research effort from writing analysis code toward defining scientific questions and validating agent-generated results.

Load-bearing premise

The demonstrations assume that the code and analyses generated by iterative evolution and agent collaboration reach publication-grade quality with little or no unstated human guidance or post-editing.

What would settle it

The autonomy claim would be falsified if independent inspection of the final code and diagnostic plots from either system showed that they match published results only after substantial human rewriting or parameter tuning.

Original abstract

Recent advances in artificial intelligence (AI) agents are pushing AI beyond tools toward autonomous scientific discovery. We discuss two complementary agentic systems for cosmology: CMBEvolve, which targets tasks with explicit quantitative objectives through LLM-guided code evolution and tree search, and CosmoEvolve, which targets open-ended scientific workflows through a virtual multi-agent research laboratory. As preliminary demonstrations, we apply CMBEvolve to out-of-distribution detection in weak-lensing maps, where it iteratively improves the benchmark score through code evolution, and CosmoEvolve to autonomous ACT DR6 data analysis, where it identifies non-trivial pair- and scale-dependent behaviour and produces analysis-grade diagnostics. These examples show how cosmology can provide both controlled benchmark tasks and realistic open-ended research problems for the development of AI scientist systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper names two LLM-agent frameworks for cosmology tasks, but its autonomy claims rest on high-level descriptions, with no evidence on human oversight or output validation.

The core point is that the work applies existing LLM techniques like code evolution and multi-agent setups to cosmology, but the demonstrations stay at the level of claims without the numbers or protocols needed to judge real autonomy.

What stands out as new is the explicit split between benchmark-style tasks (CMBEvolve on weak-lensing OOD detection) and open-ended analysis (CosmoEvolve on ACT DR6 data). The paper notes that cosmology supplies both controlled scores and realistic research problems, which is a fair observation and gives the framing some structure.

The demonstrations are described as iterative code improvement and identification of pair- and scale-dependent behavior, with outputs called analysis-grade. That framing is reasonable on paper. The limitation is that nothing quantifies how much prompt engineering, code checking, or post-processing was required, and there are no error bars, success rates, or comparison baselines. Without those, the autonomy part cannot be assessed from the text.

The paper does not report new cosmological measurements or derivations, so its value sits in the AI-for-science discussion rather than in the cosmology results themselves. The citation pattern is light because the contribution is mostly architectural naming and high-level application.

This is for readers already working on agentic AI systems who want a cosmology test domain. A serious referee could usefully press on the missing validation details and ask whether the workflows reach usable results with minimal human steering. I would send it to review on that basis rather than desk reject, even though the current version leaves the central autonomy claim unevaluable.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes two agentic AI systems for cosmology: CMBEvolve, which applies LLM-guided code evolution and tree search to tasks with explicit quantitative objectives (demonstrated on out-of-distribution detection in weak-lensing maps), and CosmoEvolve, a virtual multi-agent research laboratory for open-ended workflows (demonstrated on autonomous analysis of ACT DR6 data that identifies non-trivial pair- and scale-dependent behavior). The central claim is that these examples position cosmology as a source of both controlled benchmarks and realistic open-ended problems for developing autonomous AI scientist systems.

Significance. If the demonstrations can be shown to reach analysis-grade outputs with quantified autonomy (i.e., minimal unstated human guidance), the work would be significant for establishing cosmology as a testbed for AI discovery systems, supplying both quantitative benchmarks and open research problems. The complementary framing of controlled versus open-ended tasks is a constructive contribution to the AI-for-science literature.

major comments (2)

[Abstract] The claim that CosmoEvolve 'produces analysis-grade diagnostics' is load-bearing for the autonomy argument yet is presented without any reported validation protocols, comparison to standard human-led pipelines, error bars on the identified behaviors, or metrics quantifying human oversight in the multi-agent workflow.
[Abstract] The statement that CMBEvolve 'iteratively improves the benchmark score through code evolution' lacks autonomy metrics, baseline comparisons, or details on how objectives, prompts, and post-processing steps are defined, making it impossible to evaluate whether the outputs constitute genuine autonomous discovery rather than guided iteration.

minor comments (2)

The manuscript would benefit from an explicit definition of 'autonomy' and 'analysis-grade' early in the text, together with a table summarizing the human-defined versus system-generated components of each demonstration.
References to prior work on LLM code evolution and multi-agent systems should be expanded to clarify the incremental contribution of the cosmology-specific framing.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which highlight the need to strengthen the abstract's support for claims of autonomy. We respond to each major comment below and indicate planned revisions.

Referee: [Abstract] The claim that CosmoEvolve 'produces analysis-grade diagnostics' is load-bearing for the autonomy argument yet is presented without any reported validation protocols, comparison to standard human-led pipelines, error bars on the identified behaviors, or metrics quantifying human oversight in the multi-agent workflow.

Authors: We agree that the abstract phrasing is strong and that the manuscript does not report formal validation protocols, direct comparisons to human-led pipelines, error bars on behaviors, or quantitative metrics of human oversight. The body of the paper (Sections 3-5) describes the multi-agent workflow, agent roles, and the specific diagnostics generated for the ACT DR6 case, which were reviewed for scientific plausibility. We will revise the abstract to qualify the claim (e.g., replacing 'produces analysis-grade diagnostics' with 'generates diagnostics consistent with analysis-grade standards in the demonstrated workflow') and add a sentence noting the preliminary nature of the autonomy assessment. This is a partial revision, as we cannot add new comparative experiments at this stage.
revision: partial

Referee: [Abstract] The statement that CMBEvolve 'iteratively improves the benchmark score through code evolution' lacks autonomy metrics, baseline comparisons, or details on how objectives, prompts, and post-processing steps are defined, making it impossible to evaluate whether the outputs constitute genuine autonomous discovery rather than guided iteration.

Authors: We acknowledge that the abstract omits these supporting details. Section 2 of the manuscript specifies the quantitative objective (benchmark score), the tree-search mechanism, the LLM prompt templates for code evolution, and post-processing steps; results include comparisons to non-evolutionary baselines. We will revise the abstract to include a brief reference to these elements and to the level of autonomy achieved. This is a partial revision, as full quantitative autonomy metrics (e.g., fraction of steps requiring human intervention) were not systematically recorded in the original experiments.
revision: partial
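
The autonomy metric mentioned in this exchange, the fraction of steps requiring human intervention, is simple to compute once each workflow step is logged with a flag. The Python sketch below shows one possible bookkeeping scheme. The `Step` record, its field names, and the example log are all hypothetical: the paper reports no such log, and the entries are invented purely to make the arithmetic concrete.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class Step:
    name: str               # short description of the workflow step
    human_intervened: bool  # True if a person edited, reran, or overrode the step


def intervention_fraction(steps: Iterable[Step]) -> float:
    """Fraction of logged steps in which a human intervened."""
    steps = list(steps)
    if not steps:
        raise ValueError("cannot compute a fraction from an empty log")
    return sum(s.human_intervened for s in steps) / len(steps)


# Invented example log (assumption): 2 of the 5 steps involve a human.
log = [
    Step("define objective", True),
    Step("generate candidate code", False),
    Step("run benchmark", False),
    Step("fix failing run", True),
    Step("produce diagnostic plot", False),
]

frac = intervention_fraction(log)
print(f"human intervention in {frac:.0%} of steps")
```

A single fraction hides which steps were steered, so a per-step table of human-defined versus system-generated components, as the referee's first minor comment suggests, would carry more information than the summary number alone.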

Circularity Check

0 steps flagged

No derivations, equations, or fitted quantities; claims are descriptive demonstrations without circular reduction.

The paper contains no equations, derivations, or quantitative modeling steps. It describes two agentic systems (CMBEvolve and CosmoEvolve) and their application to cosmology tasks as preliminary demonstrations. No load-bearing claim reduces by construction to its own inputs, self-citation chains, or fitted parameters renamed as predictions. The absence of any mathematical structure means none of the enumerated circularity patterns (self-definitional, fitted-input-called-prediction, etc.) can be exhibited via direct quotation and reduction. This is the expected honest non-finding for a conceptual/demonstrative manuscript.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 2 invented entities

The claims rest on untested assumptions about LLM effectiveness in guiding scientific code and multi-agent autonomy, with the two named systems introduced as new constructs without external validation.

axioms (2)

domain assumption: Large language models can reliably guide code evolution toward improved scientific task performance.
Invoked in the description of CMBEvolve and its weak-lensing demonstration.

domain assumption: Virtual multi-agent laboratories can autonomously handle open-ended data analysis workflows.
Invoked in the description of CosmoEvolve and its ACT DR6 demonstration.

invented entities (2)

CMBEvolve (no independent evidence)
purpose: LLM-guided code evolution and tree search for quantitative cosmology tasks
Newly named system presented as targeting explicit objectives

CosmoEvolve (no independent evidence)
purpose: Multi-agent virtual laboratory for open-ended cosmology workflows
Newly named system presented as targeting realistic research problems

pith-pipeline@v0.9.1-grok · 5671 in / 1497 out tokens · 38266 ms · 2026-06-30T20:21:13.199219+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Workflow Closure Is Not Scientific Closure in Auto-Research Systems
cs.SE · 2026-05 · unverdicted · novelty 5.0

Survey of auto-research systems identifies objective, validation, and acceptance collapses, concluding that workflow closure does not equal scientific closure and advocating non-autonomous epistemic control.

Reference graph

Works this paper leans on

8 extracted references · 1 canonical work page · cited by 1 Pith paper

[1] Andrew Laverick, Kristen Surrao, Inigo Zubeldia, et al. Multi-agent system for cosmological parameter analysis, 2024.

[2] Licong Xu, Milind Sarkar, Anto I. Lonappan, et al. Open source planning & control system with language agents for autonomous scientific discovery, 2025.

[3] Francisco Villaescusa-Navarro, Boris Bolliet, Pablo Villanueva-Domingo, et al. The Denario project: Deep knowledge AI agents for scientific discovery, 2025.

[4] Thomas Borrett, Licong Xu, Andy Nilipour, et al. Competing with AI scientists: Agent-driven approach to astrophysics research, 2026.

[5] Alexander Novikov et al. AlphaEvolve: A coding agent for scientific and algorithmic discovery, 2025.

[6] Biwei Dai et al. FAIR Universe Weak Lensing ML Uncertainty Challenge: Handling Uncertainties and Distribution Shifts for Precision Cosmology. April 2026.

[7] CosmoEvolve Virtual Lab. Validation of released ACT DR6 temperature products with beam-aware split-cross pseudo-Cℓ tests. 2026.

[8] CosmoEvolve Virtual Lab. Cross-frequency temperature coherence of ACT DR6 maps: Pair-specific diagnostics and scale-cut recommendations for multi-frequency analyses. 2026. d https://lambda.gsfc.nasa.gov/product/act/act_dr6.02/ e https://parallel-review-689836870161.us-central1.run.app/forum?id=2604.00012-R1 f https://parallelscience.org g https://papers.par...
