pith. machine review for the scientific record.

arxiv: 2604.15664 · v2 · submitted 2026-04-17 · 💻 cs.LG · cs.AI

Recognition: no theorem link

Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints

Bernhard Sch\"olkopf, Kristen Menou, Terry Jingchen Zhang, Xinge Liu, Zhijing Jin

Pith reviewed 2026-05-12 03:00 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords AI agents · model fitting · radial velocity · exoplanets · benchmark environment · physical constraints · astrophysics · scientific discovery

The pith

AI agents achieve statistical fits to radial velocity data but often fail to recover the true physical parameters of planetary systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents Stargazer as a new benchmark with 120 tasks of varying difficulty for testing AI agents on iterative model fitting to astrophysical radial-velocity time series. Evaluation of eight frontier agents shows they frequently reach numerically good fits without identifying the correct physical quantities such as planet masses and orbital elements. The limitation holds even when agents receive basic tools or extra computation steps, and extra tokens often indicate looping rather than progress. A reader would care because the benchmark directly tests whether current AI can perform the constraint-aware reasoning required for real scientific data analysis rather than just curve fitting.

Core claim

Stargazer demonstrates a clear separation between statistical optimization and physical fidelity: agents produce models that match the observed radial-velocity curves adequately in a numerical sense yet deviate from the actual system parameters. The mismatch appears everywhere from high-SNR single-planet cases to complex low-SNR multi-planet configurations, and it persists after the addition of standard agent skills or increased test-time compute.

What carries the argument

The Stargazer environment, a collection of 120 dynamic tasks (including 20 real archival radial-velocity datasets) that supplies iterative feedback on both statistical fit quality and adherence to physical constraints.
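
As a rough illustration of that feedback design, the sketch below grades a submission on the four criteria named in Figure 2 (ΔBIC, RMS, Match, Count). The circular-orbit stand-in, the toy match formula, and all function names are assumptions for illustration, not the released API:

```python
import numpy as np

# Toy sketch of Stargazer-style per-criterion feedback. The evaluator
# forward-models a submission and grades statistical (delta-BIC, RMS)
# and physical (Match, Count) criteria. Everything here is illustrative.

def rv_model(planets, t):
    """Circular-orbit stand-in for the full Keplerian forward model."""
    rv = np.zeros_like(t)
    for P, K, phase in planets:  # period [d], semi-amplitude [m/s], phase [rad]
        rv += K * np.sin(2 * np.pi * t / P + phase)
    return rv

def match_score(submission, truth):
    """Toy match: fractional agreement in (P, K), paired by sorted period."""
    pairs = zip(sorted(submission), sorted(truth))
    s = [1 - min(1.0, abs(a - b) / abs(b))
         for sub, ref in pairs for a, b in zip(sub[:2], ref[:2])]
    return float(np.mean(s)) if s else 0.0

def grade(submission, truth, t, y, sigma):
    """Per-criterion feedback the agent iterates on."""
    resid = y - rv_model(submission, t)
    n, k = len(t), 3 * len(submission)
    chi2 = float(np.sum((resid / sigma) ** 2))
    return {
        "rms": float(np.sqrt(np.mean(resid ** 2))),
        # BIC improvement over a zero-planet null (Gaussian noise; additive
        # constants shared by both models cancel in the difference).
        "delta_bic": float(np.sum((y / sigma) ** 2)) - (chi2 + k * np.log(n)),
        "count": len(submission) == len(truth),
        "match": match_score(submission, truth),
    }
```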

If this is right

  • Increasing test-time compute produces only marginal improvement in recovering physical parameters.
  • High token consumption frequently corresponds to unproductive recursive failure loops rather than useful exploration.
  • Stargazer supplies a concrete setting in which to train, evaluate, and scale new agent strategies for physically constrained model fitting.
  • The same simulation-driven design approach can be applied to model-fitting problems in other scientific fields.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • Current agent designs may lack built-in mechanisms for enforcing domain physical priors during iterative search.
  • Benchmarks that reward only statistical metrics risk overestimating readiness for scientific applications.
  • Hybrid systems that combine neural reasoning with explicit physical simulators could be tested directly on these tasks.

Load-bearing premise

The selected tasks and the performance of eight current agents capture the essential difficulties that would appear in genuine astrophysical model-fitting work.

What would settle it

A clear falsifier would be an agent that recovers the correct physical parameters (within observational uncertainties) on the majority of the 20 real archival cases while keeping token usage low and avoiding recursive loops.
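
A hypothetical operationalization of this falsifier, assuming per-case records of recovery, token use, and loop flags; the 8,000-token figure borrows the loop threshold quoted later in the simulated rebuttal, since "low token usage" is not quantified here:

```python
import numpy as np

def recovered_within_uncertainty(est, ref, ref_err, n_sigma=1.0):
    """True if every estimated parameter lies within n_sigma of the
    published reference value (arrays aligned parameter-by-parameter)."""
    est, ref, ref_err = map(np.asarray, (est, ref, ref_err))
    return bool(np.all(np.abs(est - ref) <= n_sigma * ref_err))

def falsifier_met(cases, token_budget=8000):
    """Hypothetical pass rule over the 20 archival cases: a majority are
    recovered within uncertainties, under budget, and never flagged as
    looping. `cases` is a list of dicts with those three fields."""
    ok = [c["recovered"] and c["tokens"] < token_budget and not c["looped"]
          for c in cases]
    return sum(ok) > len(cases) / 2
```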

Figures

Figures reproduced from arXiv: 2604.15664 by Bernhard Schölkopf, Kristen Menou, Terry Jingchen Zhang, Xinge Liu, Zhijing Jin.

Figure 1
Figure 1: Overview of Stargazer. Left: 120 RV tasks (100 synthetic, 20 real), with synthetic difficulty controlled by six physical factors. Center: agents run a periodogram-to-Keplerian workflow and are graded on statistical and physical criteria. Right: models often achieve strong statistical fits but fail to recover correct orbital parameters.
Figure 2
Figure 2: Stargazer framework. Left: task generation from synthetic physics or extraction from archival RV data. Center: agent iteration loop of analysis, submission, and per-criterion feedback. Right: evaluator forward-models submissions and grades with ΔBIC, RMS, Match, and Count.
Figure 3
Figure 3: (a) Match-score distribution across all submitted episodes, colored by difficulty tier; the shaded band marks the ±10% sensitivity region around the default threshold (0.80). (b) Pass rate as a function of the match threshold for each model; rankings are preserved across the entire 0.5–1.0 range. Pass rates are computed as the unweighted fraction across all 100 synthetic tasks.
Figure 4
Figure 4: Statistical (blue; mean of ΔBIC and RMS) versus physical (red; mean of Match Score and Planet Count) criterion pass rates by difficulty tier. Statistical pass rates stay high while physical recovery drops from Easy to Hard.
Figure 5
Figure 5: Pearson correlation between difficulty factors and per-task success, aggregated…
Figure 6
Figure 6: RV fits for two representative case studies: (a) a passing agent fit and (b) a failing agent fit. Each panel plots RV (m/s) against time (days) and shows ground truth, agent fit, and observations.
Original abstract

The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/AIPS-UofT/Stargazer and https://aips-uoft.github.io/Stargazer/, respectively.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Stargazer, a scalable benchmark environment consisting of 120 tasks (including 20 real archival RV datasets) for evaluating AI agents on iterative, physics-grounded model-fitting of radial-velocity time series. It evaluates eight frontier agents across difficulty tiers and reports that agents frequently achieve statistically competitive fits yet fail to recover the correct physical parameters, even when equipped with vanilla skills; additional test-time compute yields only marginal gains and often leads to recursive failure loops. The work positions the benchmark as a tool for training and scaling agent strategies on scientifically relevant inverse problems, with methodology claimed to generalize to other domains.

Significance. If the central empirical gap is shown to arise from agent limitations rather than inherent problem degeneracy, Stargazer would offer a valuable, reproducible testbed for measuring progress on scientific reasoning tasks that combine optimization with physical constraint adherence. The open-source release and inclusion of both synthetic and archival data strengthen its utility for the community.

major comments (2)
  1. [Evaluation of agents and task design] The central claim—that agents achieve good statistical fits but fail to recover correct physical parameters—rests on the assumption that each task possesses a unique, identifiable ground-truth parameter vector that any adequate solution must match. In the low-SNR multi-planet RV cases that form a core part of the benchmark, the likelihood surface is frequently multi-modal (period aliases, eccentricity–inclination degeneracies, correlated noise). The evaluation metric appears to use Euclidean distance to the injected or published values without reporting whether the agent’s solution attains a comparable log-likelihood or posterior density; this conflates ill-posedness of the inverse problem with shortcomings in agent reasoning. The 20 real archival tasks and the synthetic task construction therefore require an explicit comparison of fit quality at the agent solution versus the reference.
  2. [Abstract and results] The abstract and results summary state that the observed limitation persists even with vanilla skills and that excessive token usage reflects recursive failure loops. However, the manuscript provides insufficient detail on the precise agent prompts, the definition of “vanilla skills,” the statistical tests used to declare a fit “good,” and the criteria for identifying failure loops. Without these, the support for the generalization about AI capabilities in scientific model-fitting remains limited.
minor comments (2)
  1. [Benchmark construction] Clarify the exact number of tasks per difficulty tier and the precise criteria used to assign archival versus synthetic cases to tiers.
  2. [Discussion] The claim that the methodology “presumably generalizes” to other model-fitting problems would benefit from a short discussion of the minimal requirements (e.g., differentiable forward model, scalar fitness) that would allow porting the environment.
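
On minor comment 2, a minimal interface sketch makes the porting requirements concrete: only a forward model and a scalar fitness are strictly required, and a differentiable forward model matters only for gradient-based agents. All names below are hypothetical, not the Stargazer codebase:

```python
from dataclasses import dataclass, field
from typing import Any, Callable

@dataclass
class ModelFittingEnv:
    """Hypothetical minimal environment for porting the design to any
    domain. Per-criterion feedback (as in Stargazer) would be an optional
    refinement of `fitness`."""
    observed: Any
    forward_model: Callable[[Any], Any]   # parameters -> predicted data
    fitness: Callable[[Any, Any], float]  # (predicted, observed) -> scalar
    history: list = field(default_factory=list)

    def step(self, params: Any) -> float:
        """One agent iteration: propose parameters, receive feedback."""
        predicted = self.forward_model(params)
        score = self.fitness(predicted, self.observed)
        self.history.append((params, score))
        return score
```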

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which identify key areas for strengthening the evaluation methodology and transparency of our benchmark. We address each major comment below, indicating planned revisions where appropriate.

Point-by-point responses
  1. Referee: [Evaluation of agents and task design] The central claim—that agents achieve good statistical fits but fail to recover correct physical parameters—rests on the assumption that each task possesses a unique, identifiable ground-truth parameter vector that any adequate solution must match. In the low-SNR multi-planet RV cases that form a core part of the benchmark, the likelihood surface is frequently multi-modal (period aliases, eccentricity–inclination degeneracies, correlated noise). The evaluation metric appears to use Euclidean distance to the injected or published values without reporting whether the agent's solution attains a comparable log-likelihood or posterior density; this conflates ill-posedness of the inverse problem with shortcomings in agent reasoning. The 20 real archival tasks and the synthetic task construction therefore require an explicit comparison of fit quality at the agent solution versus the reference.

    Authors: We agree that multi-modality in low-SNR RV likelihood surfaces is a valid concern and that parameter-recovery metrics alone can be misleading without fit-quality context. The current manuscript already reports both parameter errors (Euclidean distance to reference) and statistical fit quality (reduced chi-squared and log-likelihood at the agent's solution) for all tasks in Section 4 and the supplementary tables. To directly address the referee's point, we will add an explicit side-by-side comparison of log-likelihood values (and, where feasible, approximate posterior densities via MCMC on the agent's final model) versus the reference parameters. This will be included as a new column in the main results tables and discussed in a revised Section 4.3. For the 20 archival datasets we will cross-reference published posterior summaries. These additions will help isolate true agent shortcomings from problem ill-posedness while preserving the central observation that agents frequently under-perform on physically meaningful recovery even when statistical fits are competitive.
    revision: partial

  2. Referee: [Abstract and results] The abstract and results summary state that the observed limitation persists even with vanilla skills and that excessive token usage reflects recursive failure loops. However, the manuscript provides insufficient detail on the precise agent prompts, the definition of “vanilla skills,” the statistical tests used to declare a fit “good,” and the criteria for identifying failure loops. Without these, the support for the generalization about AI capabilities in scientific model-fitting remains limited.

    Authors: We accept that the manuscript would be strengthened by greater methodological detail. In the revision we will expand the 'Agent Evaluation Protocol' subsection (currently Section 3.3) to provide: (i) the exact system and user prompts used for each of the eight agents (with full templates moved to Appendix B), (ii) an explicit definition of 'vanilla skills' as the baseline toolset consisting only of standard code execution, file I/O, and basic plotting without any astrophysics-specific priors or iterative refinement heuristics, (iii) the precise statistical criteria for declaring a fit 'good' (reduced chi-squared < 1.5 together with BIC improvement > 10 relative to the null model), and (iv) the operational definition of recursive failure loops (five or more consecutive actions that produce no improvement in log-likelihood while consuming > 8,000 tokens). We will also release anonymized full interaction traces as supplementary material. These changes will make the claims fully reproducible and better support the generalization to other scientific model-fitting domains.
    revision: yes
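
A sketch of the operational definitions promised in points (iii) and (iv) above, with thresholds copied from the response; function names are illustrative, not the authors' code:

```python
import numpy as np

def fit_is_good(resid, sigma, n_params, bic_null):
    """'Good' fit per the planned revision: reduced chi-squared < 1.5 and
    BIC improvement > 10 over the zero-planet null model (Gaussian noise;
    constants shared by both models cancel in the difference)."""
    resid, sigma = np.asarray(resid), np.asarray(sigma)
    n = len(resid)
    chi2 = float(np.sum((resid / sigma) ** 2))
    bic = chi2 + n_params * np.log(n)
    return chi2 / (n - n_params) < 1.5 and (bic_null - bic) > 10

def in_failure_loop(loglik_trace, tokens_used, window=5, token_floor=8000):
    """Recursive failure loop per the planned revision: five or more
    consecutive actions with no log-likelihood improvement, after the
    episode has consumed more than 8,000 tokens."""
    if len(loglik_trace) <= window or tokens_used <= token_floor:
        return False
    best_before = max(loglik_trace[:-window])
    return all(ll <= best_before for ll in loglik_trace[-window:])
```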

Circularity Check

0 steps flagged

No circularity: benchmark evaluation with externally defined tasks

full rationale

The paper introduces Stargazer as a new benchmark environment consisting of 120 tasks (including 20 real archival RV cases and synthetic ones) and reports empirical results from evaluating eight frontier agents on them. The central observation—that agents achieve good statistical fits but often fail to recover the benchmark-defined physical parameters—is a direct measurement against the task specifications rather than any derived prediction, first-principles result, or fitted quantity presented as independent. No equations, ansatzes, or uniqueness theorems are invoked; there are no self-citations load-bearing on the claims, no renaming of known results, and no reduction of outputs to inputs by construction. The evaluation is self-contained against the explicitly constructed tasks and agent runs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The contribution is the benchmark itself rather than new physical models or parameters; it relies on established RV analysis methods.

axioms (1)
  • domain assumption: Radial velocity data can be modeled with Keplerian orbits and noise models under standard astrophysical assumptions.
    Basis for task generation in the benchmark.
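
For concreteness, a minimal sketch of the standard Keplerian radial-velocity forward model this assumption refers to, using textbook formulas and conventional parameter names; an illustration, not the paper's task generator:

```python
import numpy as np

# Minimal Keplerian RV forward model under the stated domain assumption.
# Conventional parameters: P period [d], K semi-amplitude [m/s],
# e eccentricity, w argument of periastron [rad], M0 mean anomaly at
# t_ref [rad], gamma systemic offset [m/s].

def solve_kepler(M, e, tol=1e-10, max_iter=50):
    """Solve E - e*sin(E) = M for eccentric anomaly E by Newton iteration."""
    E = np.asarray(M, dtype=float).copy()
    for _ in range(max_iter):
        dE = (E - e * np.sin(E) - M) / (1 - e * np.cos(E))
        E -= dE
        if np.max(np.abs(dE)) < tol:
            break
    return E

def keplerian_rv(t, P, K, e, w, M0, gamma=0.0, t_ref=0.0):
    M = 2 * np.pi * (t - t_ref) / P + M0                 # mean anomaly
    E = solve_kepler(np.mod(M, 2 * np.pi), e)
    nu = 2 * np.arctan2(np.sqrt(1 + e) * np.sin(E / 2),  # true anomaly
                        np.sqrt(1 - e) * np.cos(E / 2))
    return K * (np.cos(nu + w) + e * np.cos(w)) + gamma

# Example: one noisy single-planet series a task generator might draw.
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0, 100, 40))
rv = keplerian_rv(t, P=12.3, K=25.0, e=0.2, w=1.1, M0=0.5) + rng.normal(0, 3, 40)
```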

pith-pipeline@v0.9.0 · 5564 in / 1220 out tokens · 41183 ms · 2026-05-12T03:00:00.043885+00:00 · methodology

