Recognition: no theorem link
Stargazer: A Scalable Model-Fitting Benchmark Environment for AI Agents under Astrophysical Constraints
Pith reviewed 2026-05-12 03:00 UTC · model grok-4.3
The pith
AI agents achieve statistical fits to radial velocity data but often fail to recover the true physical parameters of planetary systems.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Stargazer demonstrates a clear separation between statistical optimization and physical fidelity: agents produce models that match the observed radial-velocity curves adequately in a numerical sense yet deviate from the actual system parameters. This mismatch appears across the full task range, from single-planet high-SNR cases to complex low-SNR multi-planet configurations, and persists after the addition of standard agent skills or increased test-time compute.
What carries the argument
The Stargazer environment, a collection of 120 dynamic tasks (including 20 real archival radial-velocity datasets) that supplies iterative feedback on both statistical fit quality and adherence to physical constraints.
If this is right
- Increasing test-time compute produces only marginal improvement in recovering physical parameters.
- High token consumption frequently corresponds to unproductive recursive failure loops rather than useful exploration.
- Stargazer supplies a concrete setting in which to train, evaluate, and scale new agent strategies for physically constrained model fitting.
- The same simulation-driven design approach can be applied to model-fitting problems in other scientific fields.
Where Pith is reading between the lines
- Current agent designs may lack built-in mechanisms for enforcing domain physical priors during iterative search.
- Benchmarks that reward only statistical metrics risk overestimating readiness for scientific applications.
- Hybrid systems that combine neural reasoning with explicit physical simulators could be tested directly on these tasks.
Load-bearing premise
The selected tasks and the performance of eight current agents capture the essential difficulties that would appear in genuine astrophysical model-fitting work.
What would settle it
A clear falsifier would be an agent that recovers the correct physical parameters (within observational uncertainties) on the majority of the 20 real archival cases while keeping token usage low and avoiding recursive loops.
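The acceptance criterion in this falsifier can be sketched as a simple check: did every recovered parameter land within the quoted observational uncertainty of the reference value? The function and variable names below are hypothetical illustrations, not the benchmark's actual API.

```python
import numpy as np

def recovers_physical_parameters(agent_params, ref_params, ref_sigmas, n_sigma=1.0):
    """Return True if every parameter agrees with the reference within n_sigma
    observational uncertainties (a hypothetical pass/fail criterion)."""
    agent = np.asarray(agent_params, dtype=float)
    ref = np.asarray(ref_params, dtype=float)
    sig = np.asarray(ref_sigmas, dtype=float)
    return bool(np.all(np.abs(agent - ref) <= n_sigma * sig))

# Illustrative single-planet case: period (days), semi-amplitude (m/s), eccentricity.
agent = [3.52, 55.1, 0.02]
ref = [3.5247, 55.9, 0.01]
sigma = [0.01, 1.5, 0.02]
print(recovers_physical_parameters(agent, ref, sigma))
```

A per-parameter tolerance like this is stricter than a joint chi-squared cut; which convention Stargazer actually uses would need to be read off the benchmark's scoring code.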
Original abstract
The rise of autonomous AI agents suggests that dynamic benchmark environments with built-in feedback on scientifically grounded tasks are needed to evaluate the capabilities of these agents in research work. We introduce Stargazer, a scalable environment for evaluating AI agents on dynamic, iterative physics-grounded model-fitting tasks using inference on radial-velocity (RV) time series data. Stargazer comprises 120 tasks across three difficulty tiers, including 20 real archival cases, covering diverse scenarios ranging from high-SNR single-planet systems to complex multi-planetary configurations requiring involved low-SNR analysis. Our evaluation of eight frontier agents reveals a gap between numerical optimization and adherence to physical constraints: although agents often achieve a good statistical fit, they frequently fail to recover correct physical system parameters, a limitation that persists even when agents are equipped with vanilla skills. Furthermore, increasing test-time compute yields only marginal gains, with excessive token usage often reflecting recursive failure loops rather than meaningful exploration. Stargazer presents an opportunity to train, evaluate, scaffold, and scale strategies on a model-fitting problem of practical research relevance today. Our methodology to design a simulation-driven environment for AI agents presumably generalizes to many other model-fitting problems across scientific domains. Source code and the project website are available at https://github.com/AIPS-UofT/Stargazer and https://aips-uoft.github.io/Stargazer/, respectively.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Stargazer, a scalable benchmark environment consisting of 120 tasks (including 20 real archival RV datasets) for evaluating AI agents on iterative, physics-grounded model-fitting of radial-velocity time series. It evaluates eight frontier agents across difficulty tiers and reports that agents frequently achieve statistically competitive fits yet fail to recover the correct physical parameters, even when equipped with vanilla skills; additional test-time compute yields only marginal gains and often leads to recursive failure loops. The work positions the benchmark as a tool for training and scaling agent strategies on scientifically relevant inverse problems, with methodology claimed to generalize to other domains.
Significance. If the central empirical gap is shown to arise from agent limitations rather than inherent problem degeneracy, Stargazer would offer a valuable, reproducible testbed for measuring progress on scientific reasoning tasks that combine optimization with physical constraint adherence. The open-source release and inclusion of both synthetic and archival data strengthen its utility for the community.
major comments (2)
- [Evaluation of agents and task design] The central claim—that agents achieve good statistical fits but fail to recover correct physical parameters—rests on the assumption that each task possesses a unique, identifiable ground-truth parameter vector that any adequate solution must match. In the low-SNR multi-planet RV cases that form a core part of the benchmark, the likelihood surface is frequently multi-modal (period aliases, eccentricity–inclination degeneracies, correlated noise). The evaluation metric appears to use Euclidean distance to the injected or published values without reporting whether the agent’s solution attains a comparable log-likelihood or posterior density; this conflates ill-posedness of the inverse problem with shortcomings in agent reasoning. The 20 real archival tasks and the synthetic task construction therefore require an explicit comparison of fit quality at the agent solution versus the reference.
- [Abstract and results] The abstract and results summary state that the observed limitation persists even with vanilla skills and that excessive token usage reflects recursive failure loops. However, the manuscript provides insufficient detail on the precise agent prompts, the definition of “vanilla skills,” the statistical tests used to declare a fit “good,” and the criteria for identifying failure loops. Without these, the support for the generalization about AI capabilities in scientific model-fitting remains limited.
minor comments (2)
- [Benchmark construction] Clarify the exact number of tasks per difficulty tier and the precise criteria used to assign archival versus synthetic cases to tiers.
- [Discussion] The claim that the methodology “presumably generalizes” to other model-fitting problems would benefit from a short discussion of the minimal requirements (e.g., differentiable forward model, scalar fitness) that would allow porting the environment.
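The referee's minimal-requirements point (a forward model plus a scalar fitness) can be made concrete with a generic interface sketch. All names here are hypothetical and not the Stargazer API; the linear-fit instantiation is purely illustrative.

```python
from dataclasses import dataclass
from typing import Callable, Sequence
import numpy as np

@dataclass
class ModelFittingEnv:
    """Minimal sketch of a portable model-fitting environment: any domain that
    supplies these two callables could reuse the same agent loop."""
    data: np.ndarray                                        # observed values
    forward_model: Callable[[Sequence[float]], np.ndarray]  # params -> prediction
    fitness: Callable[[np.ndarray, np.ndarray], float]      # (pred, data) -> scalar

    def step(self, params: Sequence[float]) -> float:
        """One interaction: propose parameters, receive scalar feedback."""
        prediction = self.forward_model(params)
        return self.fitness(prediction, self.data)

# Example instantiation: fit y = a*x + b with negative-MSE fitness.
x = np.linspace(0.0, 1.0, 50)
y = 2.0 * x + 1.0
env = ModelFittingEnv(
    data=y,
    forward_model=lambda p: p[0] * x + p[1],
    fitness=lambda pred, obs: -float(np.mean((pred - obs) ** 2)),
)
print(env.step([2.0, 1.0]))  # exact parameters maximize this fitness
```

Note that nothing here requires differentiability; if the environment only exposes scalar feedback, gradient-free search (as the agents in the paper perform) is the natural fit.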
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed comments, which identify key areas for strengthening the evaluation methodology and transparency of our benchmark. We address each major comment below, indicating planned revisions where appropriate.
Point-by-point responses
Referee: [Evaluation of agents and task design] The central claim—that agents achieve good statistical fits but fail to recover correct physical parameters—rests on the assumption that each task possesses a unique, identifiable ground-truth parameter vector that any adequate solution must match. In the low-SNR multi-planet RV cases that form a core part of the benchmark, the likelihood surface is frequently multi-modal (period aliases, eccentricity–inclination degeneracies, correlated noise). The evaluation metric appears to use Euclidean distance to the injected or published values without reporting whether the agent’s solution attains a comparable log-likelihood or posterior density; this conflates ill-posedness of the inverse problem with shortcomings in agent reasoning. The 20 real archival tasks and the synthetic task construction therefore require an explicit comparison of fit quality at the agent solution versus the reference.
Authors: We agree that multi-modality in low-SNR RV likelihood surfaces is a valid concern and that parameter recovery metrics alone can be misleading without fit-quality context. The current manuscript already reports both parameter errors (Euclidean distance to reference) and statistical fit quality (reduced chi-squared and log-likelihood at the agent's solution) for all tasks in Section 4 and the supplementary tables. To directly address the referee's point, we will add an explicit side-by-side comparison of log-likelihood values (and, where feasible, approximate posterior densities via MCMC on the agent's final model) versus the reference parameters. This will be included as a new column in the main results tables and discussed in a revised Section 4.3. For the 20 archival datasets we will cross-reference published posterior summaries. These additions will help isolate true agent shortcomings from problem ill-posedness while preserving the central observation that agents frequently under-perform on physically meaningful recovery even when statistical fits are competitive. revision: partial
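The comparison the authors promise can be sketched as follows: evaluate the same Gaussian log-likelihood at the agent's solution and at the reference parameters, so that a small gap flags problem degeneracy while a large gap flags an agent failure. The circular-orbit (pure sinusoid) RV model below is a deliberate simplification of the full Keplerian model, and all numbers are illustrative, not from the paper.

```python
import numpy as np

def gaussian_loglike(rv_obs, rv_model, sigmas):
    """Gaussian log-likelihood of observed RVs under a model prediction."""
    r = (rv_obs - rv_model) / sigmas
    return float(-0.5 * np.sum(r**2 + np.log(2.0 * np.pi * sigmas**2)))

def circular_rv(t, period, K, phase, gamma):
    """Circular-orbit RV: a sinusoid with semi-amplitude K and offset gamma."""
    return K * np.sin(2.0 * np.pi * t / period + phase) + gamma

# Synthetic dataset with known truth (illustrative values).
rng = np.random.default_rng(0)
t = np.sort(rng.uniform(0.0, 100.0, 60))
sigmas = np.full_like(t, 2.0)
truth = dict(period=12.3, K=20.0, phase=0.4, gamma=1.0)
rv = circular_rv(t, **truth) + rng.normal(0.0, 2.0, t.size)

# Hypothetical agent solution at a nearby (wrong) period.
agent = dict(truth, period=12.0)
ll_ref = gaussian_loglike(rv, circular_rv(t, **truth), sigmas)
ll_agent = gaussian_loglike(rv, circular_rv(t, **agent), sigmas)
print(ll_ref, ll_agent, ll_ref - ll_agent)
```

Reporting the log-likelihood gap alongside the Euclidean parameter error, as the response proposes, is exactly what separates "the problem is degenerate" from "the agent stopped at a worse mode."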
Referee: [Abstract and results] The abstract and results summary state that the observed limitation persists even with vanilla skills and that excessive token usage reflects recursive failure loops. However, the manuscript provides insufficient detail on the precise agent prompts, the definition of “vanilla skills,” the statistical tests used to declare a fit “good,” and the criteria for identifying failure loops. Without these, the support for the generalization about AI capabilities in scientific model-fitting remains limited.
Authors: We accept that the manuscript would be strengthened by greater methodological detail. In the revision we will expand the “Agent Evaluation Protocol” subsection (currently Section 3.3) to provide: (i) the exact system and user prompts used for each of the eight agents (with full templates moved to Appendix B), (ii) an explicit definition of “vanilla skills” as the baseline toolset consisting only of standard code execution, file I/O, and basic plotting without any astrophysics-specific priors or iterative refinement heuristics, (iii) the precise statistical criteria for declaring a fit “good” (reduced chi-squared < 1.5 together with BIC improvement > 10 relative to the null model), and (iv) the operational definition of recursive failure loops (five or more consecutive actions that produce no improvement in log-likelihood while consuming > 8,000 tokens). We will also release anonymized full interaction traces as supplementary material. These changes will make the claims fully reproducible and better support the generalization to other scientific model-fitting domains. revision: yes
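The operational failure-loop definition in point (iv) translates directly into a trace-scanning check: flag any maximal run of five or more consecutive non-improving actions whose cumulative token usage exceeds 8,000. The trace format `(log_likelihood, tokens_used)` per action is a hypothetical sketch, not the paper's logging schema.

```python
def find_failure_loops(trace, min_actions=5, token_budget=8000):
    """trace: sequence of (log_likelihood, tokens_used) tuples, one per action.
    Returns (start, end) index pairs of maximal stagnant runs that satisfy the
    failure-loop criteria sketched in the rebuttal."""
    loops, best_ll = [], float("-inf")
    start, run_tokens = None, 0
    for i, (ll, used) in enumerate(trace):
        if ll > best_ll:  # an improvement ends any stagnant run
            if start is not None and i - start >= min_actions and run_tokens > token_budget:
                loops.append((start, i - 1))
            best_ll, start, run_tokens = ll, None, 0
        else:  # a stagnant action extends the current run
            if start is None:
                start = i
            run_tokens += used
    # close out a run that reaches the end of the trace
    if start is not None and len(trace) - start >= min_actions and run_tokens > token_budget:
        loops.append((start, len(trace) - 1))
    return loops
```

For example, two improving steps followed by six stagnant 2,000-token actions would be flagged as a single loop, whereas a short or cheap stagnant stretch would not.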
Circularity Check
No circularity: benchmark evaluation with externally defined tasks
full rationale
The paper introduces Stargazer as a new benchmark environment consisting of 120 tasks (including 20 real archival RV cases and synthetic ones) and reports empirical results from evaluating eight frontier agents on them. The central observation—that agents achieve good statistical fits but often fail to recover the benchmark-defined physical parameters—is a direct measurement against the task specifications rather than any derived prediction, first-principles result, or fitted quantity presented as independent. No equations, ansatzes, or uniqueness theorems are invoked; there are no self-citations load-bearing on the claims, no renaming of known results, and no reduction of outputs to inputs by construction. The evaluation is self-contained against the explicitly constructed tasks and agent runs.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Radial-velocity data can be modeled with Keplerian orbits and noise models under standard astrophysical assumptions.
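This axiom has a standard concrete form: the single-planet Keplerian RV signal, obtained by solving Kepler's equation for the eccentric anomaly and converting to the true anomaly. The sketch below follows common RV parameter conventions (P, K, e, ω, M₀, γ, matching the six-parameter fit mentioned in the paper's skill descriptions), but the function names are not the benchmark's API.

```python
import numpy as np

def solve_kepler(M, e, tol=1e-10, max_iter=50):
    """Eccentric anomaly E from mean anomaly M (radians) via Newton's method."""
    E = M.copy()
    for _ in range(max_iter):
        dE = (E - e * np.sin(E) - M) / (1.0 - e * np.cos(E))
        E -= dE
        if np.max(np.abs(dE)) < tol:
            break
    return E

def keplerian_rv(t, P, K, e, omega, M0, gamma):
    """RV (m/s) at times t: period P, semi-amplitude K, eccentricity e,
    argument of periastron omega, mean anomaly at t=0 M0, systemic velocity gamma."""
    M = np.mod(2.0 * np.pi * t / P + M0, 2.0 * np.pi)
    E = solve_kepler(M, e)
    # true anomaly from eccentric anomaly
    nu = 2.0 * np.arctan2(np.sqrt(1.0 + e) * np.sin(E / 2.0),
                          np.sqrt(1.0 - e) * np.cos(E / 2.0))
    return K * (np.cos(omega + nu) + e * np.cos(omega)) + gamma
```

In the circular limit (e = 0) this reduces to a pure sinusoid, which is why, per the paper's skill notes, a sine fit is only a starting point and never a final answer.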