arxiv: 2605.05679 · v1 · submitted 2026-05-07 · 🌌 astro-ph.CO · physics.data-an

Bayesian leave-one-out cross-validation for astrophysical model comparison using gravitational-wave background data

Shreyas Tiruvaskar , Chris Gordon

Pith reviewed 2026-05-08 05:49 UTC · model grok-4.3

keywords: pulsar timing arrays · gravitational wave background · ultralight dark matter · Bayesian model comparison · leave-one-out cross-validation · supermassive black hole binaries

The pith

Current PTA data do not decisively prefer any model of supermassive black hole binary evolution.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies Bayesian leave-one-out cross-validation to four models of the low-frequency gravitational-wave background seen in pulsar timing arrays: simplified and realistic ultralight-dark-matter implementations, a phenomenological environmental-hardening model, and a pure gravitational-wave model. It evaluates predictive performance on the five lowest frequency bins and finds the phenomenological model has the highest expected log predictive density, yet its lead is smaller than the estimated uncertainties. Within the ultralight-dark-matter class the simplified version beats the realistic implementation in every bin examined. The data remain compatible with ultralight-dark-matter suppression of low-frequency power but cannot yet distinguish it from generic environmental descriptions.

Core claim

The current pulsar-timing-array data therefore do not decisively prefer one model overall. The clearest pairwise result is within the ultralight-dark-matter framework: the simplified model outperforms the realistic implementation in all five frequency bins.

What carries the argument

Bayesian leave-one-out cross-validation that computes expected log predictive density on the five lowest pulsar-timing-array frequency bins.
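
As a rough illustration of what such a calculation involves, the sketch below estimates the leave-one-out log predictive density for each bin from a matrix of pointwise log-likelihoods evaluated at posterior draws. It uses raw importance weights rather than the Pareto-smoothed variant (PSIS-LOO) that the simulated rebuttal further down attributes to the authors. The function names, the array layout, and the synthetic numbers are assumptions made for illustration and are not taken from the paper.

```python
import numpy as np

def log_mean_exp(a, axis=0):
    """Numerically stable log(mean(exp(a))) along one axis."""
    a_max = np.max(a, axis=axis, keepdims=True)
    out = np.log(np.mean(np.exp(a - a_max), axis=axis, keepdims=True)) + a_max
    return np.squeeze(out, axis=axis)

def loo_pointwise(log_lik):
    """Raw importance-sampling LOO estimate for each held-out bin.

    log_lik has shape (n_draws, n_bins) and holds log p(y_i | theta_s),
    evaluated at posterior draws theta_s obtained from the full data set.
    """
    # p(y_i | y_-i) is approximated by the harmonic mean of p(y_i | theta_s).
    return -log_mean_exp(-log_lik, axis=0)

# Synthetic stand-in for a real log-likelihood matrix (illustration only).
rng = np.random.default_rng(0)
log_lik = rng.normal(loc=-1.0, scale=0.3, size=(4000, 5))

pointwise = loo_pointwise(log_lik)  # one value per frequency bin
elpd_loo = pointwise.sum()          # summed over the five bins
```

The raw estimator can have very high variance when a few draws dominate the weights, which is the problem Pareto smoothing is designed to reduce.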

If this is right

The phenomenological environmental-hardening model records the largest expected log predictive density.
Its advantage over the other three models lies within the estimated standard errors.
The simplified ultralight-dark-matter model consistently outperforms the realistic version across all five bins.
The data remain compatible with ultralight-dark-matter-induced low-frequency suppression.
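
The phrase "within the estimated standard errors" can be made concrete with the usual paired comparison of pointwise values. The sketch below assumes the standard error of an ELPD difference is computed as the square root of N times the sample variance of the per-bin differences. The `ddof=1` convention and all numbers are illustrative assumptions, not results from the paper.

```python
import numpy as np

def elpd_difference(pointwise_a, pointwise_b):
    """Return (elpd_diff, se_diff) for model A minus model B."""
    diff = np.asarray(pointwise_a, dtype=float) - np.asarray(pointwise_b, dtype=float)
    n = diff.size
    elpd_diff = diff.sum()
    # Paired standard error: sqrt(N * sample variance of the per-bin differences).
    se_diff = np.sqrt(n * diff.var(ddof=1))
    return elpd_diff, se_diff

# Hypothetical per-bin values for two models (not the paper's numbers).
model_a = [-1.10, -0.95, -1.30, -0.80, -1.05]
model_b = [-1.20, -1.45, -1.35, -1.70, -1.25]

elpd_diff, se_diff = elpd_difference(model_a, model_b)
```

With only five bins the variance estimate rests on very few terms, so the resulting standard error should itself be read as rough.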

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Additional frequency bins or future higher-sensitivity data could shrink the error bars enough to produce a decisive model ranking.
The same cross-validation procedure can be applied directly to other pulsar-timing-array observables or to different dark-matter mass ranges.
If bin-to-bin correlations prove non-negligible, the leave-one-out estimates would need to be replaced by a blocked or joint predictive-density calculation.
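
To show why correlations matter for the last point, the sketch below compares the leave-one-out log density of each bin under a joint Gaussian with correlated bins against the same quantity with independent bins, at fixed model parameters. The Gaussian form, the AR(1)-style covariance with `rho = 0.5`, and the data values are all assumptions chosen for illustration. The paper's actual likelihood may differ, and a full calculation would also average over the posterior.

```python
import numpy as np

def gaussian_logpdf(x, mean, var):
    """Log density of a univariate normal distribution."""
    return -0.5 * (np.log(2.0 * np.pi * var) + (x - mean) ** 2 / var)

def loo_conditional_logpdf(y, mu, cov, i):
    """log p(y_i | y_-i) for a joint Gaussian N(mu, cov)."""
    y, mu, cov = np.asarray(y), np.asarray(mu), np.asarray(cov)
    rest = np.delete(np.arange(y.size), i)
    cov_rr = cov[np.ix_(rest, rest)]   # covariance among the retained bins
    cov_ir = cov[i, rest]              # covariance between bin i and the rest
    w = np.linalg.solve(cov_rr, cov_ir)
    cond_mean = mu[i] + w @ (y[rest] - mu[rest])
    cond_var = cov[i, i] - w @ cov_ir
    return gaussian_logpdf(y[i], cond_mean, cond_var)

n = 5
y = np.array([0.3, -0.1, 0.4, 0.0, -0.2])   # hypothetical residual-like data
mu = np.zeros(n)
idx = np.arange(n)
rho = 0.5                                   # assumed neighbour correlation
cov_corr = rho ** np.abs(idx[:, None] - idx[None, :])
cov_indep = np.eye(n)

joint = [loo_conditional_logpdf(y, mu, cov_corr, i) for i in range(n)]
indep = [loo_conditional_logpdf(y, mu, cov_indep, i) for i in range(n)]
```

When the covariance is diagonal the conditional density reduces to the marginal one, which is the case the independent-bin treatment implicitly relies on.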

Load-bearing premise

The five lowest frequency bins can be treated as sufficiently independent for leave-one-out cross-validation to give reliable estimates of predictive performance without unaccounted correlations or systematics.

What would settle it

A new analysis that finds substantial correlations between those frequency bins or unmodeled systematics large enough to change which model has the highest expected log predictive density.

Figures

Figures reproduced from arXiv: 2605.05679 by Chris Gordon, Shreyas Tiruvaskar.

**Figure 1.** Pareto- …

**Figure 2.** Pointwise predictive contributions for all …

**Figure 3.** Strain spectra for all sampled MCMC parameter combinations. We plot the median and 95% posterior …

Original abstract

Previous work showed that ultralight-dark-matter solitons can provide dynamical friction for supermassive black-hole binaries, suppressing low-frequency power in the pulsar-timing-array gravitational-wave background and constraining the particle mass and effective ultralight-dark-matter fraction. Here we extend that analysis by comparing the predictive performance of four models: simplified and realistic ultralight-dark-matter implementations, a phenomenological environmental-hardening model, and a gravitational-wave-only model. We use Bayesian leave-one-out cross-validation on the five lowest pulsar-timing-array frequency bins. The phenomenological model gives the largest expected log predictive density, but its advantage over the other models is not large compared with the estimated standard errors. The current data therefore do not decisively prefer one model overall. The clearest pairwise result is within the ultralight-dark-matter framework: the simplified model outperforms the realistic implementation in all five frequency bins. Current pulsar-timing-array data are therefore compatible with ultralight-dark-matter-induced low-frequency suppression, but do not yet distinguish ultralight-dark-matter significantly from more generic environmental descriptions of supermassive-black-hole-binary evolution.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This paper applies Bayesian LOO-CV to rank ULDM and other models on PTA data but the results rest on treating five frequency bins as independent.

The main thing to know is that the authors use Bayesian leave-one-out cross-validation on the five lowest PTA frequency bins to compare a simplified ULDM model, a more realistic ULDM implementation, a phenomenological environmental model, and a pure gravitational-wave model. They report no decisive overall preference but find the simplified ULDM version outperforms the realistic one across all bins. This is a direct extension of earlier ULDM work on dynamical friction in supermassive black hole binaries.

What the paper does well is to shift the comparison toward out-of-sample predictive density rather than in-sample likelihoods alone, and to state plainly that current data cannot strongly distinguish the models. That cautious tone matches the limited sensitivity of existing PTA observations. The approach is new in this specific astrophysical setting even if LOO-CV itself is established.

The clearest soft spot is the independence assumption for the bins. PTA data are shaped by a common red-noise process whose power-law spectrum couples neighboring frequencies, so residual covariance not captured in the per-bin likelihood could bias the expected log predictive densities upward for models that fit the shared component. With only five bins the effective sample is small, and any unaccounted dependence could alter or erase the pairwise ranking between the two ULDM versions. The abstract gives no indication that the authors tested for bin correlations or ran sensitivity checks, which leaves the central claim vulnerable.

This work is for people already working on PTA gravitational-wave background analyses or ultralight dark matter constraints who want a concrete example of predictive model selection on real data. A reader focused on method application in noisy astrophysical datasets will get the most out of it.

It deserves serious referee time because the method is applied honestly and the conclusions stay within what the data support, even though the bin-independence issue needs explicit checking in review. I would send it out for peer review with a request that reviewers examine whether the likelihood accounts for cross-bin covariance.

Referee Report

3 major / 2 minor

Summary. The manuscript applies Bayesian leave-one-out cross-validation (LOO-CV) to compare four models of the pulsar-timing-array gravitational-wave background: simplified and realistic ultralight-dark-matter (ULDM) implementations that suppress low-frequency power via dynamical friction on supermassive black-hole binaries, a phenomenological environmental-hardening model, and a pure gravitational-wave model. Using the five lowest frequency bins, it reports that the phenomenological model achieves the highest expected log predictive density (ELPD) but with differences not exceeding the estimated standard errors, yielding no decisive overall preference. Within the ULDM class the simplified implementation outperforms the realistic one in every bin. The central conclusion is that current data remain compatible with ULDM-induced suppression yet cannot yet distinguish it from generic environmental descriptions of binary evolution.

Significance. If the LOO-CV results prove robust to the independence assumption, the work supplies a principled, out-of-sample metric for ranking astrophysical models of the nanohertz gravitational-wave background. It demonstrates that ULDM solitons remain viable but are not yet statistically preferred over simpler phenomenological alternatives, thereby informing the interpretation of upcoming PTA data releases and the design of future constraints on ultralight-dark-matter particle mass and fraction. The explicit use of held-out predictive densities rather than in-sample likelihoods is a methodological strength.

major comments (3)

[LOO-CV application and data section] The LOO-CV procedure (described in the methods and applied to the five lowest frequency bins) treats the bins as conditionally independent given the model parameters. PTA data, however, arise from a common red-noise process whose power-law spectrum induces positive covariance between neighboring bins. No diagnostic for residual bin-to-bin correlations, no block cross-validation test, and no sensitivity analysis to this assumption are presented; under unaccounted dependence the reported ELPD rankings (including the consistent superiority of the simplified ULDM model) can become optimistically biased.
[Model definitions] Comparative ELPD results are given without explicit statements of the model parameterizations, the priors placed on ULDM particle mass and effective fraction, or the precise differences between the simplified and realistic ULDM implementations (e.g., how soliton density profiles and dynamical-friction prescriptions are implemented). These details are required to determine whether the observed performance gap reflects physical modeling choices or differences in effective degrees of freedom.
[Results and statistical reporting] The claim that the phenomenological model’s ELPD advantage is “not large compared with the estimated standard errors” (abstract and results) relies on the accuracy of those standard-error estimates. The paper does not specify whether the errors are obtained from the PSIS-LOO approximation, from posterior predictive checks, or from another method, nor whether multiple-comparison corrections across four models and five bins have been applied.
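
The block cross-validation test raised in the first major comment could be set up along the following lines. This is a minimal sketch that only enumerates which adjacent bins are held out together. The function name and the block size of two are assumptions, and refitting each model on the retained bins, which a real test would require, is not shown.

```python
def adjacent_block_folds(n_bins, block_size):
    """List (held_out, retained) bin indices for every contiguous block."""
    if not 1 <= block_size < n_bins:
        raise ValueError("block_size must be at least 1 and smaller than n_bins")
    folds = []
    for start in range(n_bins - block_size + 1):
        held_out = list(range(start, start + block_size))
        retained = [i for i in range(n_bins) if i not in held_out]
        folds.append((held_out, retained))
    return folds

# Five frequency bins, holding out two neighbouring bins at a time.
folds = adjacent_block_folds(n_bins=5, block_size=2)
for held_out, retained in folds:
    print("hold out", held_out, "fit on", retained)
```

Setting `block_size=1` recovers ordinary leave-one-out folds, so the same loop can serve both analyses.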

minor comments (2)

Define all acronyms (ELPD, ULDM, PTA, SMBHB) at first use in the main text and ensure consistent notation for the expected log predictive density throughout.
[Results section] A compact table listing ELPD values, standard errors, and pairwise differences for each model and frequency bin would improve readability of the comparative results.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their careful reading of the manuscript and for the constructive comments. We address each major point below and indicate the revisions we will incorporate.

Referee: [LOO-CV application and data section] The LOO-CV procedure (described in the methods and applied to the five lowest frequency bins) treats the bins as conditionally independent given the model parameters. PTA data, however, arise from a common red-noise process whose power-law spectrum induces positive covariance between neighboring bins. No diagnostic for residual bin-to-bin correlations, no block cross-validation test, and no sensitivity analysis to this assumption are presented; under unaccounted dependence the reported ELPD rankings (including the consistent superiority of the simplified ULDM model) can become optimistically biased.

Authors: We agree that the conditional independence assumption between frequency bins is an approximation. The underlying red-noise process does induce positive correlations that are not explicitly modeled in the LOO-CV. In the revised manuscript we will add an explicit discussion of this assumption in the methods, include a diagnostic of posterior predictive residuals across bins to check for unaccounted correlations, and perform a sensitivity analysis using block cross-validation (holding out adjacent bins together). These additions will allow readers to assess the robustness of the reported ELPD rankings.
Revision: yes

Referee: [Model definitions] Comparative ELPD results are given without explicit statements of the model parameterizations, the priors placed on ULDM particle mass and effective fraction, or the precise differences between the simplified and realistic ULDM implementations (e.g., how soliton density profiles and dynamical-friction prescriptions are implemented). These details are required to determine whether the observed performance gap reflects physical modeling choices or differences in effective degrees of freedom.

Authors: We thank the referee for this observation. The current manuscript refers to prior work for model details but does not restate them. We will expand the methods section to provide the explicit parameterizations, including the priors on ULDM particle mass (log-uniform) and effective fraction (uniform), and to delineate the differences: the simplified implementation uses an approximate uniform-density soliton core and a basic dynamical-friction formula, while the realistic version adopts the full soliton density profile from N-body simulations together with an orbit-averaged friction prescription that accounts for binary eccentricity evolution. This will clarify that the performance difference arises from the physical modeling choices.
Revision: yes

Referee: [Results and statistical reporting] The claim that the phenomenological model’s ELPD advantage is “not large compared with the estimated standard errors” (abstract and results) relies on the accuracy of those standard-error estimates. The paper does not specify whether the errors are obtained from the PSIS-LOO approximation, from posterior predictive checks, or from another method, nor whether multiple-comparison corrections across four models and five bins have been applied.

Authors: The standard errors on the ELPD differences are computed via the PSIS-LOO approximation implemented in the loo package. We will state this explicitly in the revised results section. On multiple-comparison corrections, our conclusions emphasize that differences are not large relative to the SEs rather than formal significance testing; the comparisons are pre-specified. We will add a clarifying sentence on the method and note that no Bonferroni-style correction was applied. If desired, we can include a supplementary table with raw and adjusted values.
Revision: partial
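
For readers unfamiliar with what a Bonferroni-style adjustment would involve here, the sketch below converts an ELPD difference and its standard error into a two-sided p-value under a normal approximation and then scales it by the number of pairwise comparisons among four models. The normal approximation is an assumption that is questionable with five data points, and the function names are hypothetical. The response above indicates the authors did not carry out such a test.

```python
import math
from itertools import combinations

def two_sided_p(elpd_diff, se_diff):
    """Two-sided p-value for elpd_diff / se_diff under a normal approximation."""
    z = abs(elpd_diff) / se_diff
    return math.erfc(z / math.sqrt(2.0))

def bonferroni(p_values):
    """Multiply each p-value by the number of comparisons, capped at 1."""
    m = len(p_values)
    return [min(1.0, m * p) for p in p_values]

models = ["simplified ULDM", "realistic ULDM", "phenomenological", "GW-only"]
pairs = list(combinations(models, 2))  # every unordered pair of models
n_comparisons = len(pairs)
```

Because the adjustment multiplies every p-value by the number of pairs, a difference that is already small relative to its standard error stays inconclusive after correction.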

Circularity Check

0 steps flagged

LOO-CV provides independent predictive assessment with no definitional reduction

The paper computes expected log predictive densities for four models (simplified ULDM, realistic ULDM, phenomenological, GW-only) by applying Bayesian LOO-CV directly to the five lowest PTA frequency bins. This estimator is constructed to evaluate out-of-sample performance on held-out bins and therefore cannot reduce by definition to any in-sample fit or parameter that was optimized on the full data. No load-bearing step invokes a self-citation for a uniqueness theorem, renames a known result, or smuggles an ansatz; the pairwise rankings and overall conclusion emerge from the standard LOO formula applied to the observed data. The independence assumption among bins is a modeling premise whose validity can be checked externally and does not create circularity in the reported ELPD values.
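
For reference, the standard LOO quantity referred to above can be written as follows. The notation is assumed here rather than taken from the paper: $y_i$ is the datum in bin $i$, $y_{-i}$ denotes the remaining bins, $\theta$ the model parameters, and $N = 5$.

```latex
\mathrm{elpd}_{\mathrm{loo}} = \sum_{i=1}^{N} \log p(y_i \mid y_{-i}),
\qquad
p(y_i \mid y_{-i}) = \int p(y_i \mid \theta)\, p(\theta \mid y_{-i})\, \mathrm{d}\theta .
```

The integrand uses $p(y_i \mid \theta)$ in place of $p(y_i \mid \theta, y_{-i})$, which holds only if the bins are conditionally independent given $\theta$. This is the step at which the independence premise enters the calculation.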

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis applies standard Bayesian cross-validation to existing astrophysical models without introducing new free parameters or entities; it relies on the assumption that the selected frequency bins support the cross-validation procedure.

axioms (1)

Domain assumption: The five lowest PTA frequency bins provide sufficiently independent information for LOO-CV to produce reliable predictive density estimates.
This assumption underpins the application of leave-one-out cross-validation to these specific data points.

