Denoising weak lensing mass maps with diffusion model: systematic comparison with generative adversarial network

Ken Osato; Masato Shirasaki; Shohei D. Aoyama

arxiv: 2505.00345 · v4 · submitted 2025-05-01 · 🌌 astro-ph.CO

Shohei D. Aoyama, Ken Osato, Masato Shirasaki

Pith reviewed 2026-05-22 18:04 UTC · model grok-4.3

classification 🌌 astro-ph.CO

keywords: weak lensing · denoising · diffusion models · generative adversarial networks · mass maps · cosmological statistics · shape noise · convergence maps

The pith

Diffusion models denoise weak lensing maps more accurately than GANs and recover statistics to smaller scales.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that diffusion models trained on large suites of mock weak lensing observations can remove shape noise from convergence maps more effectively than generative adversarial networks. Both approaches reconstruct the true maps well on large scales, but the diffusion model maintains fidelity in measured statistics such as the power spectrum, bispectrum, peak counts, and scattering transform coefficients down to scales where noise normally dominates. This enables access to small-scale cosmological information that would otherwise be inaccessible. The authors also report that the diffusion model offers stable training and the ability to produce multiple denoised realizations from a single trained network, while stress tests with maps of differing source redshifts confirm recovery of statistics at large scales even when small-scale performance declines.

Core claim

Diffusion models outperform generative adversarial networks when denoising noisy weak lensing convergence maps, recovering the correct cosmological statistics down to small scales. For example, the angular power spectrum is recovered up to multipoles ℓ ≲ 6000 with the diffusion model, whereas the noise power spectrum dominates from ℓ ≃ 2000 onward.
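The multipole ranges quoted above refer to the angular power spectrum of the convergence map. As a rough illustration of how such a spectrum can be measured on a small square patch, here is a minimal sketch that assumes the flat-sky approximation and logarithmic binning. The function name `flat_sky_power_spectrum` and the parameters `side_rad` and `n_bins` are hypothetical, and this is not necessarily the estimator used in the paper.

```python
import numpy as np

def flat_sky_power_spectrum(kappa, side_rad, n_bins=20):
    """Azimuthally averaged power spectrum of a square map (flat-sky sketch).

    kappa    : 2D array of shape (N, N), e.g. a convergence map
    side_rad : side length of the patch in radians (assumed square)
    Returns the geometric bin centres in multipole and the binned C_ell.
    """
    n = kappa.shape[0]
    pix = side_rad / n                          # pixel size in radians
    # Approximate the continuous Fourier transform: FT ~ pix^2 * DFT,
    # and C_ell = |FT|^2 / patch area.
    ft = np.fft.fft2(kappa) * pix**2
    power2d = np.abs(ft) ** 2 / side_rad**2
    # Multipole of each Fourier pixel: ell = 2 * pi * spatial frequency.
    freq = np.fft.fftfreq(n, d=pix)
    lx, ly = np.meshgrid(2 * np.pi * freq, 2 * np.pi * freq, indexing="ij")
    ell = np.hypot(lx, ly)
    # Log-spaced bins from the fundamental mode to the Nyquist multipole.
    edges = np.logspace(np.log10(2 * np.pi / side_rad),
                        np.log10(np.pi / pix), n_bins + 1)
    which = np.digitize(ell.ravel(), edges) - 1
    valid = (which >= 0) & (which < n_bins)     # drops ell = 0 and the corners
    counts = np.bincount(which[valid], minlength=n_bins)
    sums = np.bincount(which[valid], weights=power2d.ravel()[valid],
                       minlength=n_bins)
    # Bins that contain no Fourier modes are returned as NaN.
    cl = np.where(counts > 0, sums / np.maximum(counts, 1), np.nan)
    centres = np.sqrt(edges[:-1] * edges[1:])
    return centres, cl
```

With this normalisation, a white-noise map with per-pixel variance σ² gives a flat spectrum equal to σ² times the pixel area, which is a convenient sanity check before comparing true, noisy, and denoised maps bin by bin.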

What carries the argument

A diffusion model trained on 39,000 mock weak lensing mass maps to remove shape noise, compared directly against a GAN trained on the same data.
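To make "shape noise" concrete, the sketch below adds white Gaussian noise to a convergence map using one common convention, in which the per-pixel standard deviation is σ_e / sqrt(n_gal · A_pix). This convention, the function name `add_shape_noise`, and all parameter names are assumptions made for illustration; the paper's mock pipeline may differ (for example by a factor of √2, depending on how σ_e is defined).

```python
import numpy as np

def add_shape_noise(kappa, pix_arcmin, n_gal_arcmin2, sigma_e, rng):
    """Add white Gaussian shape noise to a convergence map (illustrative only).

    pix_arcmin    : pixel side length in arcmin
    n_gal_arcmin2 : source galaxy number density per arcmin^2
    sigma_e       : intrinsic ellipticity dispersion (convention assumed here)
    rng           : a numpy.random.Generator
    Returns the noisy map and the per-pixel noise standard deviation.
    """
    # Expected number of source galaxies per pixel.
    n_per_pixel = n_gal_arcmin2 * pix_arcmin**2
    # Averaging over galaxies reduces the noise by sqrt(N) per pixel.
    sigma_pix = sigma_e / np.sqrt(n_per_pixel)
    noise = rng.normal(0.0, sigma_pix, size=np.shape(kappa))
    return np.asarray(kappa) + noise, sigma_pix
```

A denoiser is then trained on pairs of noisy and true maps built this way; the numerical values passed in would come from the survey configuration being mocked, which is not specified here.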

If this is right

Analyses of weak lensing surveys can incorporate information from smaller angular scales without being limited by shape noise.
Cosmological parameter constraints derived from weak lensing can tighten because more modes remain usable after denoising.
Multiple independent denoised realizations can be generated efficiently once the diffusion model is trained, aiding covariance estimation.
Stress-test results indicate that the model remains useful for large-scale statistics even when applied to maps with source redshifts different from the training set.
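The point about multiple denoised realizations can be illustrated with a short sketch: given one summary-statistic vector per realization, the ensemble mean and covariance follow directly. The helper name `ensemble_mean_and_cov` and the array layout are hypothetical. Note also that the scatter among realizations drawn for a single noisy input reflects the model's own spread and is not by itself a full sample-variance estimate.

```python
import numpy as np

def ensemble_mean_and_cov(stats):
    """Mean and covariance of summary statistics over an ensemble.

    stats : array-like of shape (n_realisations, n_bins), one row per
            denoised realisation (e.g. a binned power spectrum).
    """
    stats = np.asarray(stats, dtype=float)
    mean = stats.mean(axis=0)
    # rowvar=False treats columns as variables; np.cov uses the unbiased
    # (n - 1) normalisation by default.
    cov = np.cov(stats, rowvar=False)
    return mean, cov
```

In practice the number of realizations should comfortably exceed the number of bins, otherwise the estimated covariance is singular or very noisy.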

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be combined with existing reconstruction techniques that use external data such as galaxy clustering to further reduce residual biases at intermediate scales.
Training on mocks that include realistic systematics beyond shape noise, such as multiplicative bias or intrinsic alignments, would test robustness for next-generation surveys.
The computational cost of training the diffusion model may be offset by its ability to produce an ensemble of outputs, potentially reducing the need for many separate noise realizations in simulation pipelines.

Load-bearing premise

The statistical properties of the mock observations used for training match those of real observational noise and the underlying cosmology closely enough that performance on mocks carries over to actual survey data.

What would settle it

Apply the trained diffusion model to real weak lensing maps from an ongoing survey and check whether the recovered power spectrum, bispectrum, and peak counts match independent measurements or simulations that include the same noise properties.

Original abstract

Removing the shape noise from the observed weak lensing (WL) field, i.e., denoising, enhances the potential of WL by giving access to information at small scales, where the shape noise would otherwise dominate. We utilise two machine learning (ML) models for denoising: a generative adversarial network (GAN) and a diffusion model (DM). We evaluate the denoising performance of GAN and DM using a large suite of mock WL observations, which serve as the training and test data sets. We apply denoising to 1,000 noisy mass maps with GAN and DM models trained on 39,000 mock observations. Both models reproduce the true convergence map fairly well on large scales. We then measure cosmological statistics: power spectrum, bispectrum, one-point probability distribution function, peak and minima counts, and scattering transform coefficients. We find that DM outperforms GAN in almost all considered statistics and recovers the correct statistics down to small scales. For example, the angular power spectrum can be recovered with DM up to multipoles $\ell \lesssim 6000$, while the noise power spectrum dominates from $\ell \simeq 2000$. We also conduct stress tests on the trained model by denoising maps whose characteristics, e.g., source redshifts, differ from those of the training data. The performance degrades at small scales, but the statistics can still be recovered at large scales. Though training DM is more computationally demanding than training GAN, it has several advantages: numerically stable training, higher performance in the reconstruction of cosmological statistics, and the ability to sample multiple realisations once the model is trained. DM is known to generate higher-quality images than GAN in real-world problems, and this superiority is confirmed in the WL denoising problem as well.
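Among the statistics listed in the abstract, peak and minima counts are the simplest to sketch. The example below assumes a peak is an interior pixel strictly higher than all eight of its neighbours, which is a common definition but not confirmed to be the paper's; the function name `count_peaks` and the choice of bin edges are hypothetical.

```python
import numpy as np

def count_peaks(kappa, bin_edges):
    """Histogram of local maxima in a 2D map.

    A peak is taken to be an interior pixel strictly higher than all
    eight neighbours (assumed definition). Border pixels are ignored.
    Returns the counts per height bin and the array of peak heights.
    """
    k = np.asarray(kappa, dtype=float)
    ny, nx = k.shape
    centre = k[1:-1, 1:-1]
    is_peak = np.ones(centre.shape, dtype=bool)
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            if dy == 0 and dx == 0:
                continue
            # Shifted view of the map aligned with the interior pixels.
            neighbour = k[1 + dy : ny - 1 + dy, 1 + dx : nx - 1 + dx]
            is_peak &= centre > neighbour
    heights = centre[is_peak]
    counts, _ = np.histogram(heights, bins=bin_edges)
    return counts, heights
```

Minima counts follow from the same function applied to the negated map, `count_peaks(-kappa, bin_edges)`, with the bin edges then referring to negated heights.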

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Diffusion models beat GANs at recovering small-scale stats from noisy weak-lensing mocks, but the mock-to-real transfer is still unproven.

The headline result is that diffusion models recover the input convergence statistics better than GANs across power spectrum, bispectrum, peaks, minima, PDF, and scattering coefficients when both are trained and tested on the same 39k + 1k mock suite. The power spectrum stays usable to ℓ ≲ 6000 with DM while raw noise dominates from ℓ ≃ 2000. That side-by-side test on multiple higher-order statistics is the concrete new piece here; prior work had applied diffusion models to image denoising but not with this systematic WL comparison and this set of summary statistics.

Referee Report

3 major / 2 minor

Summary. The manuscript compares diffusion models (DM) and generative adversarial networks (GAN) for denoising weak lensing convergence maps. Using 39,000 mock maps for training and 1,000 held-out maps for testing, it reports that DM outperforms GAN across multiple cosmological statistics including the angular power spectrum (recovered to ℓ ≲ 6000), bispectrum, one-point PDF, peak and minima counts, and scattering transform coefficients. Stress tests with altered source redshifts show degradation at small scales but retained performance at large scales. The paper notes DM advantages in training stability and the ability to generate multiple realizations post-training.

Significance. If the mock-based results hold, the work provides a systematic empirical demonstration that diffusion models can recover small-scale weak lensing information more effectively than GANs, with potential benefits for cosmological analyses from future surveys. Strengths include the large training set size, evaluation on a broad suite of statistics, and explicit stress testing; these elements support the central performance comparison.

major comments (3)

[Results section] The claim that DM recovers the 'correct statistics' down to small scales (e.g., power spectrum to ℓ ≲ 6000) lacks quantitative support such as fractional residuals, reduced chi-squared values, or error budgets on the recovered spectra; without these, it is difficult to judge whether the improvement over GAN is statistically significant or within the mock noise level.
[Stress-test subsection] Performance degradation is reported for maps with source redshifts differing from the training distribution, but no direct head-to-head comparison of DM versus GAN is provided under these conditions; this comparison is load-bearing for assessing whether the reported superiority persists under realistic distribution shifts.
[Methods section] The manuscript does not detail the precise error budget or how the mock suite incorporates variations in baryonic feedback, photometric redshift scatter, or survey masks; these omissions limit evaluation of whether the DM advantage is robust to unmodeled observational effects.
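The quantitative checks requested in the first major comment could be computed along the following lines. This is a minimal sketch that assumes a diagonal covariance (independent bins) and no fitted parameters, so the number of degrees of freedom equals the number of bins; the function names are hypothetical and a full analysis would use the complete covariance matrix.

```python
import numpy as np

def fractional_residual(recovered, truth):
    """Per-bin fractional residual (recovered - truth) / truth."""
    recovered = np.asarray(recovered, dtype=float)
    truth = np.asarray(truth, dtype=float)
    return (recovered - truth) / truth

def reduced_chi2(recovered, truth, sigma):
    """Chi-squared per bin, assuming independent bins with errors sigma."""
    recovered = np.asarray(recovered, dtype=float)
    truth = np.asarray(truth, dtype=float)
    sigma = np.asarray(sigma, dtype=float)
    pulls = (recovered - truth) / sigma
    # No parameters are fitted here, so dof is taken as the bin count.
    return np.sum(pulls**2) / pulls.size
```

Evaluating both quantities for the DM and GAN outputs on the same test maps, with `sigma` taken from the scatter across mocks, would show whether the difference between the two models exceeds the mock variance.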

minor comments (2)

[Figures] Figure captions and text should explicitly state the number of realizations used for each statistic and whether error bars represent sample variance or bootstrap estimates.
[Abstract] The abstract states DM training is 'more computationally demanding' but provides no wall-time or GPU-hour comparison; adding this would help readers weigh the performance gains against cost.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their positive assessment and constructive feedback on our manuscript. We have addressed all major comments in the revised version and provide detailed responses below.

Referee: [Results section] The claim that DM recovers the 'correct statistics' down to small scales (e.g., power spectrum to ℓ ≲ 6000) lacks quantitative support such as fractional residuals, reduced chi-squared values, or error budgets on the recovered spectra; without these, it is difficult to judge whether the improvement over GAN is statistically significant or within the mock noise level.

Authors: We agree with the referee that adding quantitative metrics would improve the clarity of our results. In the revised manuscript, we have included plots of fractional residuals for the power spectrum and other statistics, along with reduced chi-squared values comparing the denoised maps to the true ones for both DM and GAN. These additions demonstrate that the DM-recovered power spectrum agrees with the true spectrum within the mock variance up to ℓ ≈ 6000, and the difference from GAN is significant. We have updated the Results section accordingly. (revision: yes)

Referee: [Stress-test subsection] Performance degradation is reported for maps with source redshifts differing from the training distribution, but no direct head-to-head comparison of DM versus GAN is provided under these conditions; this comparison is load-bearing for assessing whether the reported superiority persists under realistic distribution shifts.

Authors: We thank the referee for pointing this out. We have performed the additional analysis and added a direct comparison of DM and GAN under the stress-test conditions with altered source redshifts. The new results, now included in the revised Stress-test subsection, show that DM continues to outperform GAN even when the input maps have source redshifts outside the training distribution, although both methods show degradation at small scales as noted. This supports the robustness of the DM advantage. (revision: yes)

Referee: [Methods section] The manuscript does not detail the precise error budget or how the mock suite incorporates variations in baryonic feedback, photometric redshift scatter, or survey masks; these omissions limit evaluation of whether the DM advantage is robust to unmodeled observational effects.

Authors: We acknowledge that more details on the mock construction would be helpful. The mocks used in this work are generated from N-body simulations with ray-tracing, and we have now expanded the Methods section to describe the error budget, including the shape noise model, and to clarify that baryonic feedback effects are not varied in this particular suite (we use dark-matter-only simulations), while photometric redshift scatter is incorporated via the source redshift distribution. Survey masks are not applied in the current mock suite as we focus on idealized full-sky maps; we have added a discussion of this limitation and its implications for the DM advantage in realistic settings. (revision: yes)

Circularity Check

0 steps flagged

No circularity: empirical ML comparison on held-out mocks

The paper trains GAN and DM models on 39,000 mock weak-lensing maps and evaluates denoising performance on a separate set of 1,000 held-out mocks by measuring recovery of power spectrum, bispectrum, PDF, peaks, minima, and scattering coefficients. No derivation, ansatz, uniqueness theorem, or first-principles prediction is presented that reduces by construction to a fitted parameter or self-citation inside the paper. All reported superiority of DM is an empirical outcome on the test mocks, with explicit stress tests for distribution shift (different source redshifts) noted as degrading small-scale performance. This is standard supervised learning evaluation and contains no load-bearing self-referential step.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the assumption that the mock weak-lensing simulations faithfully represent real observational noise and cosmology; no new physical entities are introduced.

axioms (1)

Domain assumption: Mock weak lensing maps generated from a fiducial cosmology plus shape noise accurately represent the statistical properties of real survey data.
Note: Training and test performance are evaluated exclusively on these mocks; transfer to real data is asserted but not demonstrated in the abstract.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Increasing the Precision of Surrogate Models for Weak Lensing Mass Maps with Flow Matching
astro-ph.CO · 2026-05 · unverdicted · novelty 7.0

A flow matching generative model produces weak lensing mass maps with fidelity improved to below 1% and 5% on basic and higher-order statistics relative to GAN benchmarks.

Machine-learning applications for weak-lensing cosmology
astro-ph.CO · 2026-05 · unverdicted · novelty 2.0

Machine learning techniques can mitigate limitations in traditional weak-lensing analyses and enhance extraction of cosmological information from galaxy imaging surveys.