pith. machine review for the scientific record.

arxiv: 2604.24862 · v1 · submitted 2026-04-27 · 🌌 astro-ph.IM · astro-ph.GA

Recognition: unknown

The effects of image augmentations when training machine learning models in astronomy

Ashley Spindler, Leon H. Butterworth

Pith reviewed 2026-05-07 17:31 UTC · model grok-4.3

classification 🌌 astro-ph.IM astro-ph.GA
keywords image augmentation · galaxy morphology classification · deep neural networks · machine learning · astronomical images · training dataset size · data augmentation effects

The pith

Image augmentations generally improve neural network performance on galaxy morphology classification, but benefits decrease sharply with larger training datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests the common practice of adding image augmentations when training deep neural networks to classify galaxy shapes from survey images. It retrains the same model on subsets of 230,000 DECaLS galaxies while changing which augmentations are applied and how many training examples are used. Results show that augmentations raise accuracy on average, yet the size of that lift shrinks markedly once the training set becomes large, and different reasonable augmentations produce comparable gains. This matters because astronomers routinely apply augmentations by default, even though the extra computation may yield little return when data volumes are already high.
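To make the setup concrete, here is a minimal sketch of the kind of augmentation pipelines being compared, written with torchvision. The specific transforms and parameter values are illustrative assumptions, not the paper's exact configurations.

```python
# Illustrative augmentation pipelines of increasing complexity, in the spirit
# of the paper's comparison. Transforms and parameters are assumed, not taken
# from the paper. Requires: pip install torch torchvision
import torchvision.transforms as T

# Pre-processing applied identically in every configuration (assumed).
preprocess = T.Compose([
    T.CenterCrop(224),   # fix the field of view around the galaxy
    T.ToTensor(),
])

# "No augmentation" baseline: pre-processing only.
baseline = preprocess

# Simple augmentations: flips and rotations are label-preserving for galaxy
# morphology, since orientation on the sky is arbitrary.
simple = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),
    T.RandomRotation(degrees=180),
    preprocess,
])

# More complex augmentations add geometric and photometric jitter; the paper
# finds this class costs extra training time without reliably helping.
complex_aug = T.Compose([
    T.RandomHorizontalFlip(),
    T.RandomRotation(degrees=180),
    T.RandomResizedCrop(224, scale=(0.8, 1.0)),
    T.ColorJitter(brightness=0.2, contrast=0.2),
    T.ToTensor(),
])
```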

Core claim

We find that, generally, the addition of image augmentations does improve a deep neural network's performance; however, this improvement is significantly diminished as the training dataset size increases. The choice of specific augmentations (provided they are sensible) seems less important than simply having augmentations, as different augmentations result in similar increases in performance. We find that for a model of a given size, there exists a saturation point (when the model's capacity has been filled with data) that cannot be surpassed with data augmentations. We find that more complex augmentations result in longer training times and might not lead to improved performance.

What carries the argument

Retraining the Zoobot deep neural network on Galaxy Zoo DECaLS images under controlled variations in image augmentation schemes and training set sizes up to 230,000 examples, then measuring changes in classification accuracy.
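The controlled comparison amounts to a grid over augmentation scheme and training fraction. A hedged sketch of that outer loop follows; `load_train_ids`, `train_zoobot`, and `evaluate` are hypothetical stand-ins for the actual Zoobot training and evaluation code, and the sweep values are assumptions.

```python
# Sketch of the experimental grid: the same architecture is retrained under
# each augmentation scheme at several training-set sizes, and test accuracy
# is compared.
import random

FULL_POOL = 230_000                    # Galaxy Zoo DECaLS training pool
FRACTIONS = [0.01, 0.05, 0.25, 1.0]    # assumed sweep; the paper's grid may differ
SCHEMES = ["none", "simple", "complex"]

train_ids = load_train_ids()           # hypothetical: ids of the full pool
random.seed(0)                         # fix the subsets across schemes

results = {}
for frac in FRACTIONS:
    subset = random.sample(train_ids, int(frac * FULL_POOL))
    for scheme in SCHEMES:
        model = train_zoobot(subset, augmentation=scheme)
        results[(scheme, frac)] = evaluate(model)

# The paper's headline pattern: results[("simple", f)] - results[("none", f)]
# shrinks as f grows, and "simple" vs "complex" differ little at any f.
```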

If this is right

  • Simpler augmentations often deliver nearly the same accuracy gains as complex ones while requiring less training time.
  • Astronomers can skip or limit augmentations once their training set reaches the size that saturates the chosen model's capacity.
  • Default inclusion of augmentations without checking dataset size risks wasting compute with no added accuracy.
  • For a fixed model size, adding more real training images eventually outperforms any augmentation strategy.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The same saturation pattern may appear in other astronomical image tasks such as source detection or photometric redshift estimation.
  • Researchers could test whether increasing model size in step with data volume restores the value of augmentations at larger scales.
  • Survey teams might allocate effort toward acquiring more real observations instead of engineering additional augmentations once datasets exceed certain thresholds.

Load-bearing premise

The observed accuracy differences result primarily from the chosen image augmentations rather than from differences in training hyperparameters, random seeds, or limits of the particular model architecture.

What would settle it

Re-running the full set of training experiments with the same augmentation choices but with multiple random seeds and a grid of hyperparameter values to check whether the reported performance gaps stay consistent or shrink to noise levels.
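As a concrete sketch of that settling experiment: repeat each configuration over several seeds, then ask whether the augmentation gain exceeds the seed-to-seed spread. The helper `train_and_score` and the seed count are assumptions, not the paper's protocol.

```python
# Multi-seed check: is the augmentation gain larger than run-to-run noise?
# `train_and_score` is a hypothetical function returning test accuracy for
# one full training run under the given scheme, data fraction, and seed.
import statistics

SEEDS = [0, 1, 2, 3, 4]

def gap_vs_noise(fraction):
    aug = [train_and_score("simple", fraction, seed=s) for s in SEEDS]
    none = [train_and_score("none", fraction, seed=s) for s in SEEDS]
    gap = statistics.mean(aug) - statistics.mean(none)
    # Pooled seed-to-seed spread; a gap well above this survives the check.
    noise = (statistics.stdev(aug) ** 2 + statistics.stdev(none) ** 2) ** 0.5
    return gap, noise

for frac in (0.01, 0.25, 1.0):
    gap, noise = gap_vs_noise(frac)
    print(f"fraction={frac}: gap={gap:.4f} +/- {noise:.4f}")
```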

Figures

Figures reproduced from arXiv: 2604.24862 by Ashley Spindler, Leon H. Butterworth.

Figure 1. Decision tree used by GZD-5 (Walmsley et al. 2022a), showing the order in which questions are posed to volunteers based on their answers to previous questions. GZD-5 has 11 possible questions; Zoobot learns from only 10 of them, ignoring the last question regarding rare features…
Figure 2. Selection of images from the DECaLS survey, all included in GZD-5. The images are created from the grz bands; the pixel scale of each image is interpolated from DECaLS' 0.262 arcsec per pixel scale.
Figure 3. Visual representation of the transformation process for the six model configurations tested. Each transformation step is shown as an example image, with the process moving left to right and each configuration occupying its own row. The process contains two sections, "Pre-processing"…
Figure 4. Accuracy for all galaxies in the test sample; the x-axis shows the size of the training sample used to train the model, as a percentage of the total training size. The question being tested is written above each plot.
Figure 5. Accuracy for confident galaxies in the test sample; the x-axis shows the size of the training sample used to train the model, as a percentage of the total training size. The question being tested is written above each plot.
Figure 6. Number of epochs needed to train each model configuration.
read the original abstract

We measure the influence of image augmentations and training dataset size when training a deep neural network to classify galaxy morphology. Data augmentation is an integral step when training machine learning models and often astronomers add augmentations assuming they will always improve the performance of their models. We train multiple versions of the same pre-existing Zoobot model using different image augmentations and different dataset sizes from 230,000 galaxy images from Galaxy Zoo DECaLS to determine whether this assumption is necessarily true. We find that generally, the addition of image augmentations does improve a deep neural network's performance, however, this improvement is significantly diminished as the training dataset size increases. The choice of specific augmentations (provided they are sensible) does not seem to be as important as simply having augmentations as different augmentations result in similar increases in performances. We find that for a model of a given size, there exists a saturation point (when the model's capacity has been filled with data) that cannot be surpassed with data augmentations. We find that more complex augmentations result in longer training times and might not lead to improved performance. If augmentations are added to the training process (which is recommended), simpler augmentations might be sufficient, depending on the size of the dataset and model. We therefore encourage astronomers to carefully consider their use of image augmentations in an effort to reduce wasted time and computational resources.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript presents an empirical study examining the impact of various image augmentations and training dataset sizes on the performance of the Zoobot deep neural network for galaxy morphology classification. Using subsets drawn from 230,000 DECaLS galaxy images, the authors train multiple versions of the model under different augmentation regimes and conclude that augmentations generally improve performance, but the gains diminish substantially with larger training sets; that the specific choice among sensible augmentations matters less than including any; that a model-capacity saturation point exists beyond which augmentations cannot help; and that more complex augmentations increase training time without commensurate benefits.

Significance. If the central empirical trends hold after proper statistical controls, the work supplies actionable guidance for the astronomy ML community on efficient use of augmentations, potentially reducing wasted compute when datasets are already large. The controlled design across multiple dataset sizes and augmentation types is a clear strength and directly addresses a common practical question in the field.

major comments (3)
  1. [Methods] Methods section: the experimental protocol does not describe repeated training runs with different random seeds or report standard deviations on any performance metric. Because the central claim is that augmentation-induced gains shrink with dataset size, the absence of these controls leaves open the possibility that the reported trends are dominated by single-run stochasticity rather than the augmentations themselves.
  2. [Results] Results section (and associated figures/tables): no error bars, confidence intervals, or statistical tests are provided for the accuracy or loss differences across augmentation choices and dataset sizes. Without these, it is impossible to assess whether the observed diminishing returns exceed run-to-run variance, especially at the largest dataset sizes where variance is typically smaller.
  3. [Discussion] Discussion: the assertion of a hard 'saturation point' determined by model capacity is stated qualitatively but is not supported by any quantitative analysis of learning curves, effective capacity metrics, or ablation of model size; this claim is load-bearing for the recommendation that augmentations become irrelevant beyond a certain data volume.
minor comments (2)
  1. [Introduction] The abstract and introduction would benefit from explicit citation of prior work on augmentation strategies in astronomical image classification to better situate the novelty of the controlled dataset-size sweep.
  2. Figure captions and axis labels should clearly indicate which specific augmentations correspond to each curve and whether the plotted metric is accuracy, F1, or another quantity.

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed comments, which have helped us identify areas where the manuscript can be strengthened with additional statistical context and clarifications. We address each major comment point by point below, outlining the revisions we plan to make.

read point-by-point responses
  1. Referee: [Methods] Methods section: the experimental protocol does not describe repeated training runs with different random seeds or report standard deviations on any performance metric. Because the central claim is that augmentation-induced gains shrink with dataset size, the absence of these controls leaves open the possibility that the reported trends are dominated by single-run stochasticity rather than the augmentations themselves.

    Authors: We agree that repeated runs with varied random seeds would provide stronger evidence against stochastic effects. The original experiments used single training runs per configuration due to the high computational cost of retraining Zoobot across the full grid of dataset sizes and augmentations. However, the reported trends are consistent and monotonic across independent dataset sizes, which would be unlikely if dominated by random seed effects. In the revised manuscript, we will expand the Methods section to explicitly state that single runs were performed, discuss the implications for interpreting variance, and note that the diminishing-returns pattern aligns with prior scaling studies in the literature. revision: partial

  2. Referee: [Results] Results section (and associated figures/tables): no error bars, confidence intervals, or statistical tests are provided for the accuracy or loss differences across augmentation choices and dataset sizes. Without these, it is impossible to assess whether the observed diminishing returns exceed run-to-run variance, especially at the largest dataset sizes where variance is typically smaller.

    Authors: We acknowledge that the absence of error bars and formal statistical comparisons limits the ability to quantify whether differences exceed typical run-to-run variance. Because the study relied on single runs, we cannot retroactively compute standard deviations. We will revise the Results section and figure captions to include a clear statement of this limitation and to emphasize that the central trend (diminishing augmentation gains with increasing data volume) is observed consistently across multiple independent dataset sizes. Where feasible, we will also add a small number of repeated runs for the largest dataset sizes to provide indicative variability estimates in the revised version. revision: partial

  3. Referee: [Discussion] Discussion: the assertion of a hard 'saturation point' determined by model capacity is stated qualitatively but is not supported by any quantitative analysis of learning curves, effective capacity metrics, or ablation of model size; this claim is load-bearing for the recommendation that augmentations become irrelevant beyond a certain data volume.

    Authors: The saturation point is presented as an empirical observation: performance gains from augmentations plateau once dataset size is large enough that the fixed-capacity Zoobot model is effectively data-saturated. This is visible in the flattening of the accuracy curves in our figures. We will strengthen the Discussion by adding references to scaling-law literature (e.g., on model capacity and data requirements), by providing a more quantitative description of the observed learning-curve plateaus, and by clarifying that the claim is specific to the Zoobot architecture and dataset rather than a universal theoretical limit. We will also soften the language from 'hard saturation point' to 'empirically observed saturation regime' to better reflect the evidence. revision: yes
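One way to make the proposed "quantitative description of the learning-curve plateaus" concrete is to fit a saturating power law to accuracy versus training-set size and read off where the curve flattens. A minimal sketch with scipy follows, assuming the common scaling-law form acc(n) = a − b·n^(−c); the data points and the 0.5-point saturation threshold are placeholders, not the paper's measurements or analysis.

```python
# Fit a saturating power law acc(n) = a - b * n**(-c) to a learning curve
# and estimate where additional data stops paying off.
import numpy as np
from scipy.optimize import curve_fit

def power_law(n, a, b, c):
    return a - b * n ** (-c)

sizes = np.array([2_300, 11_500, 57_500, 230_000], dtype=float)  # placeholder
accs = np.array([0.80, 0.86, 0.895, 0.905])                      # placeholder

(a, b, c), _ = curve_fit(power_law, sizes, accs, p0=(0.92, 1.0, 0.3))
print(f"fitted plateau accuracy a = {a:.3f}")

# Define "saturated" as within 0.5 percentage points of the plateau (an
# arbitrary threshold) and solve a - b*n**(-c) = a - 0.005 for n.
n_sat = (b / 0.005) ** (1 / c)
print(f"approx. saturation size: {n_sat:.0f} images")
```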

Circularity Check

0 steps flagged

No significant circularity; empirical study with direct experimental results

full rationale

This is an empirical machine-learning paper that trains variants of the Zoobot model on DECaLS galaxy images and reports observed accuracy changes under different augmentation policies and training-set sizes. No equations, fitted parameters, or derivations are present that could reduce to their own inputs by construction. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. All performance claims rest on the training runs themselves rather than on any algebraic identity or prior self-referential result, satisfying the criteria for a self-contained empirical finding.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper's claims are based on empirical results from model training rather than any axioms, free parameters, or invented entities.

pith-pipeline@v0.9.0 · 5546 in / 1185 out tokens · 101031 ms · 2026-05-07T17:31:20.924854+00:00 · methodology


Reference graph

Works this paper leans on

2 extracted references · 1 canonical work page

  1. [1]

Aihara H., et al., 2011, ApJS, 193, 29; Almeida A., et al., 2023, ApJS, 267, 44; Astropy Collaboration et al., 2013, A&A, 558, A33; Astropy Collaboration et al., 2018, AJ, 156, 123; Astropy Collaboration et al., 2022, ApJ, 935, 167; Bait O., Barway S., Wadadekar Y., 2017, MNRAS…

  2. [2]

    Smooth Or Featured

The difference between the two sets of figures is that all of the individual subplots in figures A1 and A2 share the same y-axis range, allowing for easier comparison between questions. Figure A1 shows that the accuracies between questions can vary by large amounts; for instance, the 'Has Spiral Arms' question has accuracies over 90% while the 'Spiral Arm Count' questi…