
arxiv: 2605.20405 · v1 · pith:DGS6LKN3new · submitted 2026-05-19 · eess.IV · cs.AI · cs.CV · physics.med-ph

Disentangling Sampling from Training Budget in Class-Imbalanced CT Body Composition Segmentation

Pith reviewed 2026-05-21 07:17 UTC · model grok-4.3

keywords: sampling · training · episodic · under · batch · random · segmentation · weighted

The pith

Episodic sampling for class-balanced batches in low-data CT segmentation delays overfitting compared to random or weighted sampling, revealing training iteration budget as a key evaluation confound.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Medical CT scans often show class imbalance where common tissues dominate training and rare ones like specific fat deposits are poorly learned. Standard fixes either reweight the loss or pick images randomly or by class frequency. This work adapts episodic sampling from few-shot learning to ensure each training batch contains examples from all classes more evenly. Experiments use nine muscle and fat tissues from 210 public SAROS CT scans. With full data all three sampling methods reach similar Dice scores near 0.88. With limited data episodic sampling reaches 0.787 while others stay around 0.76, but this occurs with roughly twelve times more training iterations. When the number of iterations is matched across methods, random and weighted sampling overfit earlier while episodic sampling continues to improve for about three times longer before plateauing. The authors interpret the remaining edge as possible implicit regularization from balanced batches and flag training budget as an overlooked factor in such comparisons.
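
The class-first batch construction described above can be illustrated with a minimal sketch. It is not taken from the authors' repository: the function names, the `classes_per_batch` and `slices_per_class` parameters, and the choice to sample slices with replacement are all assumptions made for illustration.

```python
import random
from collections import defaultdict


def build_class_index(slice_labels):
    """Map each class id to the indices of the slices that contain it."""
    index = defaultdict(list)
    for slice_idx, classes in enumerate(slice_labels):
        for c in classes:
            index[c].append(slice_idx)
    return dict(index)


def episodic_batches(slice_labels, classes_per_batch, slices_per_class,
                     num_batches, seed=0):
    """Yield (chosen_classes, batch) pairs built class-first.

    Hypothetical scheme: pick the classes for the episode first, then draw
    slices that contain each chosen class, so rare classes are guaranteed
    to appear in the batch.
    """
    rng = random.Random(seed)
    index = build_class_index(slice_labels)
    all_classes = sorted(index)
    for _ in range(num_batches):
        k = min(classes_per_batch, len(all_classes))
        chosen = rng.sample(all_classes, k=k)
        batch = []
        for c in chosen:
            # Assumption: sample with replacement so a class that occurs in
            # only a few slices can still fill its quota.
            batch.extend(rng.choices(index[c], k=slices_per_class))
        yield chosen, batch
```

Random sampling would instead draw slice indices uniformly, so a class present in only a few slices could be missing from many batches; the sketch trades that for repeated exposure to the same rare-class slices.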

Core claim

Under matched training budgets, random and weighted sampling overfit earlier while episodic sampling improved for approximately three times more iterations before plateauing, identifying the training iteration budget as an under-recognized confound.
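
To make the confound concrete: under epoch-based training, two samplers that emit different numbers of batches per epoch take different numbers of optimizer steps, even with the same epoch count. The sketch below shows one hedged way to count steps and to cap every sampler at the same iteration budget. The helper names and the `batch_factory` callable are hypothetical and do not describe the paper's code.

```python
import math


def iterations_in_epoch_training(num_samples, batch_size, epochs):
    """Optimizer steps taken by epoch-based training (partial last batch kept)."""
    return epochs * math.ceil(num_samples / batch_size)


def fixed_budget(batch_factory, budget):
    """Yield exactly `budget` batches from any sampler.

    `batch_factory` is a hypothetical zero-argument callable that returns a
    fresh iterable of batches (one "epoch" of the sampler). The sampler is
    restarted whenever it runs out, until the budget is reached.
    """
    produced = 0
    while produced < budget:
        empty = True
        for batch in batch_factory():
            empty = False
            yield batch
            produced += 1
            if produced == budget:
                return
        if empty:
            raise ValueError("batch_factory produced no batches")
```

Wrapping each sampling strategy in `fixed_budget` with the same `budget` gives every strategy the same number of optimizer steps, which is the comparison the matched-budget experiments call for.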

Load-bearing premise

That differences in performance are driven primarily by the class-balancing property of episodic sampling rather than by unstated differences in hyperparameter schedules, batch construction details, or optimization dynamics.

Figures

Figures reproduced from arXiv: 2605.20405 by Dimitrios Karkalousos, Iason Skylitsis, Ivana Išgum.

Figure 1: Slice-wise prevalence of the nine tissue classes. The y-axis indicates the percentage of slices in which each …
Figure 2: Example axial slices at three vertebral levels (L4–first, L1–second, and T9–third row), showing the reference …
Figure 3: Per-class batch frequency per epoch for episodic (blue), random (orange), and weighted (green) sampling …
Figure 4: Representative segmentations at the L3 vertebral level for a test case under standard epoch-based full-data (a) …
Figure 5: Representative segmentations at the L3 vertebral level for a test case under low-data fixed-iteration (a) and …
Figure 6: Full-data 100% regime. (a; top–left) training Loss, (b; top–right) validation Loss, (c; bottom) per-class …
Figure 7: Low-data 10% regime. (a; top–left) training Loss, (b; top–right) validation Loss, (c; bottom) per-class …
Figure 8: Low-data 10% fixed 3,000 iterations with constant learning rate regime. (a; top–left) training Loss, (b; …
Figure 9: Low-data 10% iteration-calibrated regime. (a; top–left) training Loss, (b; top–right) validation Loss, (c; …
Original abstract

Class imbalance is a fundamental challenge in medical image segmentation, where frequent classes typically dominate training at the expense of rare classes. Loss-based approaches mitigate imbalance by reweighting the per-pixel loss within the batch, while sampling strategies control which images enter the batch. Yet neither explicitly controls which classes appear within the batch, leaving rare-class exposure only partially rebalanced. In this work, we adopt episodic sampling from few-shot learning to promote class-balanced batch construction in a fully supervised setting. We decouple episodic sampling from its conventional metric-learning context and evaluate it in body composition segmentation in CT. We compare episodic sampling against random and weighted sampling on nine muscle and adipose tissues, derived from 210 scans of the public SAROS dataset. Training is performed under full- and low-data regimes, with additional comparisons under matched training iteration budgets. Under full-data training, all three strategies performed comparably (mean Dice 0.882 for episodic, 0.878 for random and weighted). Under low-data training, episodic sampling outperformed random and weighted (0.787 vs. 0.758 and 0.762), driven by a 12-fold difference in training iterations. Under matched training budgets, random and weighted overfit earlier, while episodic improved for approximately three times more iterations before plateauing. Our findings identify the training iteration budget as an under-recognized confound in sampling strategies, motivating iteration-aware evaluation protocols for small datasets. Furthermore, the residual advantage of episodic sampling is consistent with an implicit regularization effect of class-balanced batches, offering a low-cost, model-agnostic strategy for class-imbalanced medical image segmentation. Code is available at https://github.com/iasonsky/episodic-sampling.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper claims that episodic sampling (decoupled from metric learning) outperforms random and weighted sampling for class-imbalanced CT body composition segmentation on the public SAROS dataset (210 scans, 9 tissues). Under full-data training all three strategies yield comparable mean Dice scores (~0.88). Under low-data regimes episodic sampling reaches higher Dice (0.787 vs. 0.758/0.762) because it trains for ~12× more iterations before plateauing. When iteration budgets are explicitly matched, random and weighted sampling overfit earlier while episodic sampling continues improving for ~3× more iterations; the authors attribute the residual advantage to an implicit regularization effect of class-balanced batches and identify training budget as an under-recognized confound, motivating iteration-aware evaluation protocols. Code is released.

Significance. If the matched-budget results hold under fully controlled conditions, the work is significant for medical image segmentation: it provides concrete evidence that sampling-strategy comparisons are confounded by iteration count, supplies a simple model-agnostic remedy via episodic batch construction, and demonstrates the effect on a public dataset with explicit full-data / low-data / matched-budget contrasts. The released code supports reproducibility.

major comments (1)
  1. [Matched-budget experiments] The central attribution—that the ~3× longer plateau under matched budgets arises from the class-balancing property of episodic sampling rather than from unstated differences in optimizer, learning-rate schedule, weight decay, batch size, or episode-construction details—is load-bearing. The manuscript must explicitly state (and ideally tabulate) that every hyper-parameter and optimization setting was identical across the three sampling strategies; without this, the longer training trajectory could be explained by optimization dynamics instead of class balance. (See the matched-budget paragraph in the abstract and the corresponding experimental subsection.)
minor comments (2)
  1. [Abstract / Results] The abstract and results sections report point estimates for Dice scores but omit per-fold or per-run standard deviations and any statistical significance tests for the reported differences (e.g., 0.787 vs. 0.758/0.762).
  2. [Methods] Implementation details on how episodic sampling is realized in the fully-supervised regime (exact episode size, how rare-class co-occurrence is enforced, batch-size equivalence with random/weighted baselines) should be expanded for reproducibility.
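
The reporting requested in minor comment 1 could be sketched as follows. This is an illustration only: the paper's fold or run structure is not described here, so the sketch assumes paired per-run Dice scores for two strategies and uses an exact sign-flip permutation test, which is one of several reasonable choices. Function names are hypothetical.

```python
import itertools
import statistics


def mean_std(scores):
    """Mean and sample standard deviation of per-run Dice scores."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_sign_flip_pvalue(scores_a, scores_b):
    """Exact two-sided sign-flip permutation test on paired differences.

    Enumerates all 2**n sign patterns, so it is only practical for a small
    number of paired runs (assumed here to be a handful of folds or seeds).
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point noise in the sums.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total
```

With n paired runs the smallest attainable two-sided p-value is 2 / 2**n, so a handful of runs cannot support strong significance claims on its own; that limitation is itself worth reporting.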

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed review. The major comment raises a valid point about ensuring full transparency in the matched-budget experiments. We have revised the manuscript to explicitly confirm and tabulate identical hyperparameters across sampling strategies, which directly addresses the concern and strengthens the attribution to class balancing.

Point-by-point responses
  1. Referee: The central attribution—that the ~3× longer plateau under matched budgets arises from the class-balancing property of episodic sampling rather than from unstated differences in optimizer, learning-rate schedule, weight decay, batch size, or episode-construction details—is load-bearing. The manuscript must explicitly state (and ideally tabulate) that every hyper-parameter and optimization setting was identical across the three sampling strategies; without this, the longer training trajectory could be explained by optimization dynamics instead of class balance. (See the matched-budget paragraph in the abstract and the corresponding experimental subsection.)

    Authors: We agree that explicit confirmation is necessary to rule out optimization confounds. In the revised manuscript we have added a new paragraph in Section 3.2 (Experimental Setup) stating that all three sampling strategies used identical settings: Adam optimizer with the same initial learning rate and cosine decay schedule, identical weight decay, fixed batch size of 4, and the same episode-construction hyperparameters for episodic sampling. We have also inserted Table 2, which tabulates every hyperparameter value side-by-side for random, weighted, and episodic sampling. These additions make clear that the observed differences in plateau length arise from the sampling strategy itself rather than from any unstated optimization differences.

    Revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with no derivations or self-referential predictions

Full rationale

The paper reports direct empirical results from training segmentation models under different sampling strategies (random, weighted, episodic) on the SAROS dataset, measuring Dice scores in full-data and low-data regimes with matched iteration budgets. No equations, fitted parameters renamed as predictions, or derivation chains appear; performance differences are presented as observed outcomes rather than constructed from internal definitions. The adoption of episodic sampling is described as a methodological choice decoupled from metric learning, with no load-bearing self-citations or uniqueness theorems invoked to force the conclusions. The central claim that training iteration budget is a confound rests on the reported iteration counts and plateau behaviors, which are external to any internal fitting or redefinition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper is empirical and introduces no new mathematical entities or free parameters; it relies on standard deep-learning training assumptions and a public dataset.

axioms (1)
  • domain assumption: Dice coefficient is a suitable primary metric for comparing segmentation performance across sampling strategies.
    All reported results and conclusions rest on Dice scores without alternative metrics or sensitivity analysis.
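
For reference, the Dice coefficient of a predicted region P and a reference region T is 2|P ∩ T| / (|P| + |T|). The sketch below computes it per class on flat label sequences and averages over classes. How the paper aggregates across classes and scans is not stated in this review, so the averaging here, like the function names, is an assumption.

```python
def dice_per_class(pred, target, num_classes):
    """Per-class Dice, 2|P∩T| / (|P|+|T|), over flat label sequences."""
    scores = {}
    for c in range(num_classes):
        p = sum(1 for x in pred if x == c)
        t = sum(1 for y in target if y == c)
        inter = sum(1 for x, y in zip(pred, target) if x == c and y == c)
        if p + t == 0:
            # Class absent from both prediction and reference: Dice is
            # undefined, so it is skipped here (an assumed convention).
            continue
        scores[c] = 2 * inter / (p + t)
    return scores


def mean_dice(pred, target, num_classes):
    """Unweighted mean of the per-class Dice scores that are defined."""
    scores = dice_per_class(pred, target, num_classes)
    if not scores:
        raise ValueError("no class is present in pred or target")
    return sum(scores.values()) / len(scores)
```

Because the mean is unweighted, a rare class counts as much as a common one, which is why a sampling strategy that helps only rare tissues can still move the mean Dice.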

pith-pipeline@v0.9.0 · 5861 in / 1186 out tokens · 51532 ms · 2026-05-21T07:17:34.720488+00:00 · methodology

