arXiv: 2605.12753 · v1 · submitted 2026-05-12 · eess.IV · cs.CV · cs.LG

Optimization in Sparse 2D to Dense 3D Weakly Supervised Learning: Application to Multi-Label Segmentation of Large ex vivo MRI Data

Brandon Bujak, Charidimos Tsagkas, Daniel Reich, Govind Nair, Irene Cortese, Julien Cohen-Adad, Kuan Yi Wang, Paul Hoareau, Roy Sun

Pith reviewed 2026-05-14 19:32 UTC · model grok-4.3

classification: eess.IV · cs.CV · cs.LG

keywords: weakly supervised segmentation · sparse to dense learning · 3D MRI segmentation · ex vivo spinal cord · multi-label lesion segmentation · pseudo-labeling · regularization strategies · data scarcity

The pith

2D and 3D segmentation models require distinct regularization when trained from sparse 2D MRI annotations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper investigates the challenges of training 3D models for segmenting lesions in high-resolution ex vivo spinal cord MRI using only a small number of annotated 2D slices. A 2D teacher model generates pseudo-labels to supervise a 3D student model. The study reveals that methods like aggressive spatial augmentation and soft labeling, which improve the 2D model's accuracy on white matter lesions by more than 11 Dice points, actually lower performance when used with the 3D model. Human-oriented preprocessing steps such as CLAHE also cause large accuracy losses, reducing gray matter lesion Dice scores by around 25 points.
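The "Dice points" quoted above are Dice coefficients multiplied by 100, reported per class. A minimal sketch of a per-class Dice computation on integer label maps follows; the class ids and the toy arrays are assumptions made for illustration and do not come from the paper.

```python
import numpy as np

def dice_per_class(pred, target, class_ids):
    """Return {class_id: Dice in [0, 1]} for two integer label maps of equal shape."""
    scores = {}
    for c in class_ids:
        p = pred == c
        t = target == c
        denom = p.sum() + t.sum()
        # Convention assumed here: a class absent from both maps scores 1.0.
        scores[c] = 1.0 if denom == 0 else 2.0 * np.logical_and(p, t).sum() / denom
    return scores

# Hypothetical label ids: 0 = background, 1 = WM lesion, 2 = GM lesion.
target = np.array([[0, 1, 1, 0],
                   [0, 1, 2, 2],
                   [0, 0, 2, 2]])
pred = np.array([[0, 1, 1, 0],
                 [0, 0, 2, 2],
                 [0, 0, 2, 0]])

scores = dice_per_class(pred, target, class_ids=[1, 2])
points = {c: 100.0 * s for c, s in scores.items()}  # Dice expressed in points
```

On this scale, a gain of 11 points corresponds to a Dice increase of 0.11.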

Core claim

The central finding is a divergence in optimal training strategies: while 2D teachers benefit from strong spatial augmentation and soft-label regularization to handle data scarcity, propagating these to 3D students trained on dense pseudo-labels degrades results. Human-centric preprocessing disrupts global statistical cues essential for machine learning, and 3D models need more conservative regularization due to their different optimization landscapes.
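To make "soft-label regularization" concrete, here is a minimal sketch of uniform label smoothing applied to an integer label map. This is a generic technique used as a stand-in; the page does not specify the paper's exact soft-labeling scheme, and the function name and the value of `eps` are assumptions.

```python
import numpy as np

def smooth_labels(label_map, num_classes, eps=0.1):
    """One-hot encode an integer label map and apply uniform label smoothing."""
    one_hot = np.eye(num_classes)[label_map]  # shape (..., num_classes)
    # Move a fraction eps of the probability mass to a uniform distribution.
    return (1.0 - eps) * one_hot + eps / num_classes

labels = np.array([[0, 1],
                   [2, 1]])
soft = smooth_labels(labels, num_classes=3, eps=0.1)
```

Each voxel's target still sums to 1 and keeps its original class as the argmax; only the confidence of the target is reduced.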

What carries the argument

The sparse-to-dense weakly supervised pipeline using a 2D teacher model to generate pseudo-labels for training a 3D student model on multi-label segmentation of MS lesions in spinal cord MRI.
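The sparse-to-dense step can be sketched as follows, assuming the teacher is applied slice by slice and its outputs are restacked into a volume. The thresholding "teacher" is a hypothetical placeholder for a trained 2D network, and all names are illustrative.

```python
import numpy as np

def teacher_predict_slice(slice_2d, threshold=0.5):
    """Hypothetical stand-in for a trained 2D teacher: thresholds intensity.
    A real teacher would be a trained 2D segmentation network."""
    return (slice_2d > threshold).astype(np.uint8)

def densify(volume, teacher, axis=0):
    """Run the 2D teacher on every slice along `axis` and restack the
    predictions into a 3D pseudo-label volume of the same shape."""
    slices = np.moveaxis(volume, axis, 0)
    labels = np.stack([teacher(s) for s in slices], axis=0)
    return np.moveaxis(labels, 0, axis)

rng = np.random.default_rng(0)
volume = rng.random((6, 8, 8))  # toy volume: 6 slices of 8x8
pseudo = densify(volume, teacher_predict_slice)
# The 3D student would then be trained on (volume, pseudo) pairs.
```

Because each slice is labeled independently, nothing in this step enforces consistency between neighbouring slices, which is one way pseudo-label noise could reach the 3D student.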

If this is right

3D student models exhibit different optimization needs and require conservative regularization unlike their 2D counterparts.
Human-centric preprocessing like CLAHE should be avoided as it harms model performance by disrupting statistical cues.
Soft-labeling and strong augmentation improve 2D performance on sparse data but must not be directly transferred to 3D.
Multi-label segmentation of white and gray matter lesions in large ex vivo MRI datasets is achievable with sparse 2D annotations under appropriate conditions.
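The claim that CLAHE disrupts global statistical cues can be illustrated with a deliberately simplified stand-in: rank-based histogram equalization applied independently to non-overlapping tiles. Real CLAHE adds a clip limit and interpolates between tiles, so this sketch exaggerates the effect; the toy image and tile size are assumptions.

```python
import numpy as np

def equalize_tile(tile):
    """Rank-based histogram equalization of one tile to [0, 1].
    Assumes the tile has at least two pixels."""
    flat = tile.ravel()
    ranks = flat.argsort().argsort()  # rank of each pixel within the tile
    return (ranks / (flat.size - 1)).reshape(tile.shape)

def tilewise_equalize(image, tile):
    """Equalize each non-overlapping tile independently
    (CLAHE without its clip limit or inter-tile interpolation)."""
    out = np.empty(image.shape, dtype=float)
    for r in range(0, image.shape[0], tile):
        for c in range(0, image.shape[1], tile):
            out[r:r + tile, c:c + tile] = equalize_tile(image[r:r + tile, c:c + tile])
    return out

rng = np.random.default_rng(0)
image = rng.normal(0.3, 0.02, size=(16, 16))  # toy image with a dark background
image[:, 8:] += 0.4                           # right half is globally brighter

before = image[:, 8:].mean() - image[:, :8].mean()
after_img = tilewise_equalize(image, tile=8)
after = after_img[:, 8:].mean() - after_img[:, :8].mean()
```

Before equalization the two halves differ by about 0.4 in mean intensity; afterwards every tile has the same mean, so a model can no longer use absolute brightness to tell the regions apart.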

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The performance differences may stem from 3D models being more vulnerable to errors in the pseudo-labels generated by 2D teachers.
These results suggest that dimensionality-specific tuning is necessary in other volumetric imaging tasks beyond spinal cord MRI.
Future experiments could test whether using ground truth dense labels eliminates the need for conservative regularization in 3D.

Load-bearing premise

The pseudo-labels produced by the 2D teacher are accurate enough to train the 3D student without systematic errors that account for the performance differences.

What would settle it

An experiment that trains the 3D model directly on dense ground-truth labels and tests whether applying strong augmentation and soft-labeling still degrades performance compared to conservative settings.
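One way to lay out that experiment is as a small factorial grid. The sketch below only enumerates the conditions; the factor names and levels are hypothetical and are not taken from the paper or its repository.

```python
from itertools import product

# Hypothetical factor names and levels; none of these identifiers come from the paper.
factors = {
    "labels": ["teacher_pseudo_labels", "dense_ground_truth"],
    "augmentation": ["conservative", "strong"],
    "soft_labels": [False, True],
}

# One dict per training run: the full 2 x 2 x 2 design.
runs = [dict(zip(factors, combo)) for combo in product(*factors.values())]

# The decisive comparison: with dense ground truth, does strong augmentation
# plus soft labels still lose to the conservative, hard-label setting?
key_pair = [
    r for r in runs
    if r["labels"] == "dense_ground_truth"
    and (r["augmentation"], r["soft_labels"]) in {("conservative", False), ("strong", True)}
]
```

If the degradation persists in the ground-truth arm, pseudo-label noise alone cannot explain it.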

Figures

Figures reproduced from arXiv: 2605.12753 by Brandon Bujak, Charidimos Tsagkas, Daniel Reich, Govind Nair, Irene Cortese, Julien Cohen-Adad, Kuan Yi Wang, Paul Hoareau, Roy Sun.

Figure 2: Illustration of the effect of our preprocessing.

Figure 3: Illustration of the soft edges on the magnitude.

Figure 4: Illustration of the jitter noise on a sagittal plane

Figure 5: Segmentation results on the test set. Each panel ...

Figure 6: Sagittal view of a rough Pseudo Ground Truth and ...

Figure 7: Illustration of the Otsu masking. The original ...

Original abstract

INTRODUCTION | Fully supervised 3D segmentation of high-resolution ex vivo MRI is limited by the prohibitive cost of volumetric annotation, forcing reliance on sparse 2D slices. Weakly supervised Sparse-to-Dense frameworks bridge this gap, but guidelines remain ambiguous regarding human-centric visual enhancements and transferring optimization strategies across dimensions. We analyze divergent regularization needs for multi-class segmentation of high-resolution ex vivo spinal cord MRI.

METHODS | We used 9.4T MRI of multiple sclerosis spinal cords (>104,000 slices) with sparse annotations (428 slices). A 2D Teacher trained on sparse slices generated dense pseudo-labels to train a 3D Student. We systematically evaluated the impact of human-centric preprocessing, spatial augmentation, and soft-label regularization on both architectures.

RESULTS | We identified a critical divergence in training dynamics. The 2D Teacher required strong spatial augmentation and soft-labeling to overcome data scarcity, improving White Matter Lesion Dice scores by >11 points. However, propagating these techniques to the 3D Student degraded its performance. Furthermore, human-centric preprocessing (e.g., CLAHE) disrupted global statistical cues, dropping Gray Matter Lesion Dice scores by ~25 points.

DISCUSSION | Our study highlights a perception divergence (human-centric contrast enhancement harms machine models) and a regularization conflict across dimensions. 3D architectures trained on dense pseudo-labels exhibit fundamentally different optimization landscapes than 2D counterparts and require distinct, conservative regularization.

Code and models: https://github.com/ivadomed/model_seg_sc-gm-lesion_human_ms_exvivo_t2star

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows that regularization choices optimal for a 2D teacher degrade the 3D student in this sparse-to-dense setup, and that human-centric preprocessing like CLAHE can sharply hurt 3D lesion scores.

The main finding is that strong spatial augmentation and soft labeling, which lifted the 2D teacher's white matter lesion Dice by over 11 points, reduced performance when passed to the 3D student. CLAHE and similar contrast steps also dropped gray matter lesion Dice by around 25 points by interfering with global statistics. The work tests these effects on a large ex vivo spinal cord dataset with over 104,000 slices but only 428 annotated ones, using a 2D teacher to create dense pseudo-labels for the 3D student.

It gives concrete numbers on where the techniques diverge and shares code and models, which makes the results easier to check or extend. The systematic comparison of preprocessing, augmentation, and label softening across dimensions is the useful part.

The soft spot is the lack of any reported check on pseudo-label accuracy. The abstract reports neither slice-wise agreement with held-out labels nor 3D volume consistency metrics, so the claim of fundamentally different optimization landscapes could partly reflect noise in the targets rather than a pure 2D-versus-3D effect. More ablation tables and error bars would also help.

This is for groups doing weakly supervised 3D medical segmentation with sparse annotations, especially in high-resolution MRI where full labeling is impractical. Readers working on teacher-student pipelines will get practical pointers on what not to copy across dimensions. It deserves peer review because the experiments are direct and the code is public, even if the pseudo-label validation needs tightening.

Referee Report

2 major / 2 minor

Summary. The paper proposes a sparse 2D-to-dense 3D weakly supervised framework for multi-label segmentation of high-resolution ex vivo spinal cord MRI (>104k slices, 428 sparse annotations). A 2D teacher trained on sparse slices generates dense pseudo-labels for a 3D student; systematic ablations show that strong spatial augmentation plus soft labeling raises 2D White Matter Lesion Dice by >11 points but degrades the 3D student, while human-centric preprocessing (CLAHE) drops Gray Matter Lesion Dice by ~25 points. The authors conclude that 2D and 3D models occupy distinct optimization landscapes and require dimension-specific regularization.

Significance. If the reported divergence is robust, the work supplies concrete, actionable guidelines for transferring regularization and preprocessing choices across dimensions in weakly supervised medical segmentation, where full 3D annotation is prohibitive. It also supplies a large-scale ex vivo MS dataset and open code, which are valuable for reproducibility.

major comments (2)

[Results] Results section: the headline claim that the 3D student exhibits a fundamentally different optimization landscape rests on the untested assumption that the 2D teacher's dense pseudo-labels are sufficiently accurate. No slice-wise or volume-wise fidelity metrics (Dice against held-out manual labels, label consistency across reconstructed 3D volumes) are reported; without them the observed 3D degradation could be explained by label noise rather than dimensionality.
[Methods] Methods and Results: the reported Dice gains (>11 pt WM lesion, ~25 pt GM lesion drop) are presented without error bars, statistical tests, or complete ablation tables that isolate each factor (augmentation strength, soft-label temperature, CLAHE). This prevents assessment of whether the dimensional divergence is statistically reliable or sensitive to hyper-parameter choices.

minor comments (2)

[Abstract] Abstract and Results: numerical claims should be accompanied by the corresponding baseline Dice values and the exact number of test volumes or slices used.
[Discussion] Discussion: the term 'perception divergence' is introduced without a precise definition or supporting quantitative comparison between human and model sensitivity to contrast changes.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their valuable comments on our manuscript. We address each major comment point-by-point below and indicate the revisions we will make to strengthen the paper.

Referee: [Results] Results section: the headline claim that the 3D student exhibits a fundamentally different optimization landscape rests on the untested assumption that the 2D teacher's dense pseudo-labels are sufficiently accurate. No slice-wise or volume-wise fidelity metrics (Dice against held-out manual labels, label consistency across reconstructed 3D volumes) are reported; without them the observed 3D degradation could be explained by label noise rather than dimensionality.

Authors: We appreciate this observation. The 2D teacher was trained and evaluated on the sparse annotations with a held-out validation set, providing indirect support for pseudo-label quality. Importantly, our ablation studies fix the pseudo-label generation process and vary only the 3D training regularization, demonstrating that the performance differences arise from the interaction between the 3D model and the regularization choices. To directly address the concern, we will report slice-wise Dice scores of the pseudo-labels against additional held-out manual segmentations and assess label consistency in the reconstructed volumes in the revised manuscript.
Revision: yes
Referee: [Methods] Methods and Results: the reported Dice gains (>11 pt WM lesion, ~25 pt GM lesion drop) are presented without error bars, statistical tests, or complete ablation tables that isolate each factor (augmentation strength, soft-label temperature, CLAHE). This prevents assessment of whether the dimensional divergence is statistically reliable or sensitive to hyper-parameter choices.

Authors: We agree that the presentation of results can be improved for statistical robustness. In the revised manuscript, we will include error bars representing standard deviation across multiple independent training runs, conduct appropriate statistical tests to confirm the significance of the reported differences, and provide expanded ablation tables that systematically isolate the effects of each factor (augmentation strength, soft-label temperature, and CLAHE) separately for the 2D teacher and 3D student models.
Revision: yes
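The error bars and significance test discussed in this exchange could be computed along the following lines. This is a sketch under stated assumptions: runs are paired by random seed, the test is an exact sign-flip permutation test, and the Dice values are placeholders invented for illustration, not results from the paper.

```python
from itertools import product
from statistics import mean, stdev

def paired_sign_flip_p(a, b):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(mean(diffs))
    count = 0
    total = 0
    # Enumerate every assignment of signs to the paired differences.
    for signs in product([1, -1], repeat=len(diffs)):
        total += 1
        permuted = abs(mean([s * d for s, d in zip(signs, diffs)]))
        if permuted >= observed - 1e-12:
            count += 1
    return count / total

# Placeholder Dice scores (in points) for five seeds; NOT results from the paper.
conservative = [61.0, 62.5, 60.8, 63.1, 61.9]
strong = [58.2, 59.0, 57.5, 60.3, 58.8]

summary = {
    "conservative": (mean(conservative), stdev(conservative)),
    "strong": (mean(strong), stdev(strong)),
}
p_value = paired_sign_flip_p(conservative, strong)
```

With five paired runs the smallest two-sided p-value this test can return is 2/32, so more seeds are needed before a 0.05 threshold is even reachable.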

Circularity Check

0 steps flagged

No circularity; empirical results from held-out evaluation

The manuscript reports measured Dice-score differences from training a 2D teacher on sparse slices and a 3D student on its dense pseudo-labels. No equations, first-principles derivations, or fitted parameters are presented whose outputs reduce to the inputs by construction. Performance gaps (e.g., a gain of >11 points in WM-lesion Dice for the 2D teacher with augmentation and soft labels, degradation for the 3D student, and a drop of ~25 points in GM-lesion Dice with CLAHE) are direct experimental observations on held-out data, not tautological restatements. Any self-citations are incidental and not load-bearing for the central empirical claims. The analysis is therefore self-contained and externally falsifiable via the reported metrics.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that 2D-generated pseudo-labels provide a valid training signal for 3D models and that observed differences stem from dimensionality rather than dataset-specific artifacts.

free parameters (2)

spatial augmentation strength
Chosen to optimize 2D teacher but shown to harm 3D student; value is tuned rather than derived.
soft-label temperature
Hyperparameter controlling label softness whose optimal setting differs by architecture.

axioms (1)

Domain assumption: pseudo-labels from the 2D teacher are sufficiently accurate for 3D training
Core premise of the sparse-to-dense pipeline invoked in the methods description.
