pith. machine review for the scientific record.

arxiv: 2605.14566 · v1 · submitted 2026-05-14 · 💻 cs.CV

Recognition: 1 theorem link · Lean Theorem

SpectraFlow: Unifying Structural Pretraining and Frequency Adaptation for Medical Image Segmentation

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 01:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords medical image segmentation · self-supervised pretraining · latent transport regression · frequency adaptation · boundary refinement · low-data regimes · structural representations

The pith

Aligning images and binary masks in a shared latent space through latent transport regression produces transferable structural representations that improve medical image segmentation accuracy and boundary precision in low-data regimes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to overcome texture bias in self-supervised pretraining for medical image segmentation by developing a two-stage framework that prioritizes geometric structure and boundary detail. In the first stage, Mixed-Domain MeanFlow Pretraining aligns images with binary masks inside a common latent space using latent transport regression, where the masks supply conditional structural guidance rather than serving as direct prediction targets, and a dispersive loss guards against collapse under scarce labels. The second stage attaches a lightweight decoder that applies attentional cross-scale fusion and frequency-directional dynamic convolutions to refine high-frequency boundaries under varying appearance. This matters for clinical tasks because accurate delineation of lesions and organs often fails when annotations are limited, leading to poor generalization and missing fine structures. The resulting representations are intended to be task-agnostic and to transfer more reliably than texture-focused alternatives.
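
Of the Stage-1 ingredients, the dispersive regularizer has the clearest published antecedent (Wang and He, reference 26 below). As a concrete anchor, here is a minimal sketch of one plausible form, assuming their repulsion-only, log-mean-exp objective; the function name and the temperature tau are illustrative choices, not the authors' code.

```python
import torch

def dispersive_loss(z: torch.Tensor, tau: float = 0.5) -> torch.Tensor:
    """One plausible Dispersive Loss: log mean_{i != j} exp(-||z_i - z_j||^2 / tau).

    Minimizing this log-mean-exp pushes pairwise latent distances apart,
    which is the collapse-prevention role the review ascribes to it.
    """
    z = z.flatten(1)                                   # (B, D) latents
    d2 = torch.cdist(z, z).pow(2)                      # pairwise squared distances
    off_diag = ~torch.eye(z.shape[0], dtype=torch.bool, device=z.device)
    return torch.log(torch.exp(-d2[off_diag] / tau).mean())
```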

Core claim

The central claim is that coupling structure-aware encoder pretraining with boundary-oriented decoding yields task-agnostic representations that transfer effectively to downstream segmentation. The pretraining aligns images and binary masks in a shared latent space through latent transport regression, with masks acting as conditional structural guidance rather than prediction targets. Fine-tuning then adds Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement, and is claimed to produce consistent gains in accuracy, robustness under low data, and boundary sharpness on ISIC-2016, Kvasir-SEG, and GlaS.

What carries the argument

Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression with masks serving as conditional structural guidance instead of prediction targets.
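
The reviewed text never pins down the transport objective itself (the referee's first minor comment below makes the same point). The sketch that follows is a simplified stand-in under stated assumptions: a linear-path, flow-matching-style regression in which the mask latent enters only as conditioning, never as a regression target. The paper's MeanFlow variant would regress an average velocity over an interval instead; all names here (vel_net, z_img, z_mask) are hypothetical.

```python
import torch
import torch.nn as nn

def transport_regression_loss(vel_net: nn.Module,
                              z_img: torch.Tensor,
                              z_mask: torch.Tensor) -> torch.Tensor:
    """Hypothetical latent transport regression (flow-matching simplification)."""
    b = z_img.shape[0]
    t = torch.rand(b, 1, device=z_img.device)   # transport time in [0, 1]
    source = torch.randn_like(z_img)            # source end of the transport
    z_t = (1 - t) * source + t * z_img          # point on the linear path
    v_target = z_img - source                   # instantaneous velocity target
    # The mask latent is concatenated as conditioning only; the loss never
    # asks the network to reproduce it.
    v_pred = vel_net(torch.cat([z_t, z_mask, t], dim=1))
    return (v_pred - v_target).pow(2).mean()
```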

Load-bearing premise

That aligning images and binary masks through latent transport regression in a shared latent space produces task-agnostic structural representations that transfer effectively to downstream segmentation without bias from the mask generation or pretraining process.

What would settle it

An experiment showing that a randomly initialized encoder achieves equal or higher segmentation accuracy and boundary precision than the pretrained encoder on the same low-data splits of ISIC-2016, Kvasir-SEG, or GlaS would falsify the value of the structural pretraining stage.
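
One way to operationalize that test, sketched with injected placeholders (make_encoder, finetune, and evaluate are hypothetical callables, not a real API):

```python
def pretraining_ablation(make_encoder, finetune, evaluate,
                         fractions=(0.1, 0.2, 0.5, 1.0), seeds=(0, 1, 2)):
    """Compare pretrained vs. randomly initialized encoders under
    identical low-data fine-tuning; hypothetical protocol sketch."""
    results = {}
    for frac in fractions:
        for seed in seeds:
            for init in ("pretrained", "random"):
                encoder = make_encoder(init=init, seed=seed)
                model = finetune(encoder, label_fraction=frac)
                results[(init, frac, seed)] = evaluate(model)  # e.g. Dice, HD95
    # Stage-1 is falsified on a benchmark if "random" matches or beats
    # "pretrained" across fractions, within seed-to-seed variance.
    return results
```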

Figures

Figures reproduced from arXiv: 2605.14566 by Guowei Zou, Haitao Wang, Hejun Wu, Zhiquan Chen.

Figure 1. Motivation: texture-sensitive representations cause high-frequency… view at source ↗
Figure 2. Overview of the proposed two-stage framework. (a) Stage-1 performs geometry-guided latent representation pretraining with MeanFlow and Dispersive Loss; (b) Stage-2 adopts a lightweight decoder with DAF-based fusion and FDConv refinement; (c) illustrates the structure of the DAF module. view at source ↗
Figure 3. t-SNE visualization of latent representations with and without Dispersive Loss. view at source ↗
Figure 4. Effect of different proportions of training data on ISIC-2016. view at source ↗
Figure 5. Robustness analysis against appearance corruptions on Kvasir-SEG. view at source ↗
Figure 6. Qualitative comparison of segmentation results on Kvasir-SEG, ISIC-2016, and GlaS. view at source ↗
Figure 7. Qualitative visualization of DAF gating behavior. view at source ↗
Figure 8. Sensitivity analysis of the Dispersive Loss weight. view at source ↗
Original abstract

Medical image segmentation remains challenging in low-data regimes, where scarce annotations often yield poor generalization and ambiguous boundaries with missing fine structures. Recent self-supervised pretraining has improved transferability, but it often exhibits a texture bias. In contrast, accurate segmentation is inherently geometry-aware and depends on both topological consistency and precise boundary preservation. To address this problem, we propose a two-stage framework that couples structure-aware encoder pretraining with boundary-oriented decoding. In Stage-1, we aim to learn structure-aware representations for downstream segmentation in low-data regimes. To this end, we propose Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets, making the pretraining task-agnostic. To further improve training stability under scarce supervision, we incorporate a lightweight Dispersive Loss to prevent representation collapse. In Stage-2, we fine-tune the pretrained encoder with a lightweight decoder that combines Direct Attentional Fusion for adaptive cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement under appearance variation. Experiments on ISIC-2016, Kvasir-SEG, and GlaS demonstrate consistent gains over state-of-the-art methods, with improved robustness in low-data settings and sharper boundary delineation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes SpectraFlow, a two-stage framework for medical image segmentation in low-data regimes. In Stage-1, Mixed-Domain MeanFlow Pretraining aligns images and binary masks in a shared latent space via latent transport regression (masks serve as conditional structural guidance rather than targets) augmented by a Dispersive Loss to avoid collapse. Stage-2 fine-tunes the pretrained encoder using a lightweight decoder that integrates Direct Attentional Fusion for cross-scale gating and Frequency-Directional Dynamic Convolution for high-frequency boundary refinement. Experiments on ISIC-2016, Kvasir-SEG, and GlaS are reported to yield consistent gains over state-of-the-art methods, with improved low-data robustness and sharper boundaries.

Significance. If the quantitative results hold, the work offers a plausible route to reduce texture bias in self-supervised pretraining for geometry-sensitive tasks such as medical segmentation. The explicit separation of structure-aware pretraining from frequency-adaptive decoding, together with the task-agnostic framing of mask guidance, could improve transfer in annotation-scarce settings; the combination of latent transport regression and dispersive regularization is a concrete technical contribution worth evaluating.

major comments (2)
  1. [Abstract / Stage-1 Pretraining] Abstract and Stage-1 description: the claim that pretraining is task-agnostic because masks act only as 'conditional structural guidance' is not supported by the experimental design. Binary masks are generated from the identical ground-truth annotations later used for fine-tuning on ISIC-2016, Kvasir-SEG, and GlaS; latent transport regression can therefore embed dataset-specific topological priors (lesion shape distributions, boundary curvature statistics) rather than generic geometry. The Dispersive Loss prevents collapse but does not remove this source of leakage.
  2. [Abstract] Abstract: no quantitative metrics, baseline tables, statistical tests, error bars, or low-data protocol details are supplied. The central claim of 'consistent gains' and 'improved robustness in low-data settings' cannot be assessed for effect size or reproducibility from the provided information.
minor comments (2)
  1. [Stage-1] Clarify the precise form of the latent transport regression objective and the weighting schedule between the transport loss and Dispersive Loss; current notation leaves the optimization target ambiguous.
  2. [Stage-2] The terms 'Direct Attentional Fusion' and 'Frequency-Directional Dynamic Convolution' are introduced without equations or architectural diagrams; a single figure or pseudocode block would improve reproducibility.
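
In the spirit of the second minor comment, here is a pseudocode reading of the gating that Figure 7 describes (a sigmoid gate map M mixing skip and upsampled deep features). It is one plausible form, not the paper's module, and it assumes both streams share a channel count.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class GatedFusion(nn.Module):
    """Hypothetical DAF-style fusion: out = M * skip + (1 - M) * deep_up,
    with the gate map M predicted from both feature streams."""
    def __init__(self, channels: int):
        super().__init__()
        self.gate = nn.Conv2d(2 * channels, 1, kernel_size=1)

    def forward(self, skip: torch.Tensor, deep: torch.Tensor) -> torch.Tensor:
        # Upsample the deep feature to the skip resolution, predict a
        # single-channel gate from both streams, and mix convexly.
        deep_up = F.interpolate(deep, size=skip.shape[-2:],
                                mode="bilinear", align_corners=False)
        m = torch.sigmoid(self.gate(torch.cat([skip, deep_up], dim=1)))
        return m * skip + (1 - m) * deep_up
```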

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and insightful comments. We address each major point below and have revised the manuscript to improve clarity and support for our claims.

Point-by-point responses
  1. Referee: [Abstract / Stage-1 Pretraining] Abstract and Stage-1 description: the claim that pretraining is task-agnostic because masks act only as 'conditional structural guidance' is not supported by the experimental design. Binary masks are generated from the identical ground-truth annotations later used for fine-tuning on ISIC-2016, Kvasir-SEG, and GlaS; latent transport regression can therefore embed dataset-specific topological priors (lesion shape distributions, boundary curvature statistics) rather than generic geometry. The Dispersive Loss prevents collapse but does not remove this source of leakage.

    Authors: We appreciate the referee's observation on potential leakage. While the masks originate from the same annotations, they are used exclusively as conditional structural guidance within the latent transport regression objective, not as direct segmentation targets. This formulation encourages the encoder to capture transferable geometric and topological features rather than dataset-specific appearance patterns. The Dispersive Loss promotes representation diversity but does not fully eliminate domain-specific priors. To address this, we have revised the manuscript to qualify the 'task-agnostic' phrasing as 'structure-aware pretraining with in-domain mask guidance' and added an ablation using synthetic masks from external shape priors to quantify the contribution of dataset-specific topology (a hedged sketch of such a shape prior follows these responses). revision: partial

  2. Referee: [Abstract] Abstract: no quantitative metrics, baseline tables, statistical tests, error bars, or low-data protocol details are supplied. The central claim of 'consistent gains' and 'improved robustness in low-data settings' cannot be assessed for effect size or reproducibility from the provided information.

    Authors: We agree that the abstract should include concrete quantitative support. In the revised manuscript, we have updated the abstract to report key Dice and HD95 improvements on ISIC-2016, Kvasir-SEG, and GlaS under the 10% and 20% annotation regimes, along with a brief description of the low-data protocol and mention of statistical significance testing. Full tables with baselines, error bars, and detailed protocols remain in the experimental section. revision: yes
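
The external-shape-prior ablation promised in response 1 presupposes a mask generator that carries no in-domain topology. A minimal hypothetical generator, here a random rotated ellipse:

```python
import numpy as np

def random_ellipse_mask(h=256, w=256, rng=None):
    """Hypothetical external shape prior: a random rotated ellipse.
    Any Stage-1 gain that survives such masks cannot be explained by
    leakage of dataset-specific lesion topology."""
    rng = rng or np.random.default_rng()
    cy, cx = rng.uniform(0.3, 0.7, 2) * (h, w)   # ellipse center
    ry, rx = rng.uniform(0.1, 0.3, 2) * (h, w)   # ellipse radii
    theta = rng.uniform(0.0, np.pi)              # orientation
    yy, xx = np.mgrid[:h, :w]
    y, x = yy - cy, xx - cx
    u = (x * np.cos(theta) + y * np.sin(theta)) / rx
    v = (-x * np.sin(theta) + y * np.cos(theta)) / ry
    return (u * u + v * v <= 1.0).astype(np.float32)
```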

Circularity Check

0 steps flagged

Pretraining uses mask guidance by design, but the reported gains are not reduced by construction to quantities the pretraining already fitted.

full rationale

The paper defines Mixed-Domain MeanFlow Pretraining as aligning image and binary-mask latents with masks as conditional guidance (not targets), explicitly labeling the setup task-agnostic. No equations or self-citations in the provided text equate the downstream segmentation gains or boundary improvements to quantities defined solely by the pretraining fit on the same annotations. The experimental claims on ISIC-2016, Kvasir-SEG, and GlaS remain independent empirical results rather than tautological renamings or self-referential predictions. This yields only a minor score for the inherent use of ground-truth masks in pretraining, which does not collapse the derivation chain.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 4 invented entities

The central claim rests on several newly introduced components whose benefits are asserted via experiments on three datasets; no independent external validation or formal proofs are referenced.

free parameters (2)
  • Dispersive Loss weighting factor
    Balance hyperparameter controlling the contribution of the loss that prevents representation collapse during pretraining (a hedged sweep sketch follows this ledger).
  • Network architecture hyperparameters
    Standard deep learning choices such as learning rates, layer dimensions, and convolution kernel sizes that are not specified in the abstract.
axioms (1)
  • domain assumption Binary masks serve as effective conditional structural guidance for learning task-agnostic representations via latent transport regression
    Invoked in Stage-1 to make pretraining independent of the downstream segmentation task.
invented entities (4)
  • Mixed-Domain MeanFlow Pretraining no independent evidence
    purpose: Align images and binary masks in shared latent space for structure-aware representations
    Newly proposed pretraining procedure.
  • Dispersive Loss no independent evidence
    purpose: Prevent representation collapse under scarce supervision
    Lightweight auxiliary loss introduced for training stability.
  • Direct Attentional Fusion no independent evidence
    purpose: Adaptive cross-scale gating in the decoder
    New decoder module for combining multi-scale features.
  • Frequency-Directional Dynamic Convolution no independent evidence
    purpose: High-frequency boundary refinement under appearance variation
    New convolution operation proposed for the decoder.
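
For the first free parameter, Figure 8 reports a sensitivity analysis that varies the Dispersive Loss weight λ from 0.2 to 1.0 on Kvasir-SEG. A hedged sketch of such a sweep, with the training routine injected as a placeholder rather than a real API:

```python
def sweep_dispersive_weight(train_and_eval, lams=(0.2, 0.4, 0.6, 0.8, 1.0)):
    """Hypothetical sensitivity sweep; train_and_eval is a placeholder
    callable returning a metric dict such as {"mIoU": ..., "Dice": ...}."""
    scores = {lam: train_and_eval(dispersive_weight=lam) for lam in lams}
    best = max(scores, key=lambda lam: scores[lam]["mIoU"])
    return best, scores
```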

pith-pipeline@v0.9.0 · 5540 in / 1628 out tokens · 78612 ms · 2026-05-15T01:53:36.558700+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation washburn_uniqueness_aczel · tagged unclear

    Relation between the paper passage and the cited Recognition theorem:

    "Mixed-Domain MeanFlow Pretraining, which aligns images and binary masks in a shared latent space through latent transport regression, where masks act as conditional structural guidance rather than prediction targets"

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages

  1. Bao, H., Dong, L., Wei, F., Yuan, G., Zhang, H., Lin, L., Zhang, B.: BEiT: BERT pre-training of image transformers. In: ICLR (2022)
  2. Cao, H., Wang, Y., Chen, J., Jiang, D., Zhang, X., Tian, Q., Wang, M.: Swin-Unet: Unet-like pure transformer for medical image segmentation (2021)
  3. Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV. pp. 9650–9660 (2021)
  4. Chen, J., Lu, Y., Yu, Q., Luo, X., Adeli, E., Wang, Y., Lu, L., Yuille, A.L., Zhou, Y.: TransUNet: Transformers make strong encoders for medical image segmentation. In: MICCAI. pp. 127–136 (2021)
  5. Chen, L., Gu, L., Li, L., Yan, C., Fu, Y.: Frequency dynamic convolution for dense image prediction. In: CVPR. pp. 30178–30188 (2025)
  6. Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: ICML. pp. 1597–1607 (2020)
  7. Chen, Y., Dai, X., Chen, D., Liu, M., Yuan, L., Liu, Z.: Dynamic convolution: Attention over convolution kernels. In: CVPR. pp. 11030–11039 (2020)
  8. Fujieda, S., Takayama, K., Hachisuka, T.: Wavelet convolutional neural networks for texture classification. In: CVPR. pp. 548–557 (2018)
  9. Geirhos, R., Rubisch, P., Michaelis, C., Bethge, M., Wichmann, F.A.: ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In: ICLR (2019), https://openreview.net/forum?id=Bygh9j09KX
  10. Geng, Z., Deng, M., Bai, X., Kolter, J.Z., He, K.: Mean flows for one-step generative modeling (2025)
  11. Guan, H., et al.: Domain adaptation for medical image analysis: A survey. IEEE Transactions on Biomedical Engineering (2022), https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9011180/
  12. Gutman, D., et al.: Skin lesion analysis toward melanoma detection: A challenge at the ISIC 2016. In: ISBI. pp. 1–10 (2016)
  13. Hatamizadeh, A., Nath, V., Tang, Y., Yang, H., Myronenko, A., Roth, H., Xu, D.: UNETR: Transformers for 3D medical image segmentation. In: CVPR. pp. 14324–14334 (2022)
  14. He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: CVPR. pp. 16000–16009 (2022)
  15. He, K., Fan, H., Wu, Y., Xie, S., Girshick, R.: Momentum contrast for unsupervised visual representation learning. In: CVPR. pp. 9729–9738 (2020)
  16. Hermann, K.L., Chen, T., Kornblith, S.: The origins and prevalence of texture bias in convolutional neural networks. arXiv preprint arXiv:1911.09071 (2019), https://arxiv.org/abs/1911.09071
  17. Hu, Z., Lai, C.H., Wu, G., Mitsufuji, Y., Ermon, S.: MeanFlow transformers with representation autoencoders (2025)
  18. Isensee, F., Jaeger, P.F., Kohl, S.A.A., Petersen, J., Maier-Hein, K.H.: nnU-Net: A self-configuring method for deep learning-based biomedical image segmentation. Nat. Methods 18(2), 203–211 (2021)
  19. Jha, D., Smedsrud, P.H., Riegler, M.A., Halvorsen, P., de Lange, T., Johansen, D., Johansen, H.D.: Kvasir-SEG: A segmented polyp dataset. In: Int. Conf. Multimedia Modeling. pp. 451–462 (2020)
  20. Lee-Thorp, J., Ainslie, J., Eckstein, I., Ontañón, S.: FNet: Mixing tokens with Fourier transforms. In: ICML. pp. 12421–12436 (2022)
  21. Oktay, O., Schlemper, J., Le Folgoc, L., Lee, M., Heinrich, M., Misawa, K., Mori, K., McDonagh, S., Hammerla, N., Kainz, B., Glocker, B., Rueckert, D.: Attention U-Net: Learning where to look for the pancreas (2018)
  22. Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jégou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: DINOv2: Learning robust visual features without supervision
  23. Ronneberger, O., Fischer, P., Brox, T.: U-Net: Convolutional networks for biomedical image segmentation. In: MICCAI. pp. 234–241 (2015)
  24. Sirinukunwattana, K., et al.: Gland segmentation in colon histology images: The GlaS challenge contest and dataset. Sci. Rep. 7, 12821 (2017)
  25. Ugurlu, D., Puyol-Antón, E., et al.: The impact of domain shift on left and right ventricle segmentation in short axis cardiac MR images. arXiv preprint arXiv:2109.13230 (2021), https://arxiv.org/abs/2109.13230
  26. Wang, R., He, K.: Diffuse and disperse: Image generation with representation regularization (2025)
  27. Woo, S., Park, J., Lee, J.Y., Kweon, I.S.: CBAM: Convolutional block attention module. In: ECCV. pp. 3–19 (2018)
  28. Zhao, W., Chen, J., Xu, Y., He, Y., Huang, Z., He, X.: Learning via flow dynamics for self-supervised representation. IEEE TPAMI (2023)
  29. Zhou, Z., Siddiquee, M.M.R., Tajbakhsh, N., Liang, J.: UNet++: A nested U-Net architecture for medical image segmentation. In: MICCAI. pp. 3–11 (2019)
  30. Zhou, Z., Sodha, V., Rahman Siddiquee, M.M., Feng, R., Tajbakhsh, N., Gotway, M.B., Liang, J.: Models Genesis: Generic autodidactic models for 3D medical image analysis. Med. Image Anal. 67, 101840 (2021)