Noise2Map: End-to-End Diffusion Model for Semantic Segmentation and Change Detection
Pith reviewed 2026-05-07 05:44 UTC · model grok-4.3
The pith
A diffusion model can directly predict semantic segmentation and change maps from satellite images by conditioning its denoising on task-specific noise schedules.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Noise2Map is a unified diffusion-based framework that repurposes the denoising process for fast, end-to-end discriminative learning by directly predicting semantic or change maps using task-specific noise schedules and timestep conditioning. The model is pretrained via self-supervised denoising and fine-tuned with supervision on a shared backbone that supports both semantic segmentation and change detection through separate schedulers. Unlike prior diffusion work limited to generation or feature extraction, this avoids costly iterative sampling. On the SpaceNet7, WHU, and xView2 datasets it achieves the top average rank among seven models for segmentation and the top rank for change detection.
What carries the argument
The task-specific noise scheduler combined with timestep conditioning inside the diffusion denoising network, which converts the generative process into direct prediction of segmentation or change maps from noisy inputs.
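The mechanism can be sketched as a single-pass, timestep-conditioned predictor. This is a minimal illustration with assumed schedule parameters and a toy thresholding stand-in for the trained backbone; the paper's actual task-specific schedules and network are not reproduced here.

```python
import numpy as np

def linear_alphas_cumprod(T=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative signal-retention coefficients for a linear beta schedule.
    (Hypothetical parameter values; the paper's task-specific schedules differ.)"""
    betas = np.linspace(beta_start, beta_end, T)
    return np.cumprod(1.0 - betas)

def corrupt(x, t, alphas_cumprod, rng):
    """Forward diffusion: mix the input with Gaussian noise at timestep t."""
    a = alphas_cumprod[t]
    return np.sqrt(a) * x + np.sqrt(1.0 - a) * rng.standard_normal(x.shape)

def predict_map(denoiser, x, t, alphas_cumprod, rng):
    """Single-pass discriminative use: one denoiser call at a fixed timestep t
    yields the segmentation/change map -- no iterative sampling loop."""
    x_t = corrupt(x, t, alphas_cumprod, rng)
    return denoiser(x_t, t)  # the network is conditioned on t

rng = np.random.default_rng(0)
acp = linear_alphas_cumprod()
image = rng.random((64, 64))
# toy stand-in for the trained backbone (thresholding, purely illustrative)
toy_denoiser = lambda x_t, t: (x_t > x_t.mean()).astype(np.uint8)
seg_map = predict_map(toy_denoiser, image, t=50, alphas_cumprod=acp, rng=rng)
```

The point of the sketch is the call pattern: corruption at one chosen timestep followed by one conditioned forward pass, in contrast to the many-step reverse process of generative diffusion.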
If this is right
- The shared backbone allows a single trained model to handle both semantic segmentation and change detection without separate architectures.
- Self-supervised denoising pretraining followed by fine-tuning improves robustness on remote sensing data that often has limited labels.
- Avoiding full diffusion sampling at inference reduces computation, making the approach suitable for processing large volumes of satellite imagery.
- Ablation results indicate the method remains stable across different choices of noise schedulers and timestep controls during training.
Where Pith is reading between the lines
- The same conditioning trick on noise schedules might transfer to other dense prediction tasks such as depth estimation or instance segmentation where generative priors could help.
- In practice this could support faster disaster response mapping because the model produces outputs in one forward pass rather than iterative refinement.
- Interpretability might arise from inspecting intermediate denoising steps to see how the model resolves ambiguous regions in satellite scenes.
Load-bearing premise
Repurposing the denoising process with task-specific noise schedules and timestep conditioning enables effective end-to-end discriminative learning for semantic segmentation and change detection without needing costly sampling procedures.
What would settle it
If the model is evaluated on the xView2 wildfire damage dataset and its average F1 score for change detection falls below the top baseline instead of ranking first, the performance claim would be falsified.
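The cross-dataset rank metric behind this criterion (average F1 primary, IoU tie-break) can be reproduced in outline. The sketch below assumes one (F1, IoU) pair per model per dataset; the numbers are illustrative, not the paper's results.

```python
def cross_dataset_ranks(scores):
    """scores: {model: [(f1, iou) per dataset]}.
    Rank models per dataset by F1 (descending), breaking ties by IoU,
    then average the per-dataset ranks (lower is better)."""
    models = list(scores)
    n_datasets = len(next(iter(scores.values())))
    avg = {m: 0.0 for m in models}
    for d in range(n_datasets):
        order = sorted(models, key=lambda m: (-scores[m][d][0], -scores[m][d][1]))
        for rank, m in enumerate(order, start=1):
            avg[m] += rank / n_datasets
    return avg

# illustrative scores only (not from the paper)
scores = {
    "Noise2Map": [(0.71, 0.60), (0.83, 0.74), (0.66, 0.55)],
    "BaselineA": [(0.71, 0.58), (0.80, 0.72), (0.64, 0.53)],
    "BaselineB": [(0.69, 0.59), (0.81, 0.70), (0.67, 0.54)],
}
ranks = cross_dataset_ranks(scores)
best = min(ranks, key=ranks.get)  # model with the lowest average rank
```

Note how the first dataset exercises the tie-break: both leading models score F1 = 0.71, so IoU decides the rank.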
read the original abstract
Semantic segmentation and change detection are two fundamental challenges in remote sensing, requiring models to capture either spatial semantics or temporal differences from satellite imagery. Existing deep learning models often struggle with temporal inconsistencies or in capturing fine-grained spatial structures, require extensive pretraining, and offer limited interpretability - especially in real-world remote sensing scenarios. Recent advances in diffusion models show that Gaussian noise can be systematically leveraged to learn expressive data representations through denoising. Motivated by this, we investigate whether the noise process in diffusion models can be effectively utilized for discriminative tasks. We propose Noise2Map, a unified diffusion-based framework that repurposes the denoising process for fast, end-to-end discriminative learning. Unlike prior work that uses diffusion only for generation or feature extraction, Noise2Map directly predicts semantic or change maps using task-specific noise schedules and timestep conditioning, avoiding the costly sampling procedures of traditional diffusion models. The model is pretrained via self-supervised denoising and fine-tuned with supervision, enabling both interpretability and robustness. Our architecture supports both tasks (SS and CD) through a shared backbone and task-specific noise schedulers. Extensive evaluations on the SpaceNet7, WHU, and xView2 buildings damaged by wildfires datasets demonstrate that Noise2Map ranks on average 1st among seven models on semantic segmentation and 1st on change detection by a cross-dataset rank metric (average F1 primary, IoU tie-break). Ablation studies highlight the robustness of our model against different training noise schedulers and timestep control in the diffusion process, as well as the ability of the model to perform multi-task learning.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents Noise2Map, a unified diffusion-based framework for semantic segmentation and change detection in remote sensing imagery. It repurposes the denoising process via task-specific noise schedules and timestep conditioning to enable direct, end-to-end prediction of semantic or change maps from a shared backbone, with self-supervised denoising pretraining followed by supervised fine-tuning. The model performs single-pass inference at a chosen timestep. Extensive evaluations on SpaceNet7, WHU, and xView2 datasets claim that Noise2Map achieves the highest average rank (by F1 primary, IoU tie-break) among seven models for both tasks, with ablations demonstrating robustness to scheduler variations.
Significance. If the central claims hold after addressing the isolation of components, the work could demonstrate a practical way to adapt diffusion models for efficient discriminative tasks in remote sensing, potentially improving robustness and interpretability while avoiding iterative sampling. The use of multiple public datasets, a cross-dataset ranking metric, and multi-task support via shared backbone are positive aspects. However, the significance is currently limited by uncertainty over whether the reported gains derive specifically from the diffusion repurposing rather than backbone capacity or pretraining.
major comments (2)
- [Ablation studies] Ablation studies section: The reported ablations examine robustness across different noise schedulers and timestep controls, yet omit a control experiment that trains the identical backbone architecture end-to-end on clean images using standard supervised fine-tuning without timestep conditioning or noise addition. This control is load-bearing for the headline ranking claim, as it is needed to determine whether the F1/IoU advantages on SpaceNet7, WHU, and xView2 arise from the task-specific noise schedules and timestep conditioning or from other factors such as model capacity and self-supervised pretraining.
- [Experimental results] Experimental results and comparisons: The cross-dataset rank metric establishing Noise2Map as 1st among seven models on both tasks does not include error bars, statistical significance testing, or explicit confirmation that the baseline models received equivalent pretraining and data augmentation. Without these, the ranking's robustness cannot be fully assessed, particularly given the claim of superior performance on wildfire-damaged building datasets.
minor comments (2)
- [Abstract and Method] The abstract states that the model 'avoids the costly sampling procedures' via single-pass inference, but the method section should more explicitly describe the chosen inference timestep and how it is selected during fine-tuning versus testing.
- [Results tables] Tables reporting F1 and IoU scores should include the exact names and references for the seven compared models, along with any notes on whether they were re-implemented or used off-the-shelf.
Simulated Author's Rebuttal
We are grateful for the referee's insightful comments, which have helped us identify areas to strengthen our manuscript. We provide point-by-point responses to the major comments and commit to revisions that address the concerns regarding ablation controls and experimental rigor.
read point-by-point responses
-
Referee: Ablation studies section: The reported ablations examine robustness across different noise schedulers and timestep controls, yet omit a control experiment that trains the identical backbone architecture end-to-end on clean images using standard supervised fine-tuning without timestep conditioning or noise addition. This control is load-bearing for the headline ranking claim, as it is needed to determine whether the F1/IoU advantages on SpaceNet7, WHU, and xView2 arise from the task-specific noise schedules and timestep conditioning or from other factors such as model capacity and self-supervised pretraining.
Authors: We thank the referee for highlighting this important control. Our ablations demonstrate the model's robustness to scheduler variations within the diffusion paradigm, supporting the utility of task-specific noise schedules. However, to more rigorously isolate the contribution of the diffusion-based components (noise addition and timestep conditioning) from the backbone capacity and pretraining strategy, we will include the suggested control experiment in the revised version. Specifically, we will train the identical backbone architecture end-to-end using standard supervised learning on clean images without noise or timestep conditioning, and compare its performance to Noise2Map on the same datasets. This addition will strengthen the evidence that the performance advantages derive from our proposed repurposing of the diffusion process. revision: yes
-
Referee: Experimental results and comparisons: The cross-dataset rank metric establishing Noise2Map as 1st among seven models on both tasks does not include error bars, statistical significance testing, or explicit confirmation that the baseline models received equivalent pretraining and data augmentation. Without these, the ranking's robustness cannot be fully assessed, particularly given the claim of superior performance on wildfire-damaged building datasets.
Authors: We agree that incorporating error bars, statistical significance testing, and explicit details on baseline training protocols would improve the assessment of our results. In the revised manuscript, we will report standard deviations from multiple training runs for the key metrics, include statistical tests (e.g., Wilcoxon signed-rank test) to evaluate the significance of differences in rankings, and provide a detailed table or section clarifying the pretraining and data augmentation strategies used for each of the seven baseline models. This will confirm that comparisons were fair and address concerns regarding the wildfire-damaged building datasets in xView2. revision: yes
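The proposed significance check can be sketched as an exact paired sign-flip permutation test on per-run F1 differences — a dependency-free stand-in for the Wilcoxon signed-rank test the authors mention. The run values below are illustrative, not paper data.

```python
import itertools

def sign_flip_pvalue(diffs):
    """Exact two-sided paired permutation test: under H0 each per-run
    F1 difference is equally likely to carry either sign."""
    observed = abs(sum(diffs))
    count = total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        if abs(sum(s * d for s, d in zip(signs, diffs))) >= observed:
            count += 1
    return count / total

# illustrative per-run F1 differences (Noise2Map minus baseline)
diffs = [0.012, 0.008, 0.015, 0.010, 0.006, 0.011, 0.009, 0.013]
p = sign_flip_pvalue(diffs)
```

With eight runs the test enumerates all 256 sign patterns, so it is exact; for many runs one would sample patterns or switch to the Wilcoxon test.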
Circularity Check
No circularity in derivation or claims
full rationale
The paper proposes Noise2Map as a repurposing of standard diffusion denoising (with task-specific schedules and timestep conditioning) into an end-to-end discriminative model for semantic segmentation and change detection. It describes a self-supervised pretraining stage followed by supervised fine-tuning on a shared backbone, then reports empirical rankings (average 1st on F1/IoU across SpaceNet7, WHU, xView2) from direct evaluation on public datasets. No equation, prediction, or central claim reduces by construction to a fitted parameter, self-definition, or self-citation chain; the architecture and results remain independent of the inputs they are evaluated against. This is a conventional empirical ML contribution whose validity rests on external benchmarks rather than tautological reduction.
Axiom & Free-Parameter Ledger
free parameters (1)
- task-specific noise schedule parameters
axioms (1)
- domain assumption: The noise process in diffusion models can be leveraged for discriminative tasks through appropriate conditioning and schedules.
Reference graph
Works this paper leans on
- [1] H. Wu, M. Zhang, P. Huang, and W. Tang, "CMLFormer: CNN and multi-scale local-context transformer network for remote sensing images semantic segmentation," IEEE JSTARS, 2024.
- [2] R. Wang et al., "Transformers for remote sensing: A systematic review and analysis," Sensors, vol. 24, no. 11, p. 3495, 2024.
- [3] P. Dhariwal and A. Nichol, "Diffusion models beat GANs on image synthesis," NeurIPS, vol. 34, pp. 8780–8794, 2021.
- [4] R. Rombach, A. Blattmann, D. Lorenz, P. Esser, and B. Ommer, "High-resolution image synthesis with latent diffusion models," in CVPR, 2022, pp. 10684–10695.
- [5] S. Khanna et al., "DiffusionSat: A generative foundation model for satellite imagery," in ICLR, 2024.
- [6] D. Tang, X. Cao, X. Hou, Z. Jiang, J. Liu, and D. Meng, "CRS-Diff: Controllable remote sensing image generation with diffusion model," IEEE TGRS, 2024.
- [7] J. Tian, J. Lei, J. Zhang, W. Xie, and Y. Li, "SwiMDiff: Scene-wide matching contrastive learning with diffusion constraint for remote sensing image," IEEE TGRS, 2024.
- [8] W. G. C. Bandara, N. G. Nair, and V. Patel, "DDPM-CD: Denoising diffusion probabilistic models as feature extractors for remote sensing change detection," in WACV, 2025, pp. 5250–5262.
- [9] Z. Luo et al., "RS-DSeg: Semantic segmentation of high-resolution remote sensing images based on a diffusion model component with unsupervised pretraining," Scientific Reports, vol. 14, no. 1, p. 18609, 2024.
- [10] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in MICCAI, 2015, pp. 234–241.
- [11] V. Badrinarayanan, A. Kendall, and R. Cipolla, "SegNet: A deep convolutional encoder-decoder architecture for image segmentation," IEEE PAMI, vol. 39, no. 12, pp. 2481–2495, 2017.
- [12] L.-C. Chen, G. Papandreou, F. Schroff, and H. Adam, "Rethinking atrous convolution for semantic image segmentation," arXiv preprint arXiv:1706.05587, 2017.
- [13] H. Zhao, J. Shi, X. Qi, X. Wang, and J. Jia, "Pyramid scene parsing network," in CVPR, 2017, pp. 2881–2890.
- [14] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, "SegFormer: Simple and efficient design for semantic segmentation with transformers," NeurIPS, vol. 34, pp. 12077–12090, 2021.
- [15] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, "Unified perceptual parsing for scene understanding," in ECCV, 2018.
- [16] B. Cheng, I. Misra, A. G. Schwing, A. Kirillov, and R. Girdhar, "Masked-attention mask transformer for universal image segmentation," in CVPR, 2022, pp. 1290–1299.
- [17] X. Ma, X. Zhang, M.-O. Pun, and M. Liu, "A multilevel multimodal fusion transformer for remote sensing semantic segmentation," IEEE TGRS, 2024.
- [18] O. Mañas, A. Lacoste, X. Giró-i-Nieto, D. Vazquez, and P. Rodriguez, "Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data," in ICCV, 2021, pp. 9414–9423.
- [19] Y. Wang, N. A. A. Braham, Z. Xiong, C. Liu, C. M. Albrecht, and X. X. Zhu, "SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in earth observation," IEEE GRSM, vol. 11, no. 3, pp. 98–106, 2023.
- [20] Y. Cong et al., "SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery," NeurIPS, vol. 35, pp. 197–211, 2022.
- [21] J. Li, Y. Cai, Q. Li, M. Kou, and T. Zhang, "A review of remote sensing image segmentation by deep learning methods," International Journal of Digital Earth, vol. 17, no. 1, p. 2328827, 2024.
- [22] A. Yu et al., "Deep learning methods for semantic segmentation in remote sensing with small data: A survey," Remote Sensing, vol. 15, no. 20, p. 4987, 2023.
- [23] Y. Ban and O. Yousif, "Change detection techniques: A review," Multi-temporal Remote Sensing: Methods and Applications, pp. 19–43, 2016.
- [24] H. Chen, Z. Qi, and Z. Shi, "Remote sensing image change detection with transformers," IEEE TGRS, vol. 60, pp. 1–14, 2022.
- [25] C. Han, C. Wu, H. Guo, M. Hu, J. Li, and H. Chen, "Change guiding network: Incorporating change prior to guide change detection in remote sensing imagery," IEEE JSTARS, vol. 16, pp. 8395–8407, 2023.
- [26] W. G. C. Bandara and V. M. Patel, "A transformer-based siamese network for change detection," in IGARSS, 2022, pp. 207–210.
- [27] H. Chen, J. Song, C. Han, J. Xia, and N. Yokoya, "ChangeMamba: Remote sensing change detection with spatio-temporal state space model," IEEE TGRS, 2024.
- [28] S. Hafner, H. Fang, H. Azizpour, and Y. Ban, "Continuous urban change detection from satellite image time series with temporal feature refinement and multi-task integration," IEEE TGRS, 2025.
- [29] B. Meng, Q. Xu, Z. Wang, X. Cao, and Q. Huang, "Not all diffusion model activations have been evaluated as discriminative features," NeurIPS, vol. 37, pp. 55141–55177, 2025.
- [30] M. Prabhudesai, T.-W. Ke, A. Li, D. Pathak, and K. Fragkiadaki, "Diffusion-TTA: Test-time adaptation of discriminative models via generative feedback," NeurIPS, vol. 36, pp. 17567–17583, 2023.
- [31] K. Clark and P. Jaini, "Text-to-image diffusion models are zero-shot classifiers," NeurIPS, vol. 36, pp. 58921–58937, 2023.
- [32] E. Hedlin et al., "Unsupervised semantic correspondence using stable diffusion," NeurIPS, vol. 36, 2024.
- [33] J. Wang et al., "Diffusion model is secretly a training-free open vocabulary semantic segmenter," arXiv preprint arXiv:2309.02773, 2023.
- [34] S. Chen, P. Sun, Y. Song, and P. Luo, "DiffusionDet: Diffusion model for object detection," in ICCV, 2023, pp. 19830–19843.
- [35] Z. Wang et al., "Enhance image classification via inter-class image mixup with diffusion model," in CVPR, 2024, pp. 17223–17233.
- [36] L. Yang et al., "Diffusion models: A comprehensive survey of methods and applications," ACM Computing Surveys, vol. 56, pp. 1–39, 2022.
- [37] Y. Liu, J. Yue, S. Xia, P. Ghamisi, W. Xie, and L. Fang, "Diffusion models meet remote sensing: Principles, methods, and perspectives," IEEE TGRS, vol. 62, pp. 1–22, 2024.
- [38] J. Ho, A. Jain, and P. Abbeel, "Denoising diffusion probabilistic models," NeurIPS, vol. 33, pp. 6840–6851, 2020.
- [39] R. Wan, J. Zhang, Y. Huang, Y. Li, B. Hu, and B. Wang, "Leveraging diffusion modeling for remote sensing change detection in built-up urban areas," IEEE Access, vol. 12, pp. 7028–7039, 2024.
- [40] J. Jia, G. Lee, Z. Wang, Z. Lyu, and Y. He, "Siamese meets diffusion network: SMDNet for enhanced change detection in high-resolution RS imagery," IEEE JSTARS, vol. 17, pp. 8189–8202, 2024.
- [41] Y. Xiao, Q. Yuan, K. Jiang, J. He, X. Jin, and L. Zhang, "EDiffSR: An efficient diffusion probabilistic model for remote sensing image super-resolution," IEEE TGRS, vol. 62, pp. 1–14, 2023.
- [42] L. Han et al., "Enhancing remote sensing image super-resolution with efficient hybrid conditional diffusion model," Remote Sensing, vol. 15, no. 13, p. 3452, 2023.
- [43] A. Van Etten, D. Hogan, J. M. Manso, J. Shermeyer, N. Weir, and R. Lewis, "The multi-temporal urban development SpaceNet dataset," in CVPR, 2021, pp. 6398–6407.
- [44] S. Hafner, Y. Ban, and A. Nascetti, "Semi-supervised urban change detection using multi-modal Sentinel-1 SAR and Sentinel-2 MSI data," Remote Sensing, vol. 15, no. 21, p. 5135, 2023.
- [45] D. Lam et al., "xView: Objects in context in overhead imagery," arXiv preprint arXiv:1802.07856, 2018.
- [46] S. Ji, S. Wei, and M. Lu, "Fully convolutional networks for multisource building extraction from an open aerial and satellite imagery data set," IEEE TGRS, vol. 57, no. 1, pp. 574–586, 2018.
- [47] G.-S. Xia et al., "AID: A benchmark data set for performance evaluation of aerial scene classification," IEEE TGRS, vol. 55, no. 7, pp. 3965–3981, 2017.
- [48] Z. Zhou, M. M. Rahman Siddiquee, N. Tajbakhsh, and J. Liang, "UNet++: A nested U-Net architecture for medical image segmentation," in MICCAI Workshops, 2018, pp. 3–11.
- [49] R. Ranftl, A. Bochkovskiy, and V. Koltun, "Vision transformers for dense prediction," in ICCV, 2021, pp. 12179–12188.
- [50] X. Ma, X. Zhang, and M.-O. Pun, "RS3Mamba: Visual state space model for remote sensing image semantic segmentation," IEEE GRSL, vol. 21, pp. 1–5, 2024.
- [51] R. C. Daudt, B. Le Saux, and A. Boulch, "Fully convolutional siamese networks for change detection," in IEEE ICIP, 2018, pp. 4063–4067.
- [52] M. Noman, M. Fiaz, H. Cholakkal, S. Khan, and F. S. Khan, "ELGC-Net: Efficient local–global context aggregation for remote sensing change detection," IEEE TGRS, vol. 62, pp. 1–11, 2024.
- [53] A. Francis and M. Czerkawski, "Major TOM: Expandable datasets for earth observation," in IGARSS, 2024, pp. 2935–2940.
- [54] J. Song, C. Meng, and S. Ermon, "Denoising diffusion implicit models," arXiv preprint arXiv:2010.02502, 2020.
- [55] T. Karras, M. Aittala, T. Aila, and S. Laine, "Elucidating the design space of diffusion-based generative models," NeurIPS, vol. 35, pp. 26565–26577, 2022.