When Less Is More: Simplicity Beats Complexity for Physics-Constrained InSAR Phase Unwrapping
Pith reviewed 2026-05-09 20:17 UTC · model grok-4.3
The pith
A simple vanilla U-Net outperforms complex attention models for InSAR phase unwrapping by respecting physical smoothness.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On 39,724 patches from 20 LiCSAR frames, the vanilla U-Net (7.76 million parameters) reaches R² = 0.834 and RMSE = 1.01 cm, beating 11.37-million-parameter attention models by 34 percent in R² and 51 percent in RMSE. Power spectral density analysis shows that attention injects unphysical content above 0.3 cycles per pixel, violating the smoothness constraints of elastic surface deformation, while the simpler model remains spectrally consistent and runs at 2.92 ms per patch, within operational latency limits.
What carries the argument
Power spectral density analysis that flags unphysical high-frequency artifacts introduced by attention in smooth geophysical fields.
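The diagnostic amounts to asking what fraction of a field's spectral power lies above a radial-frequency cutoff. A minimal NumPy sketch, using a synthetic Gaussian "deformation" field and noise level as illustrative stand-ins (patch size, amplitudes, and cutoff handling are assumptions, not the paper's data):

```python
import numpy as np

def high_freq_energy_fraction(field, cutoff=0.3):
    """Fraction of 2D spectral power above `cutoff` (cycles/pixel, radial)."""
    power = np.abs(np.fft.fft2(field)) ** 2
    fy = np.fft.fftfreq(field.shape[0])
    fx = np.fft.fftfreq(field.shape[1])
    fr = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)  # radial frequency
    return power[fr > cutoff].sum() / power.sum()

# Smooth "deformation-like" field: a broad Gaussian bump (hypothetical).
n = 128
y, x = np.mgrid[0:n, 0:n]
smooth = np.exp(-((x - 64) ** 2 + (y - 64) ** 2) / (2 * 20.0 ** 2))

# The same field with small broadband artifacts added.
rng = np.random.default_rng(0)
noisy = smooth + 0.05 * rng.standard_normal((n, n))

print(high_freq_energy_fraction(smooth))  # near zero for a smooth field
print(high_freq_energy_fraction(noisy))   # markedly larger once artifacts appear
```

The same scalar, computed on model outputs rather than synthetic fields, is what would flag an architecture that hallucinates high-frequency structure into a band-limited deformation signal.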
Load-bearing premise
The LiCSAR patches and 0.3 cycles per pixel threshold fully capture smoothness constraints across all InSAR conditions and sensors.
What would settle it
An attention model achieving lower RMSE than the vanilla U-Net on a new InSAR dataset from a different sensor or deformation regime would disprove the reported superiority.
Figures
read the original abstract
Operational phase unwrapping is the primary computational bottleneck in InSAR-based volcanic and seismic monitoring. We challenge the industry trend of adopting high-complexity computer vision architectures, such as attention mechanisms, without validating their suitability for physics-constrained geophysical regression. We present the first large-scale architectural ablation study on a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). Our results reveal a significant "complexity penalty": a vanilla U-Net (7.76M parameters) achieves $R^2=0.834$ and RMSE $= 1.01$ cm, outperforming 11.37M-parameter attention-based models by 34% in $R^2$ and 51% in RMSE. Power Spectral Density (PSD) analysis provides the physical justification: while attention excels at capturing sharp semantic edges in natural images, it injects unphysical high-frequency artifacts ($>0.3$ cycles/pixel) into geophysical fields, violating the fundamental smoothness constraints of elastic surface deformation. With a 2.92ms inference latency (a $2.5\times$ speedup), the vanilla U-Net is the only candidate to comfortably meet the sub-100ms requirement for operational early-warning systems. This work bridges the "publication-to-practice" gap by proving that convolutional locality outperforms modern complexity for smooth-field regression, advocating for physics-informed simplicity in ML4RS. Code available at https://github.com/prabhjotschugh/When-Less-is-More-InSAR-Phase-Unwrapping
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper conducts a large-scale architectural ablation study on InSAR phase unwrapping using a global LiCSAR benchmark (20 frames, 39,724 patches, 651M pixels). It claims that a vanilla U-Net (7.76M parameters) outperforms 11.37M-parameter attention-based models, achieving R²=0.834 and RMSE=1.01 cm (34% and 51% better, respectively), because attention mechanisms inject unphysical high-frequency artifacts above 0.3 cycles/pixel that violate elastic deformation smoothness constraints, as diagnosed via PSD analysis. The U-Net also provides 2.92 ms inference latency suitable for operational early-warning systems, advocating physics-informed simplicity over complexity in ML4RS.
Significance. If the empirical results and PSD-based mechanistic insight hold, the work demonstrates that convolutional locality can be preferable to attention for smooth geophysical regression tasks, offering both performance and latency advantages for operational InSAR monitoring. The scale of the benchmark and direct metrics provide concrete support for the complexity-penalty observation, with potential to influence architectural choices in remote-sensing applications.
major comments (2)
- [Abstract] Abstract and benchmark description: The headline performance gains and physical justification rest on the LiCSAR patches and 0.3 cycles/pixel PSD threshold fully capturing elastic surface deformation constraints. If the dataset is dominated by low-deformation Sentinel-1 scenes, the observed high-frequency penalties to attention models may be partly by construction, limiting generalization to cases with admissible high-frequency content, discontinuities, or other sensors.
- [PSD analysis] PSD analysis section: The 0.3 cycles/pixel cutoff used to label unphysical artifacts is presented as a diagnostic threshold but appears chosen post-hoc from this specific benchmark rather than derived from first-principles deformation spectra; additional validation or sensitivity analysis across deformation regimes would be needed to support the claim that attention inherently violates smoothness constraints.
minor comments (2)
- [Abstract] Abstract: Details on train/test splits, cross-validation strategy, and the precise attention architectures (e.g., specific transformer blocks or attention variants) tested are missing, which would aid reproducibility of the 34%/51% gains.
- [Results] Methods and results: Clarify whether statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) were performed on the R² and RMSE differences across the 39,724 patches.
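The significance check suggested above could take the form of a paired bootstrap over patches. A hedged sketch, where the per-patch RMSE arrays are synthetic stand-ins centred on the paper's reported numbers, not actual evaluation outputs:

```python
import numpy as np

rng = np.random.default_rng(42)

# Hypothetical per-patch RMSE values (cm) for two models; the means mirror
# the reported 1.01 cm (U-Net) and a ~2x worse attention baseline.
n_patches = 2000
rmse_unet = rng.normal(1.01, 0.30, n_patches).clip(min=0.05)
rmse_attn = rng.normal(2.06, 0.60, n_patches).clip(min=0.05)

def paired_bootstrap_ci(a, b, n_boot=1000, alpha=0.05, seed=0):
    """Bootstrap CI for mean(a - b), resampling patches so pairing is kept."""
    r = np.random.default_rng(seed)
    diffs = a - b
    idx = r.integers(0, len(diffs), size=(n_boot, len(diffs)))
    means = diffs[idx].mean(axis=1)
    return np.quantile(means, [alpha / 2, 1 - alpha / 2])

lo, hi = paired_bootstrap_ci(rmse_unet, rmse_attn)
print(f"95% CI for mean RMSE difference (U-Net - attention): [{lo:.2f}, {hi:.2f}] cm")
# A CI lying entirely below zero would indicate the gap is not a
# patch-sampling artifact.
```

Resampling patch indices (rather than the two arrays independently) preserves the pairing, which matters because both models are scored on the same patches.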
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and recommendation for minor revision. We address each major comment below and will incorporate revisions to clarify the benchmark characteristics and strengthen the PSD analysis.
read point-by-point responses
- Referee: [Abstract] Abstract and benchmark description: The headline performance gains and physical justification rest on the LiCSAR patches and 0.3 cycles/pixel PSD threshold fully capturing elastic surface deformation constraints. If the dataset is dominated by low-deformation Sentinel-1 scenes, the observed high-frequency penalties to attention models may be partly by construction, limiting generalization to cases with admissible high-frequency content, discontinuities, or other sensors.
Authors: We appreciate this observation on potential dataset bias. The LiCSAR benchmark consists of 20 globally distributed frames chosen to capture a range of volcanic, seismic, and tectonic scenarios, but we acknowledge that many Sentinel-1 scenes exhibit moderate rather than extreme deformation. In the revised manuscript, we will add a new paragraph in the benchmark description section with quantitative statistics on deformation magnitudes (mean, max, and distribution of line-of-sight displacements across frames) and a limitations paragraph discussing applicability to high-frequency content, discontinuities, or non-Sentinel sensors. This will make the scope of the claims more precise while preserving the core empirical findings on the tested benchmark. revision: partial
- Referee: [PSD analysis] PSD analysis section: The 0.3 cycles/pixel cutoff used to label unphysical artifacts is presented as a diagnostic threshold but appears chosen post-hoc from this specific benchmark rather than derived from first-principles deformation spectra; additional validation or sensitivity analysis across deformation regimes would be needed to support the claim that attention inherently violates smoothness constraints.
Authors: The 0.3 cycles/pixel value was identified by inspecting the PSD decay of the ground-truth deformation fields, where spectral energy drops sharply above this frequency, consistent with the band-limited nature of elastic deformation. To address the post-hoc concern, the revised version will include (i) explicit reference to geophysical literature on expected deformation spectra and (ii) a sensitivity study showing that the relative performance gap between U-Net and attention models remains stable when the cutoff is varied between 0.2–0.5 cycles/pixel. These additions will better anchor the threshold in both data and theory without overclaiming universality. revision: yes
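The promised sensitivity study can be prototyped in a few lines. In this hedged sketch the two "model outputs" are synthetic stand-ins (a smooth field plus small and large high-frequency perturbations), not the paper's predictions; the point is only that the ranking by high-frequency energy can be checked across the whole 0.2–0.5 cycles/pixel range:

```python
import numpy as np

def hf_fraction(field, cutoff):
    """Fraction of 2D spectral power above `cutoff` (cycles/pixel, radial)."""
    p = np.abs(np.fft.fft2(field)) ** 2
    fy = np.fft.fftfreq(field.shape[0])
    fx = np.fft.fftfreq(field.shape[1])
    fr = np.sqrt(fx[None, :] ** 2 + fy[:, None] ** 2)
    return p[fr > cutoff].sum() / p.sum()

n = 128
y, x = np.mgrid[0:n, 0:n]
truth = np.exp(-((x - 64) ** 2 + (y - 64) ** 2) / (2 * 20.0 ** 2))
rng = np.random.default_rng(1)
pred_a = truth + 0.01 * rng.standard_normal((n, n))  # mildly noisy output
pred_b = truth + 0.05 * rng.standard_normal((n, n))  # output with more HF content

for cutoff in (0.2, 0.3, 0.4, 0.5):
    fa, fb = hf_fraction(pred_a, cutoff), hf_fraction(pred_b, cutoff)
    print(f"cutoff={cutoff:.1f}: A={fa:.5f}  B={fb:.5f}")
# If B > A at every cutoff, the diagnosis does not hinge on the exact
# 0.3 cycles/pixel choice.
```

A stable ordering across cutoffs is exactly the evidence the rebuttal proposes: the threshold becomes a convenience for reporting, not a load-bearing free parameter.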
Circularity Check
No circularity: central claims are direct empirical measurements on held-out benchmark data
full rationale
The paper's primary results consist of measured performance metrics (R²=0.834, RMSE=1.01 cm) from training and evaluating multiple architectures, including a vanilla U-Net and attention-based models, on a fixed held-out portion of the LiCSAR benchmark (39,724 patches). These are observational outcomes from standard supervised regression, not quantities derived from equations that reduce to the inputs by construction. The PSD analysis and 0.3 cycles/pixel threshold serve only as post-hoc physical interpretation of observed high-frequency artifacts; they do not enter any derivation that predicts or forces the reported superiority numbers. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the architectural conclusions. The derivation chain is therefore self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: The LiCSAR dataset patches represent the distribution of operational InSAR phase data for volcanic and seismic monitoring.
- domain assumption: Frequency components above 0.3 cycles/pixel are unphysical for elastic surface deformation fields.
discussion (0)