pith. machine review for the scientific record.

arxiv: 1505.04597 · v1 · submitted 2015-05-18 · 💻 cs.CV

Recognition: 3 theorem links


U-Net: Convolutional Networks for Biomedical Image Segmentation

Olaf Ronneberger, Philipp Fischer, Thomas Brox

Pith reviewed 2026-05-09 01:26 UTC · model claude-opus-4-7

classification 💻 cs.CV
keywords semantic segmentation · fully convolutional networks · biomedical imaging · encoder-decoder · skip connections · data augmentation · electron microscopy · cell tracking

The pith

A symmetric encoder-decoder network with skip connections, trained on a few dozen images with elastic-deformation augmentation, sets a new bar for biomedical image segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that pixel-level segmentation of biomedical images does not require the thousands of annotated examples conventional wisdom assumed, provided the network and the training pipeline are built around that scarcity. The proposed architecture pairs a downsampling contracting path with a symmetric upsampling expanding path; cropped feature maps from the contracting side are concatenated into the expanding side so that fine spatial detail is not lost when context is gathered. The training recipe leans on heavy elastic deformation of the few available images and a loss that explicitly rewards getting the thin separations between touching cells right. With this combination, the same network — trained on 30 electron microscopy slices or on 20-35 light microscopy frames — beats a strong sliding-window baseline on ISBI EM and wins the 2015 ISBI cell tracking challenge in two categories, while segmenting a 512x512 image in under a second.

Core claim

The paper introduces a symmetric encoder-decoder convolutional network — a contracting path that captures context and an expanding path that recovers spatial precision, joined by skip connections that copy high-resolution features across the U — and shows it can be trained end-to-end from only 20-35 annotated microscopy images. Two ingredients make this work: aggressive elastic-deformation augmentation that teaches the network the kind of variability biological tissue actually exhibits, and a per-pixel weighted cross-entropy loss that puts extra mass on the thin background ridges separating touching cells. The result outperforms a sliding-window CNN on ISBI EM neuron segmentation and wins two categories of the 2015 ISBI cell tracking challenge.
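The weighted loss the claim rests on is simple to state. A minimal NumPy sketch in the shape of the paper's Eq. (1) — tensor layout and names here are illustrative, not the paper's Caffe implementation:

```python
import numpy as np

def weighted_cross_entropy(logits, labels, weights):
    """Pixel-wise weighted cross-entropy in the shape of Eq. (1): softmax
    over channels, negative log-likelihood of each pixel's true label,
    scaled by a precomputed weight map w(x).
    logits: (C, H, W) float; labels: (H, W) int; weights: (H, W) float."""
    z = logits - logits.max(axis=0)            # numerically stable softmax
    log_p = z - np.log(np.exp(z).sum(axis=0))
    h, w = labels.shape
    rows, cols = np.arange(h)[:, None], np.arange(w)[None, :]
    nll = -log_p[labels, rows, cols]           # pick p_{l(x)}(x) per pixel
    return float((weights * nll).sum())
```

With uniform logits and unit weights this reduces to H·W·log C, which is a quick sanity check when wiring up a reimplementation.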

What carries the argument

The U-net: a 23-layer fully convolutional network whose contracting half repeatedly downsamples while doubling channels, and whose expanding half upsamples while halving channels, with each decoder stage concatenating the cropped feature map from the matching encoder stage. Two auxiliary mechanisms carry the training: random elastic deformations on a coarse displacement grid as the dominant data-augmentation prior, and a precomputed per-pixel weight map in the cross-entropy loss that boosts the narrow background ridges between touching cells.
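The copy-and-crop skip connection is the architectural core, and it is a one-liner once the size bookkeeping is right. A minimal NumPy sketch (array layout and names are illustrative):

```python
import numpy as np

def crop_and_concat(enc, dec):
    """Copy-and-crop skip connection: center-crop the (larger) encoder
    feature map to the decoder's spatial size, then concatenate along
    the channel axis. Arrays are (channels, height, width); the crop is
    needed because unpadded convolutions shrink the encoder maps less
    than the round trip through the U shrinks the decoder maps."""
    dh = (enc.shape[1] - dec.shape[1]) // 2
    dw = (enc.shape[2] - dec.shape[2]) // 2
    cropped = enc[:, dh:dh + dec.shape[1], dw:dw + dec.shape[2]]
    return np.concatenate([cropped, dec], axis=0)
```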

If this is right

  • Biomedical segmentation no longer requires thousands of pixel-labeled images; tens of carefully annotated frames plus elastic augmentation suffice to reach competition-winning accuracy.
  • Replacing sliding-window patch classifiers with a single fully-convolutional pass cuts inference of a 512x512 image to under a second, making per-pixel labeling practical at acquisition speed.
  • Skip connections that fuse high-resolution encoder features with upsampled decoder features become a reusable design pattern for any task where both context and precise localization matter.
  • A pixel-weighted loss that emphasizes thin separating boundaries lets a semantic-segmentation network resolve instance boundaries between touching objects of the same class, partly closing the gap to instance segmentation.
  • The same architecture transfers across imaging modalities (electron microscopy, phase contrast, DIC) without modality-specific engineering, suggesting a single template for a wide class of microscopy problems.
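The single-pass inference in the second bullet relies on the paper's overlap-tile strategy: mirror-pad the image so every output tile sees full input context, then stitch the valid outputs seamlessly. A hedged NumPy sketch — the default 388/92 geometry follows the paper's 572→388 figure, and `predict` is a hypothetical model interface, not the released Caffe network:

```python
import numpy as np

def tiled_predict(predict, img, out_tile=388, margin=92):
    """Overlap-tile inference: mirror-pad, run the network on overlapping
    input tiles, stitch valid outputs. `predict` maps an
    (out_tile + 2*margin)^2 input to an (out_tile)^2 output.
    Note: reflect padding requires margin < min(img.shape)."""
    h, w = img.shape
    H = -(-h // out_tile) * out_tile           # round up to tile multiples
    W = -(-w // out_tile) * out_tile
    padded = np.pad(img, ((margin, H - h + margin), (margin, W - w + margin)),
                    mode="reflect")            # mirrored borders, as in the paper
    out = np.zeros((H, W))
    for y in range(0, H, out_tile):
        for x in range(0, W, out_tile):
            tile = padded[y:y + out_tile + 2 * margin,
                          x:x + out_tile + 2 * margin]
            out[y:y + out_tile, x:x + out_tile] = predict(tile)
    return out[:h, :w]
```

Substituting a center-crop for `predict` returns the input unchanged, which is a useful correctness check for the stitching arithmetic.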

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The skip-connection trick generalizes far beyond microscopy: any dense-prediction task where output must be both context-aware and pixel-precise (depth, optical flow, medical CT/MRI) inherits the same design pressure, which is roughly what later years confirmed.
  • The boundary-weighted loss is a lightweight stand-in for instance segmentation; it works here because cells are convex blobs separated by thin gaps, and would degrade on objects with complex topology or genuine occlusion.
  • Elastic deformation works as a data prior precisely because biological tissue is locally diffeomorphic; in modalities dominated by photometric or sensor nuisances rather than geometric ones, the same augmentation budget would buy less.
  • The single-image batch with momentum 0.99 is an early instance of trading batch statistics for input resolution under tight GPU memory — a tradeoff that resurfaces whenever models outgrow available memory.

Load-bearing premise

That randomly stretching and warping a handful of training images produces a distribution close enough to real tissue variation for the trained network to generalize — true for the microscopy modalities shown, but an empirical bet rather than a guarantee for modalities where geometric deformation is not the dominant source of variation.
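For concreteness, the deformation model is a coarse grid of random displacement vectors expanded to pixel resolution. A simplified NumPy sketch — the paper uses bicubic interpolation of the displacement field and bilinear image resampling; nearest-neighbour is used here to keep the sketch dependency-free:

```python
import numpy as np

def elastic_deform(img, grid=3, sigma=10.0, rng=None):
    """Random elastic deformation: draw displacement vectors on a coarse
    grid (grid x grid, std sigma pixels), upsample the field to pixel
    resolution, and resample the image at the displaced coordinates.
    Simplification: nearest-neighbour upsampling/resampling in place of
    the paper's bicubic/bilinear interpolation."""
    rng = np.random.default_rng() if rng is None else rng
    h, w = img.shape
    up = np.ones((h // grid + 1, w // grid + 1))
    dx = np.kron(rng.normal(0, sigma, (grid, grid)), up)[:h, :w]
    dy = np.kron(rng.normal(0, sigma, (grid, grid)), up)[:h, :w]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    yy = np.clip(np.rint(ys + dy), 0, h - 1).astype(int)
    xx = np.clip(np.rint(xs + dx), 0, w - 1).astype(int)
    return img[yy, xx]
```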

What would settle it

Retrain the published architecture on the same 30 EM images and 20-35 light-microscopy images and check whether the warping error on ISBI EM falls to ~0.000353 and IOU on PhC-U373 and DIC-HeLa reaches ~0.92 and ~0.78. If those numbers reproduce, the central claim — that this U-shaped network plus elastic augmentation plus boundary-weighted loss yields state-of-the-art biomedical segmentation from tens of images — stands.
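The IOU targets above are Jaccard scores as evaluated by the challenge organizers; the official SEG measure is computed per object, so the following binary-mask version is a simplification, but it is the right shape for a first reproduction check:

```python
import numpy as np

def iou(pred, gt):
    """Foreground intersection-over-union of two binary masks
    (simplified whole-mask version of the challenge's per-object SEG)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    union = np.logical_or(pred, gt).sum()
    if union == 0:
        return 1.0                 # both masks empty: perfect agreement
    return float(np.logical_and(pred, gt).sum() / union)
```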

read the original abstract

There is large consent that successful training of deep networks requires many thousand annotated training samples. In this paper, we present a network and training strategy that relies on the strong use of data augmentation to use the available annotated samples more efficiently. The architecture consists of a contracting path to capture context and a symmetric expanding path that enables precise localization. We show that such a network can be trained end-to-end from very few images and outperforms the prior best method (a sliding-window convolutional network) on the ISBI challenge for segmentation of neuronal structures in electron microscopic stacks. Using the same network trained on transmitted light microscopy images (phase contrast and DIC) we won the ISBI cell tracking challenge 2015 in these categories by a large margin. Moreover, the network is fast. Segmentation of a 512x512 image takes less than a second on a recent GPU. The full implementation (based on Caffe) and the trained networks are available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

4 major / 7 minor

Summary. The paper proposes U-Net, a fully convolutional encoder–decoder architecture with concatenative skip connections at every scale, trained end-to-end on small biomedical image datasets using heavy elastic-deformation augmentation and a pixel-wise weighted cross-entropy loss whose weights (Eq. 2) emphasize narrow separating borders between touching cells. The authors report state-of-the-art performance on the ISBI 2012 EM neuronal-structure segmentation challenge (warping error 0.000353, surpassing Ciresan et al.'s 0.000420) and large-margin wins on the ISBI 2015 cell tracking challenge for the PhC-U373 (IOU 0.9203 vs. 0.83) and DIC-HeLa (0.7756 vs. 0.46) datasets, all using only 20–35 annotated training images. They release the Caffe implementation and trained networks.

Significance. If the central claim holds, U-Net demonstrates that a single architecture, trained from scratch on tens of images via aggressive elastic-deformation augmentation and a boundary-aware loss, can decisively beat strong sliding-window baselines on multiple biomedical segmentation benchmarks while running in under a second per 512×512 image. The architectural idea — a symmetric expansive path with as many feature channels as the contracting path, joined by concatenative skips — is a clean and reproducible refinement of fully convolutional networks (Long et al. [9]). The boundary-weighted loss (Eq. 2) is a concrete and easily reusable trick for instance separation in dense segmentation. The paper ships an open-source Caffe implementation and pretrained weights, which materially aids reproducibility, and the results are tied to public, organizer-evaluated benchmarks (EM challenge with sequestered ground truth; ISBI cell tracking 2015) rather than self-reported splits. The contribution is methodological and empirical rather than theoretical, and is likely to be of broad practical utility in biomedical imaging.

major comments (4)
  1. [§4 Experiments / Tables 1–2] No ablation isolates the contributions of (i) the U-shaped architecture with skip connections, (ii) elastic-deformation augmentation, and (iii) the boundary-weighted loss of Eq. (2). Because the EM-challenge baseline of Ciresan et al. [1] used neither elastic deformation nor a touching-cell weighting, the warping-error gap (0.000420 → 0.000353) is consistent with the training recipe — not the skip architecture — doing most of the work. At minimum, the paper should report (a) U-Net trained without elastic deformation, (b) U-Net trained with uniform w(x), and ideally (c) a plain FCN-style baseline (e.g. [9]) under the identical augmentation+loss pipeline on at least one of the three datasets. Without this, the architectural novelty claim is under-supported as the cause of the wins.
  2. [§4, EM result] The reported warping error of 0.000353 is described as 'averaged over 7 rotated versions of the input data.' Test-time augmentation by 7-fold rotation averaging is itself a non-trivial accuracy booster and should be disentangled from the model's intrinsic performance: please report the single-pass (un-averaged) warping/Rand/pixel errors alongside the averaged numbers, so that the comparison to entries in Table 1 (which are not described as using such averaging) is on equal footing.
  3. [§3, Eq. (2) and weight-map hyperparameters] The weight map relies on w0=10 and σ≈5 px, and the elastic deformation on a 3×3 grid with 10-px standard deviation. These are presented without sensitivity analysis. Given that the central empirical claim depends on this training recipe, a brief sweep over (w0, σ) and the deformation magnitude — even on one dataset — would substantially strengthen the case that the wins are not narrowly tuned to the specific challenge data.
  4. [§4 / Table 2] The cell-tracking comparison reports only a single 'second-best 2015' IOU per dataset with no identification of the competing methods or their training regimes (in particular, whether they used comparable augmentation). Since the reported margin (0.92 vs. 0.83; 0.78 vs. 0.46) is the basis for the 'large margin' claim in the abstract, please name the second-best entries and, where possible, briefly characterize their pipelines so readers can judge whether the gap reflects architecture, training data usage, or both.
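The weight map at issue in comments 1 and 3 follows directly from Eq. (2). A brute-force NumPy sketch — the class-balance term w_c is replaced by a constant 1 for brevity, the distance computation is O(pixels × instance pixels) rather than a proper distance transform, and at least two instances are assumed:

```python
import numpy as np

def boundary_weight_map(labels, w0=10.0, sigma=5.0):
    """Eq. (2) weight map: background pixels gain
    w0 * exp(-(d1 + d2)^2 / (2 sigma^2)), where d1, d2 are the distances
    to the nearest and second-nearest labeled instance. labels: (H, W)
    int array, 0 = background; assumes >= 2 instances. Brute-force
    sketch; the paper precomputes this once per training image."""
    h, w = labels.shape
    ids = [i for i in np.unique(labels) if i != 0]
    ys, xs = np.meshgrid(np.arange(h), np.arange(w), indexing="ij")
    dists = []
    for i in ids:
        cy, cx = np.nonzero(labels == i)
        d = np.sqrt((ys[..., None] - cy) ** 2
                    + (xs[..., None] - cx) ** 2).min(-1)
        dists.append(d)                     # distance to instance i
    dists = np.sort(np.stack(dists), axis=0)
    d1, d2 = dists[0], dists[1]
    wmap = np.ones((h, w))                  # w_c set to 1 for brevity
    bg = labels == 0
    wmap[bg] += w0 * np.exp(-((d1[bg] + d2[bg]) ** 2) / (2 * sigma ** 2))
    return wmap
```

On a toy image with two touching blobs, the ridge pixels between them receive weights an order of magnitude above distant background, which is exactly the emphasis the referee's sensitivity question targets.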
minor comments (7)
  1. [§2 / Fig. 1] The constraint that input tile size must yield an even x/y resolution at every 2×2 max-pool is stated but not given as an explicit formula. A short equation or worked example for arbitrary depth would aid practitioners constructing tiles for their own data.
  2. [§3] The use of batch size 1 with momentum 0.99 is unusual and load-bearing for the optimization story; a single sentence on how this was chosen (vs., e.g., batch size of a few tiles) would help reproducibility.
  3. [§3.1] 'Drop-out layers at the end of the contracting path perform further implicit data augmentation' — please specify the dropout rate and exact layer placement; this is currently ambiguous from text and Fig. 1.
  4. [§3] Eq. (1): the symbol ℓ(x) is used both as the true label function and in the subscript of p; consider rewriting as p_{ℓ(x)}(x) consistently and defining ℓ before it appears in the loss.
  5. [Fig. 1] The figure caption could benefit from explicitly noting that convolutions are unpadded ('valid'), since this is what produces the cropping arrows and the input/output size mismatch (572 → 388).
  6. [§4] The training-set sizes (30 EM images, 35 PhC-U373, 20 DIC-HeLa) and any train/val split used for early stopping or hyperparameter selection should be stated; currently only the totals are given.
  7. [References] Reference [10] is truncated as 'Maska, M., (...), de Solorzano, C.O.'; please give the full author list as published in Bioinformatics.
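The explicit formula requested in minor comment 1 amounts to tracing the spatial size down and back up the U. A sketch for the paper's depth-4, valid-convolution geometry (names are illustrative):

```python
def unet_output_size(n, depth=4):
    """Valid output size of a U-Net with unpadded 3x3 convolutions for an
    input tile of size n, or None when some 2x2 max-pool would see an odd
    resolution (the tile-size constraint stated for Fig. 1)."""
    for _ in range(depth):
        n -= 4                  # two 3x3 valid convolutions
        if n <= 0 or n % 2:
            return None         # must be even before each 2x2 max-pool
        n //= 2
    n -= 4                      # bottleneck convolutions
    for _ in range(depth):
        n = 2 * n - 4           # 2x2 up-conv, then two 3x3 valid convolutions
    return n if n > 0 else None
```

This reproduces the 572 → 388 input/output pair from Fig. 1 and rejects 573, which fails the evenness constraint at the first pool.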

Simulated Author's Rebuttal

4 responses · 1 unresolved

We thank the referee for a careful and constructive report. The four major comments all concern attribution and contextualisation of the empirical claims rather than the validity of the results themselves, and we agree with the substance of each. In the revision we will (1) add an ablation on the EM dataset isolating the elastic-deformation augmentation and the boundary-weighted loss; (2) report single-pass EM errors alongside the 7-rotation-averaged numbers in Table 1; (3) add a sensitivity sweep over (w0, σ) and the elastic-deformation magnitude on DIC-HeLa; and (4) name the second-best ISBI 2015 cell-tracking entries in Table 2 and briefly characterise their pipelines, while tightening the abstract's 'large margin' phrasing to a quantitative statement. One element of comment (1) — a fully retrained FCN-[9] baseline under our identical pipeline — we address only partially, because FCN as published depends on ImageNet-pretrained VGG initialisation and a like-for-like comparison from scratch on ~30 images requires design choices we believe lie beyond a minor revision; we will discuss this limitation explicitly rather than paper over it.

read point-by-point responses
  1. Referee: No ablation isolates the contributions of (i) the U-shaped architecture with skip connections, (ii) elastic-deformation augmentation, and (iii) the boundary-weighted loss. Without ablation, the EM warping-error gap may be due to the training recipe rather than the architecture; please add (a) U-Net without elastic deformation, (b) U-Net with uniform w(x), (c) ideally a plain FCN baseline under identical pipeline.

    Authors: We agree that a controlled ablation would sharpen the attribution of credit, and we will add one in the revision. Specifically, we will retrain on the EM dataset with (a) elastic deformation disabled (keeping only shift/rotation/gray-value augmentation), and (b) uniform w(x)=w_c(x), i.e. removing the exp(-(d1+d2)^2/2σ^2) boundary term, and report warping/Rand/pixel error for each. We will also clarify in §3 the role each component is intended to play: the boundary-weighted loss specifically targets instance separation between touching cells (most relevant for DIC-HeLa, Fig. 3), while elastic deformation targets generalisation from very few images. We are less able to commit, within a minor revision, to a fully retrained FCN-[9] baseline under the identical pipeline on the challenge data, because the FCN of Long et al. relies on ImageNet-pretrained VGG initialisation, which is a different regime from training from scratch on ~30 images; a fair head-to-head therefore requires non-trivial design choices that we feel exceed a minor revision. We will however discuss this caveat explicitly so readers do not over-attribute the gains to the skip architecture alone. revision: partial

  2. Referee: The EM warping error 0.000353 is averaged over 7 rotated versions of the input. Test-time rotation averaging is itself a non-trivial booster and should be disentangled from intrinsic performance; please report single-pass (un-averaged) warping/Rand/pixel errors alongside the averaged numbers.

    Authors: Agreed. The 7-fold rotation/flip averaging exploits the approximate dihedral symmetry of EM sections and is a standard trick, but the referee is right that Table 1 entries are not annotated as using it, so the comparison should be made on equal footing. In the revision we will report the single-pass warping, Rand and pixel errors of a single U-Net (no test-time averaging) alongside the 7-rotation-averaged numbers, and we will add a sentence noting that the gain attributable purely to test-time averaging can thus be read off directly. We will keep the averaged number as our official challenge submission, since the challenge does not forbid test-time augmentation, but the table will make both numbers visible. revision: yes
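The test-time averaging under discussion fits in a few lines. The paper reports averaging over 7 rotated versions; this illustrative NumPy sketch averages the full 8-element dihedral group instead, and `predict` is a hypothetical single-pass model interface:

```python
import numpy as np

def rotation_averaged(predict, img):
    """Dihedral test-time augmentation: run `predict` on rotated/flipped
    copies of the image, undo each transform on the prediction, and
    average the aligned predictions."""
    preds = []
    for flip in (False, True):
        x = img[:, ::-1] if flip else img
        for k in range(4):
            p = np.rot90(predict(np.rot90(x, k)), -k)   # undo rotation
            preds.append(p[:, ::-1] if flip else p)     # undo flip
    return np.mean(preds, axis=0)
```

With an identity `predict`, all eight aligned predictions coincide with the input, so the average returns the image unchanged — a quick check that the inverse transforms are applied correctly.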

  3. Referee: The weight map (w0=10, σ≈5 px) and elastic deformation (3×3 grid, 10-px std) hyperparameters are presented without sensitivity analysis. A brief sweep on one dataset would strengthen the case that the wins are not narrowly tuned.

    Authors: We will add a small sensitivity study on the DIC-HeLa dataset, which is the setting where both knobs matter most (touching cells; only 20 training images). Concretely we will sweep w0 ∈ {0, 5, 10, 20} with σ fixed, σ ∈ {3, 5, 10} px with w0 fixed, and the elastic-deformation displacement std ∈ {0, 5, 10, 20} px on the 3×3 control grid, reporting IOU for each. We chose the original values by visual inspection of the resulting weight maps and deformation fields rather than by tuning on a held-out split, and we will state this explicitly. We do not expect the ranking against competitors to flip within a reasonable neighbourhood of these settings, but the referee is correct that this should be demonstrated rather than asserted. revision: yes

  4. Referee: The cell-tracking comparison reports only a single 'second-best 2015' IOU per dataset, without identifying the competing methods or their training regimes. Please name the second-best entries and briefly characterise their pipelines so readers can judge whether the gap reflects architecture, training data, or both.

    Authors: We will identify the competing entries by team name in the revised Table 2, using the official ISBI 2015 cell tracking challenge leaderboard, and we will add a one-sentence characterisation of each second-best pipeline (feature type, classifier/segmenter family, and whether augmentation was reported) to the extent that this is documented by the organisers or the participating teams. Where the entry's training regime is not publicly described in sufficient detail, we will say so rather than speculate. We will also soften the abstract phrasing from 'by a large margin' to a more specific quantitative statement (e.g. '+9 IOU points on PhC-U373, +31 on DIC-HeLa over the next-best 2015 entry') so the claim is anchored to the numbers in the table rather than to a qualitative descriptor. revision: yes

standing simulated objections not resolved
  • A fully matched FCN-[9] baseline trained from scratch on the EM dataset under our identical augmentation and loss pipeline is not provided. FCN as published relies on ImageNet-pretrained VGG features, so a fair from-scratch reimplementation involves design choices (initialisation, depth, channel counts) that we cannot resolve uncontroversially within a minor revision; we will instead flag this as a limitation of the architectural-attribution claim.

Circularity Check

0 steps flagged

No circularity: results are evaluated against external, held-out benchmarks with secret ground truth.

full rationale

The paper's central empirical claims (warping error 0.000353 on the ISBI EM segmentation challenge; IOU 0.9203 / 0.7756 on the ISBI 2015 cell tracking challenge) are evaluated by third-party challenge organizers on test data whose ground truth is withheld from the authors ("The test set is publicly available, but its segmentation maps are kept secret. An evaluation can be obtained by sending the predicted membrane probability map to the organizers."). The numerical comparisons in Tables 1 and 2 are against other teams' submissions to the same external leaderboards. None of the load-bearing claims are fitted on the same data they are then evaluated on, none rely on a self-citation as a load-bearing uniqueness/ansatz import, and no quantity is renamed and represented as a derivation. The skeptic's concern raised in the reader's take — that no ablation isolates architecture from augmentation/weighted-loss contributions — is a legitimate attribution/causal-identification concern, but per the rubric this falls under correctness/attribution risk, not circularity: the paper does not define the architectural contribution in terms of the benchmark numbers, nor fit parameters to the test set. Self-citations in the references are to standard tools (Caffe, He et al. initialization, FCN, etc.) and are not load-bearing for any "uniqueness" or "forced" claim. Score: 0.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper's load-bearing additions beyond standard CNN machinery are: a specific architectural design, two augmentation/loss tricks, and a handful of hyperparameters chosen for the experiments. There are no invented physical entities, no unverifiable postulates. The free parameters are conventional ML hyperparameters and are honestly reported as choices. Standard-math/standard-DL background (SGD, ReLU, cross-entropy, He initialization) is cited from prior literature.

pith-pipeline@v0.9.0 · 9528 in / 5310 out tokens · 84875 ms · 2026-05-09T01:26:43.686382+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Forward citations

Cited by 44 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. REALISTA: Realistic Latent Adversarial Attacks that Elicit LLM Hallucinations

    cs.CL 2026-05 unverdicted novelty 8.0

    REALISTA optimizes continuous combinations of valid editing directions in latent space to produce realistic adversarial prompts that elicit hallucinations more effectively than prior methods, including on large reason...

  2. LatentHDR: Decoupling Exposure from Diffusion via Conditional Latent-to-Latent Mapping for Text/Image-to-Panoramic HDR

    cs.CV 2026-05 unverdicted novelty 7.0

    LatentHDR generates structurally consistent panoramic HDR images by producing one scene latent with a diffusion backbone then deterministically mapping it to multiple exposure latents via a lightweight conditional head.

  3. EchoXFlow: A Beamspace Echocardiography Dataset for Cardiac Motion, Flow, and Function

    cs.CV 2026-05 unverdicted novelty 7.0

    EchoXFlow is a new dataset of 37,125 beamspace echocardiography recordings with separable modalities, Doppler data, ECG, and clinical annotations that enables acquisition-aware learning not possible with standard scan...

  4. Generative diffusion models for spatiotemporal influenza forecasting

    cs.LG 2026-04 unverdicted novelty 7.0

    Influpaint uses generative diffusion models on image-encoded influenza data to produce realistic and diverse epidemic trajectories that match leading ensemble methods in accuracy.

  5. VitaminP: cross-modal learning enables whole-cell segmentation from routine histology

    cs.CV 2026-04 unverdicted novelty 7.0

    VitaminP uses paired H&E-mIF data to train a model that transfers molecular boundary information, enabling accurate whole-cell segmentation directly from routine H&E histology across 34 cancer types.

  6. Physics-informed, Generative Adversarial Design of Funicular Shells

    cs.CE 2026-04 unverdicted novelty 7.0

    A modified DCGAN with an auxiliary discriminator using the membrane factor generates stable, previously unseen funicular shells optimized for pure compression in three dimensions.

  7. Machine Learning Phase Field Reconstruction in a Bose-Einstein Condensate

    cond-mat.quant-gas 2026-04 unverdicted novelty 7.0

    A U-Net-based ML pipeline reconstructs the complete phase field and quantized vortex charges in 2D Bose-Einstein condensates from density snapshots alone, using synthetic training data from projected Gross-Pitaevskii ...

  8. Dual Triangle Attention: Effective Bidirectional Attention Without Positional Embeddings

    q-bio.QM 2026-04 unverdicted novelty 7.0

    Dual Triangle Attention achieves effective bidirectional attention with built-in positional inductive bias via dual triangular masks, outperforming standard bidirectional attention on position-sensitive tasks and show...

  9. Diffusion Processes on Implicit Manifolds

    cs.LG 2026-04 unverdicted novelty 7.0

    Implicit Manifold-valued Diffusions (IMDs) are data-driven SDEs built from proximity graphs that converge in law to smooth manifold diffusions as sample count increases.

  10. A General Bézier Tree Encoding Counterfactual Framework for Retinal-Vessel-Mediated Disease Analysis

    eess.IV 2026-05 unverdicted novelty 6.0

    BTECF encodes retinal vessels as Bézier trees to enable targeted, parameter-level counterfactual interventions on vessel geometry for causal analysis of vascular diseases.

  11. EDGER: EDge-Guided with HEatmap Refinement for Generalizable Image Forgery Localization

    cs.CV 2026-05 unverdicted novelty 6.0

    A dual-branch system using frequency edge cues and CLIP-based synthetic patch detection for accurate, resolution-independent image forgery localization.

  12. Geometry-aware Prototype Learning for Cross-domain Few-shot Medical Image Segmentation

    cs.CV 2026-05 unverdicted novelty 6.0

    GeoProto enriches appearance prototypes with geometric offsets from an ordinal shape branch to improve cross-domain few-shot medical image segmentation.

  13. Don't Fix the Basis -- Learn It: Spectral Representation with Adaptive Basis Learning for PDEs

    cs.LG 2026-05 unverdicted novelty 6.0

    ABLE learns a spatially adaptive Parseval frame from data via an ancillary density to replace fixed bases in spectral neural operators for PDEs.

  14. StereoPolicy: Improving Robotic Manipulation Policies via Stereo Perception

    cs.RO 2026-05 unverdicted novelty 6.0

    StereoPolicy fuses stereo image pairs via a Stereo Transformer on pretrained 2D encoders to boost robotic manipulation policies, showing gains over monocular, RGB-D, point cloud, and multi-view methods in simulations ...

  15. Diffusion model for SU(N) gauge theories

    hep-lat 2026-05 unverdicted novelty 6.0

    Implicit score matching trains diffusion models that successfully sample SU(3) Wilson gauge configurations on lattices, with a Hamiltonian-dynamics corrector needed for strong coupling.

  16. Leveraging Image Generators to Address Training Data Scarcity: The Gen4Regen Dataset for Forest Regeneration Mapping

    cs.CV 2026-05 conditional novelty 6.0

    Mixing real UAV imagery with 2101 AI-generated image-mask pairs improves semantic segmentation F1 scores for fine-grained forest species by over 15 percentage points overall and up to 30 points for rare classes.

  17. A CNN--Transformer Denoiser for low-$S/N$ Galaxy Spectra: Stellar Population Recovery in Synthetic Tests

    astro-ph.GA 2026-05 unverdicted novelty 6.0

    A hybrid CNN-Transformer denoiser trained on synthetic spectra substantially reduces noise and improves stellar population recovery for low-S/N galaxy observations in controlled tests.

  18. Approaching human parity in the quality of automated organoid image segmentation

    cs.CV 2026-05 conditional novelty 6.0

    A composite SAM-based method segments organoid images with accuracy matching or approaching inter-observer variability among human annotators.

  19. When Less Is More: Simplicity Beats Complexity for Physics-Constrained InSAR Phase Unwrapping

    cs.CV 2026-04 accept novelty 6.0

    A vanilla U-Net with 7.76M parameters achieves R²=0.834 and RMSE=1.01 cm on a global InSAR benchmark, beating larger attention models by 34% in R² and 51% in RMSE while running 2.5× faster.

  20. MG-NECOLA: A Field-Level Emulator for $f(R)$ Gravity and Massive Neutrino Cosmologies

    astro-ph.CO 2026-04 conditional novelty 6.0

    A field-level CNN emulator converts MG-PICOLA runs into near N-body accuracy for f(R) gravity and neutrino cosmologies, achieving sub-percent errors on power spectra and bispectra while generalizing beyond its training set.

  21. From Boundaries to Semantics: Prompt-Guided Multi-Task Learning for Petrographic Thin-section Segmentation

    cs.CV 2026-04 unverdicted novelty 6.0

    Petro-SAM adapts SAM via a Merge Block for polarized views plus multi-scale fusion and color-entropy priors to jointly achieve grain-edge and lithology segmentation in petrographic images.

  22. Self-supervised Pretraining of Cell Segmentation Models

    cs.CV 2026-04 unverdicted novelty 6.0

    DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

  23. GIF: A Conditional Multimodal Generative Framework for IR Drop Imaging in Chip Layouts

    cs.CV 2026-04 unverdicted novelty 6.0

    GIF fuses geometrical image features and logical graph topology in a conditional diffusion model to generate high-quality IR drop images for chip layouts, outperforming prior ML methods on CircuitNet-N28 with SSIM 0.7...

  24. ELT: Elastic Looped Transformers for Visual Generation

    cs.CV 2026-04 unverdicted novelty 6.0

    Elastic Looped Transformers share weights across recurrent blocks and apply intra-loop self-distillation to deliver 4x parameter reduction while matching competitive FID and FVD scores on ImageNet and UCF-101.

  25. MRI-to-CT synthesis using drifting models

    eess.IV 2026-03 unverdicted novelty 6.0

    Drifting models outperform diffusion, CNN, VAE, and GAN baselines in MRI-to-CT synthesis on two pelvis datasets with higher SSIM/PSNR, lower RMSE, and millisecond one-step inference.

  26. Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets

    cs.CV 2023-11 conditional novelty 6.0

    Stable Video Diffusion scales latent video diffusion models via text-to-image pretraining, video pretraining on curated data, and high-quality finetuning to produce competitive text-to-video and image-to-video results...

  27. SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

    cs.CV 2023-07 conditional novelty 6.0

    SDXL improves upon prior Stable Diffusion versions through a larger UNet backbone, dual text encoders, novel conditioning, and a refinement model, producing higher-fidelity images competitive with black-box state-of-t...

  28. TRAS: An Interactive Software for Tracing Tree Ring Cross Sections

    cs.CV 2026-05 unverdicted novelty 5.0

    TRAS integrates CS-TRD, DeepCS-TRD, and INBD detection methods with an interactive GUI, achieving 81% F-score on 18 Pinus taeda images while reducing manual correction to ~20% of boundaries and matching CooRecorder ri...

  29. Scalable Active Metamaterials for Shape-Morphing

    cs.CE 2026-05 unverdicted novelty 5.0

    A hierarchical SAM framework decouples macroscale mesh optimization from microscale inverse design to enable fast scalable creation of aperiodic shape-morphing metamaterials.

  30. Full-chip CMP modelling based on Fully Convolutional Network leveraging White Light Interferometry

    cs.LG 2026-05 unverdicted novelty 5.0

    A fully convolutional network trained separately on WLI and AFM data predicts full-chip post-CMP nanotopography at nanometer accuracy.

  31. Flow matching for Sentinel-2 super-resolution: implementation, application, and implications

    cs.CV 2026-05 unverdicted novelty 5.0

    Flow matching achieves single-step pixel accuracy and 20-step perceptual quality for Sentinel-2 super-resolution, outperforming diffusion and Real-ESRGAN while enabling large-scale 2.5 m land-cover products.

  32. End-to-end Automated Deep Neural Network Optimization for PPG-based Blood Pressure Estimation on Wearables

    cs.LG 2026-04 unverdicted novelty 5.0

    An end-to-end hardware-aware optimization pipeline produces DNNs for PPG-based blood pressure estimation with up to 7.99% lower error and 83x fewer parameters that fit on ultra-low-power SoCs like GAP8.

  33. World Action Models: The Next Frontier in Embodied AI

    cs.RO 2026-05 unverdicted novelty 4.0

    The paper introduces World Action Models as a new paradigm unifying predictive world modeling with action generation in embodied foundation models and provides a taxonomy of existing approaches.

  34. Deep Learning-Based Segmentation of Peritoneal Cancer Index Regions from CT Imaging

    cs.CV 2026-04 unverdicted novelty 4.0

    nnU-Net segments rPCI regions on 62 CT scans with mean Dice 0.82, nearing inter-observer agreement of 0.88 and beating Swin UNETR at 0.76.

  35. KAYRA: A Microservice Architecture for AI-Assisted Karyotyping with Cloud and On-Premise Deployment

    cs.LG 2026-04 unverdicted novelty 4.0

    KAYRA packages a cascade of EfficientNet-B5 + U-Net, Mask R-CNN, and ResNet-18 models into a microservice architecture that supports both cloud and on-premise deployment and reaches 98.91% segmentation accuracy in a p...

  36. A Deep U-Net Framework for Flood Hazard Mapping Using Hydraulic Simulations of the Wupper Catchment

    cs.LG 2026-04 unverdicted novelty 4.0

    A U-Net surrogate model trained on hydraulic simulations predicts maximum water levels for flood hazard mapping in the Wupper catchment with results comparable to the original simulations.

  37. A Wasserstein GAN-based climate scenario generator for risk management and insurance: the case of soil subsidence

    cs.LG 2026-04 unverdicted novelty 4.0

    A conditional Wasserstein GAN generates plausible future SWI drought trajectories for French insurance risk management under climate change.

  38. Learning to count small and clustered objects with application to bacterial colonies

    cs.CV 2026-04 unverdicted novelty 4.0

    ACFamNet Pro reaches 9.64% mean normalized absolute error on bacterial colony images under 5-fold cross-validation, beating FamNet by 12.71%.

  39. AI Approach for MRI-only Full-Spine Vertebral Segmentation and 3D Reconstruction in Paediatric Scoliosis

    cs.CV 2026-04 unverdicted novelty 4.0

    An AI pipeline using GAN-generated MRI-like images and U-Net segmentation produces automated 3D thoracolumbar spine reconstructions from MRI with 88% Dice score and reduces processing time from 1 hour to under 1 minut...

  40. DigiForest: Digital Analytics and Robotics for Sustainable Forestry

    cs.RO 2026-04 unverdicted novelty 4.0

    DigiForest integrates heterogeneous autonomous robots for data collection, automated tree trait extraction, a decision support system for growth forecasting, and autonomous harvesters for selective logging, with real-...

  41. AMO-ENE: Attention-based Multi-Omics Fusion Model for Outcome Prediction in Extra Nodal Extension and HPV-associated Oropharyngeal Cancer

    eess.IV 2026-04 unverdicted novelty 4.0

    An attention-based fusion model combining semi-supervised CT segmentation, radiomics, and clinical features predicts metastatic recurrence, overall survival, and disease-free survival in HPV+ oropharyngeal cancer with...

  42. Uncertainty Estimation for Deep Reconstruction in Aquatic Disaster Scenarios with Autonomous Vehicles

    cs.RO 2026-04 unverdicted novelty 4.0

    Evidential Deep Learning outperforms other methods in accuracy, calibration, and speed for uncertainty-aware scalar field reconstruction in aquatic environments using autonomous vehicles.

  43. SAGE-GAN: Towards Realistic and Robust Segmentation of Spatially Ordered Nanoparticles via Attention-Guided GANs

    cs.CV 2026-04 unverdicted novelty 4.0

    SAGE-GAN integrates a self-attention U-Net into a CycleGAN framework to generate realistic synthetic electron microscopy image-mask pairs that augment training data for nanoparticle segmentation without human labeling.

  44. Machine Learning as a Transformative Tool for (Exo-)Planetary Science

    astro-ph.EP 2026-04 unverdicted novelty 2.0

    The paper reviews ML applications for sequence modeling, pattern recognition, and generative Bayesian analysis to tackle heterogeneous data challenges in (exo)planetary science.

Reference graph

Works this paper leans on

14 extracted references · 14 canonical work pages · cited by 44 Pith papers

  1. [1]

    In: NIPS

    Ciresan, D.C., Gambardella, L.M., Giusti, A., Schmidhuber, J.: Deep neural networks segment neuronal membranes in electron microscopy images. In: NIPS. pp. 2852–2860 (2012)

  2. [2]

    In: NIPS (2014)

    Dosovitskiy, A., Springenberg, J.T., Riedmiller, M., Brox, T.: Discriminative unsupervised feature learning with convolutional neural networks. In: NIPS (2014)

  3. [3]

    In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)

    Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2014)

  4. [4]

    Hariharan, B., Arbeláez, P., Girshick, R., Malik, J.: Hypercolumns for object segmentation and fine-grained localization (2014), arXiv:1411.5752 [cs.CV]

  5. [5]

    He, K., Zhang, X., Ren, S., Sun, J.: Delving deep into rectifiers: Surpassing human-level performance on imagenet classification (2015), arXiv:1502.01852 [cs.CV]

  6. [6]

    Jia, Y., Shelhamer, E., Donahue, J., Karayev, S., Long, J., Girshick, R., Guadarrama, S., Darrell, T.: Caffe: Convolutional architecture for fast feature embedding (2014), arXiv:1408.5093 [cs.CV]

  7. [7]

    In: NIPS

    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. In: NIPS. pp. 1106–1114 (2012)

  8. [8]

    Neural Computation 1(4), 541–551 (1989)

    LeCun, Y., Boser, B., Denker, J.S., Henderson, D., Howard, R.E., Hubbard, W., Jackel, L.D.: Backpropagation applied to handwritten zip code recognition. Neural Computation 1(4), 541–551 (1989)

  9. [9]

    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation (2014), arXiv:1411.4038 [cs.CV]

  10. [10]

    Bioinformatics 30, 1609–1617 (2014)

    Maska, M., (...), de Solorzano, C.O.: A benchmark for comparison of cell tracking algorithms. Bioinformatics 30, 1609–1617 (2014)

  11. [11]

    In: Computer Vision (ICCV), 2013 IEEE International Conference on

    Seyedhosseini, M., Sajjadi, M., Tasdizen, T.: Image segmentation with cascaded hierarchical models and logistic disjunctive normal networks. In: Computer Vision (ICCV), 2013 IEEE International Conference on. pp. 2168–2175 (2013)

  12. [12]

    Simonyan, K., Zisserman, A.: Very deep convolutional networks for large-scale image recognition (2014), arXiv:1409.1556 [cs.CV]

  13. [13]

    WWW: Web page of the cell tracking challenge, http://www.codesolorzano.com/celltrackingchallenge/Cell_Tracking_Challenge/Welcome.html

  14. [14]

    WWW: Web page of the EM segmentation challenge, http://brainiac2.mit.edu/isbi_challenge/

    U-Net implementation, trained networks and supplementary material available at http://lmb.informatik.uni-freiburg.de/people/ronneber/u-net