
arxiv: 2604.12752 · v2 · submitted 2026-04-14 · 💻 cs.CV

Scaling In-Context Segmentation with Hierarchical Supervision

Pith reviewed 2026-05-10 15:11 UTC · model grok-4.3

classification 💻 cs.CV
keywords in-context learning · medical image segmentation · hierarchical supervision · patch selection · efficient attention · out-of-domain generalization · CT segmentation

The pith

PatchICL uses selective patching and hierarchical supervision to reduce compute in in-context medical image segmentation by 44 percent while remaining competitive with global-attention accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PatchICL as a way to make in-context learning practical for medical segmentation by replacing dense global cross-attention with learned selection of informative image patches. Hierarchical supervision trains the model to focus computation only on regions that matter for the current example. This matters for clinical use because it lowers the cost of adapting a model to new anatomical structures from just a few labeled scans, which could expand the range of cases that can be handled without new full annotations. On CT data the method achieves accuracy competitive with UniverSeg, a strong global-attention baseline, while using 44 percent less compute for 512 by 512 images. It also outperforms that baseline on 6 of 13 out-of-domain modality categories, most clearly on modalities with focal rather than diffuse features, such as OCT and dermoscopy.

Core claim

PatchICL is a hierarchical framework that combines selective image patching with multi-level supervision so the model learns to actively identify and attend only to the most informative anatomical regions, achieving competitive in-domain CT segmentation accuracy while reducing compute by 44 percent at 512 by 512 resolution and outperforming a global-attention baseline on 6 of 13 out-of-domain modality categories.

What carries the argument

Hierarchical supervision that trains a patch-selection module to restrict attention to the most relevant anatomical regions.
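To make the selection step concrete, here is a minimal sketch of score-based patch selection. It is not the authors' implementation: the function names, the non-overlapping patch grid, and the scores are all assumptions made for illustration. The optional noise term follows the Gumbel-top-k trick that appears in the paper's reference list; whether and how PatchICL uses it is not stated in the text above.

```python
import math
import random


def split_into_patches(height, width, patch):
    """Return the (row, col) top-left corners of a non-overlapping patch grid."""
    return [(r, c) for r in range(0, height, patch) for c in range(0, width, patch)]


def select_top_k(scores, k, temperature=0.0, rng=None):
    """Pick k patch indices from per-patch scores (hypothetical selector output).

    temperature == 0 gives plain deterministic top-k.
    temperature > 0 perturbs each score with Gumbel noise before taking the
    top k, which samples k distinct indices (Gumbel-top-k trick).
    """
    rng = rng or random.Random(0)
    keyed = []
    for index, score in enumerate(scores):
        noise = 0.0
        if temperature > 0:
            # Clamp u away from 0 and 1 so both logarithms stay finite.
            u = min(max(rng.random(), 1e-12), 1 - 1e-12)
            noise = -math.log(-math.log(u)) * temperature
        keyed.append((score + noise, index))
    keyed.sort(reverse=True)
    return sorted(index for _, index in keyed[:k])


# Illustrative values only: a 512 by 512 image cut into 128-pixel patches.
corners = split_into_patches(512, 512, 128)
scores = [0.1] * len(corners)
scores[5] = 2.0   # pretend the selector rates these two patches highly
scores[6] = 1.5
selected = select_top_k(scores, k=2)
sampled = select_top_k(scores, k=4, temperature=1.0)
```

In this toy setup only the selected patches would be passed on to attention. The supervision that teaches a selector to produce useful scores is the part the paper contributes, and it is not modeled here.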

If this is right

  • PatchICL achieves accuracy competitive with UniverSeg on in-domain CT segmentation at 512 by 512 resolution while using 44 percent less compute.
  • The method outperforms the global-attention baseline on 6 of 13 modality categories across 35 out-of-domain datasets.
  • Gains are largest on modalities dominated by localized pathology such as OCT and dermoscopy.
  • Explicit supervision on the selection process reduces redundant computation in non-informative image regions.
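The scaling argument behind these points can be illustrated with a toy cost model. This is a sketch under stated assumptions, not the paper's FLOP accounting: it counts only query-key pairs in a single cross-attention layer, assumes square non-overlapping patches, and uses hypothetical function names. The 44 percent figure is a measurement reported by the paper and is not derived from this model.

```python
def num_tokens(height, width, patch):
    """Number of non-overlapping patch tokens in an image."""
    return (height // patch) * (width // patch)


def dense_pairs(n_target, n_context):
    """Query-key pairs evaluated by dense cross-attention."""
    return n_target * n_context


def relative_cost(keep_fraction):
    """Pair count relative to dense attention when the same fraction of
    patches is kept on both the target and the context side."""
    return keep_fraction ** 2


tokens_256 = num_tokens(256, 256, 16)
tokens_512 = num_tokens(512, 512, 16)

# Doubling the side length quadruples the tokens, so dense pairs grow 16x.
growth = dense_pairs(tokens_512, tokens_512) / dense_pairs(tokens_256, tokens_256)

# Keeping half of the patches on each side leaves a quarter of the pairs.
halved = relative_cost(0.5)
```

Under these assumptions the pair count grows with the square of the token count, which is why restricting attention to selected patches becomes more attractive as resolution increases.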

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same selection-plus-hierarchy pattern could be tested on other dense-prediction tasks that currently rely on full cross-attention.
  • If patch selection proves reliable, clinical systems could process higher-resolution scans in real time without proportional increases in hardware cost.
  • When pathology is diffuse rather than focal, the current hierarchical signal might need an additional global-context term to avoid under-segmentation.

Load-bearing premise

The learned patch selection will always capture every region needed for accurate segmentation even when the target anatomy or pathology is not localized.

What would settle it

A new out-of-domain dataset with diffuse pathology spread across the entire image would settle it. If PatchICL produces clearly lower Dice scores than the global baseline there while still using far fewer patches, the premise fails.
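For readers less familiar with the metric, the Dice score used in this comparison can be computed as below. This is a minimal sketch for binary masks given as nested lists; the function name and the convention of returning 1.0 when both masks are empty are assumptions, and the paper's evaluation code may handle edge cases differently.

```python
def dice_score(pred, target):
    """Dice = 2 * |P ∩ T| / (|P| + |T|) for binary masks of equal shape."""
    intersection = sum(
        p * t
        for pred_row, target_row in zip(pred, target)
        for p, t in zip(pred_row, target_row)
    )
    total = sum(map(sum, pred)) + sum(map(sum, target))
    if total == 0:
        # Assumed convention: two empty masks count as a perfect match.
        return 1.0
    return 2 * intersection / total


# Toy masks: the prediction covers two pixels, the target one of them.
pred = [[1, 1], [0, 0]]
target = [[1, 0], [0, 0]]
score = dice_score(pred, target)  # 2 * 1 / (2 + 1)
```

A selector that misses part of a diffuse target lowers the intersection term without lowering the target size, so the score drops even when the selected regions themselves are segmented well.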

Figures

Figures reproduced from arXiv: 2604.12752 by Marco Reisert, Robin T. Schirrmeister, T. Camaret Ndir.

Figure 1: Coarse-to-fine patch selection across resolution levels on ...
Figure 2: Overview of the PatchICL architecture. Image patches ...
Figure 3: Patch-level attention mechanism. Target and context ...
Figure 4: Dice score and computational cost (GFLOPs) as a func ...
Figure 5: Qualitative comparison on TotalSegmentator CT held ...
Figure 6: Additional coarse-to-fine patch selection examples on TotalSegmentator CT. Columns show the target image with patch boxes, ...
Figure 7: Cross-modality comparison on TotalSegmentator MRI ...
Original abstract

In-context learning (ICL) enables medical image segmentation models to adapt to new anatomical structures from limited examples, reducing the clinical annotation burden. However, standard ICL methods typically rely on dense, global cross-attention, which scales poorly with image resolution. While recent approaches have introduced localized attention mechanisms, they often lack explicit supervision on the selection process, leading to redundant computation in non-informative regions. We propose PatchICL, a hierarchical framework that combines selective image patching with multi-level supervision. Our approach learns to actively identify and attend only to the most informative anatomical regions. Compared to UniverSeg, a strong global-attention baseline, PatchICL achieves competitive in-domain CT segmentation accuracy while reducing compute by 44% at 512×512 resolution. On 35 out-of-domain datasets spanning diverse imaging modalities, PatchICL outperforms the baseline on 6 of 13 modality categories, with particular strength on modalities dominated by localized pathology such as OCT and dermoscopy. Training and evaluation code are available at https://github.com/tidiane-camaret/ic_segmentation

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PatchICL, a hierarchical in-context learning framework for medical image segmentation that combines selective image patching with multi-level supervision to focus computation on informative regions. It claims competitive in-domain CT segmentation accuracy versus the global-attention baseline UniverSeg while achieving a 44% compute reduction at 512×512 resolution, plus outperformance on 6 of 13 out-of-domain modality categories (with noted strength on localized pathologies such as OCT and dermoscopy) across 35 datasets.

Significance. If the results prove robust, the work could meaningfully advance scalable in-context segmentation for high-resolution medical images by reducing compute without sacrificing accuracy, directly addressing annotation burden in clinical settings. The open-sourced code at the provided GitHub link is a clear strength for reproducibility.

major comments (2)
  1. [§4.2] §4.2 (OOD evaluation): The headline claim of outperformance on 6 of 13 modality categories rests on the assumption that the learned patch selector under hierarchical supervision surfaces all clinically relevant regions; however, no per-modality breakdown, failure-case analysis, or comparison of localized vs. diffuse pathology is provided, leaving open whether gains are an artifact of the test distribution rather than a general property.
  2. [§4.1] §4.1 and Table 2: The reported 44% compute reduction and competitive in-domain accuracy lack error bars, statistical significance tests, or details on out-of-domain split construction and baseline tuning; without these, it is impossible to assess whether post-hoc choices affect the comparison to UniverSeg.
minor comments (2)
  1. [Abstract and §3] The abstract and §3 could clarify the exact multi-level supervision losses and how patch selection thresholds are set during inference to improve reproducibility.
  2. [Figures] Figure captions for patch selection visualizations should include examples from diffuse-pathology cases to directly address potential context loss.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We sincerely thank the referee for their detailed and constructive feedback on our manuscript. We address each major comment below and commit to revisions that enhance the transparency and robustness of our evaluations.

  1. Referee: [§4.2] §4.2 (OOD evaluation): The headline claim of outperformance on 6 of 13 modality categories rests on the assumption that the learned patch selector under hierarchical supervision surfaces all clinically relevant regions; however, no per-modality breakdown, failure-case analysis, or comparison of localized vs. diffuse pathology is provided, leaving open whether gains are an artifact of the test distribution rather than a general property.

    Authors: We acknowledge the referee's concern that additional granularity is needed to substantiate the OOD claims. While the manuscript notes particular strength on localized pathologies such as OCT and dermoscopy, we agree that a per-modality breakdown and failure-case analysis would better demonstrate generality. In the revised version, we will add a supplementary table with per-modality Dice scores and compute a localized-vs-diffuse pathology comparison using the existing 35-dataset suite. We will also include representative success and failure visualizations to show that the patch selector focuses on clinically relevant regions across modalities. revision: yes

  2. Referee: [§4.1] §4.1 and Table 2: The reported 44% compute reduction and competitive in-domain accuracy lack error bars, statistical significance tests, or details on out-of-domain split construction and baseline tuning; without these, it is impossible to assess whether post-hoc choices affect the comparison to UniverSeg.

    Authors: We agree that statistical rigor and experimental transparency are essential. In the revision we will augment Table 2 with error bars (standard deviation over three independent runs), report p-values from paired statistical tests comparing PatchICL to UniverSeg, and expand §4.1 with explicit details on OOD split construction (random patient-level partitioning per modality) and the hyperparameter search performed for the UniverSeg baseline to ensure the comparison is fair and reproducible. revision: yes
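The kind of analysis promised in this response can be sketched as follows. This is an illustration only: the rebuttal does not name a specific test, so an exact paired sign-flip permutation test is swapped in here as one reasonable choice, the function names are hypothetical, and the Dice values are made-up placeholders rather than results from the paper.

```python
import itertools
import statistics


def mean_and_std(values):
    """Mean and sample standard deviation, e.g. over independent runs."""
    return statistics.mean(values), statistics.stdev(values)


def paired_sign_flip_pvalue(a, b):
    """Exact two-sided paired permutation test on per-case differences.

    Enumerates all 2**n sign assignments, so it only suits small n.
    """
    diffs = [x - y for x, y in zip(a, b)]
    observed = abs(sum(diffs))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, diffs)))
        # Small tolerance guards against floating-point round-off.
        if flipped >= observed - 1e-12:
            hits += 1
    return hits / total


# Placeholder per-dataset Dice scores, NOT taken from the paper.
method_a = [0.80, 0.82, 0.79, 0.85, 0.81]
method_b = [0.78, 0.80, 0.78, 0.84, 0.80]
p_value = paired_sign_flip_pvalue(method_a, method_b)

# Placeholder Dice from three hypothetical training runs.
run_mean, run_std = mean_and_std([0.80, 0.82, 0.81])
```

With only five paired cases the smallest attainable two-sided p-value is 2/32, which shows why the number of datasets or patients per modality limits how strong such a significance claim can be.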

Circularity Check

0 steps flagged

No circularity: empirical results rest on external baseline comparisons


The paper is an empirical ML contribution that proposes PatchICL and reports direct measurements of accuracy and compute against the external UniverSeg baseline on in-domain CT and 35 out-of-domain datasets. No derivation chain, equations, or first-principles results are present in the provided text that could reduce to fitted inputs or self-citations by construction. The 44% compute reduction is stated as a measured quantity at fixed resolution, and OOD performance is evaluated on held-out modalities rather than predicted from internal parameters. This is the normal case of a self-contained empirical paper whose central claims are falsifiable against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on standard supervised deep-learning assumptions plus the empirical observation that selective patching preserves accuracy; no new physical or mathematical axioms are introduced.



Reference graph

Works this paper leans on

10 extracted references · 10 canonical work pages

  1. Victor Ion Butoi, Jose Javier Gonzalez Ortiz, Tianyu Ma, Mert R. Sabuncu, John Guttag, and Adrian V. Dalca. UniverSeg: Universal medical image segmentation. arXiv preprint arXiv:2304.06131, 2023.

  2. Tommaso A. D’Antonoli et al. TotalSegmentator MRI: Robust segmentation of anatomic structures in MRI. arXiv preprint, 2025.

  3. Yunhe Gao et al. Show and segment: Universal medical image segmentation via in-context learning. arXiv preprint,

  4. Hu et al. Medverse: A universal model for medical image segmentation. arXiv preprint, 2025.

  5. Wouter Kool, Herke van Hoof, and Max Welling. Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement, 2019.

  6. Zeki Kus et al. MedSegBench: A comprehensive benchmark for medical image segmentation. arXiv preprint, 2024.

  7. Marianne Rakic et al. Tyche: Stochastic in-context learning for medical image segmentation. arXiv preprint, 2025.

  8. Marco Reisert et al. Deep neural patchworks: Coping with large segmentation tasks. In Medical Image Computing and Computer Assisted Intervention, 2022.

  9. Jakob Wasserthal et al. TotalSegmentator: Robust segmentation of 104 anatomic structures in CT images. Radiology: Artificial Intelligence, 2023.

  10. Xie et al. EICSeg: Universal medical image segmentation via efficient in-context learning. arXiv preprint, 2026.