pith. machine review for the scientific record.

arxiv: 2605.11265 · v1 · submitted 2026-05-11 · 💻 cs.CV · cs.AI · cs.LG

Recognition: no theorem link

DenseTRF: Texture-Aware Unsupervised Representation Adaptation for Surgical Scene Dense Prediction

Authors on Pith no claims yet

Pith reviewed 2026-05-13 06:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AI cs.LG
keywords texture-aware adaptation · slot attention · unsupervised domain adaptation · surgical scene segmentation · dense prediction · domain shift robustness · self-supervised representation learning

The pith

DenseTRF adapts surgical dense prediction models to new distributions by learning texture-aware representations through slot attention without any target supervision.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Surgical computer vision models for segmentation and zone prediction often fail when the visual appearance shifts due to different equipment, lighting, or procedures. DenseTRF addresses this by using slot attention to extract texture-centric features that stay consistent across those shifts. It then adapts the representations to the new data in a self-supervised way and conditions the dense prediction head on the resulting slots. Model merging strategies further stabilize the output. Experiments on multiple surgical procedures show better generalization than standard supervised models or other test-time adaptation techniques.
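
The slot-attention machinery this summary leans on is standard (Locatello et al., reference [13] below). A minimal PyTorch sketch of that module, assuming the input is a flattened map of texture features from some backbone; the paper's specific texture-centric extraction and the usual MLP residual are omitted:

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SlotAttention(nn.Module):
        """Minimal Slot Attention (Locatello et al., 2020)."""
        def __init__(self, num_slots: int, dim: int, iters: int = 3):
            super().__init__()
            self.num_slots, self.iters, self.scale = num_slots, iters, dim ** -0.5
            self.slots_mu = nn.Parameter(torch.randn(1, 1, dim))
            self.slots_logsigma = nn.Parameter(torch.zeros(1, 1, dim))
            self.to_q, self.to_k, self.to_v = (nn.Linear(dim, dim) for _ in range(3))
            self.gru = nn.GRUCell(dim, dim)
            self.norm_in, self.norm_slots = nn.LayerNorm(dim), nn.LayerNorm(dim)

        def forward(self, feats: torch.Tensor) -> torch.Tensor:
            # feats: (B, N, D) flattened texture features.
            b, n, d = feats.shape
            feats = self.norm_in(feats)
            k, v = self.to_k(feats), self.to_v(feats)
            slots = self.slots_mu + self.slots_logsigma.exp() * torch.randn(
                b, self.num_slots, d, device=feats.device)
            for _ in range(self.iters):
                q = self.to_q(self.norm_slots(slots))
                # Softmax over the slot axis: feature locations compete for slots.
                attn = F.softmax(torch.einsum('bkd,bnd->bkn', q, k) * self.scale, dim=1)
                attn = attn / attn.sum(dim=-1, keepdim=True)  # weighted mean per slot
                updates = torch.einsum('bkn,bnd->bkd', attn, v)
                slots = self.gru(updates.reshape(-1, d),
                                 slots.reshape(-1, d)).view(b, self.num_slots, d)
            return slots  # (B, K, D)

Each of the K slots ends up explaining a group of feature locations; DenseTRF's bet is that, fed texture-centric features, those groups track texture structure rather than domain-specific appearance.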

Core claim

DenseTRF is a self-supervised framework that applies slot attention to texture-centric features to capture invariant visual structures, then adapts these representations to the target distribution and conditions dense prediction outputs on the adapted slots together with model merging, yielding improved cross-distribution performance on surgical dense prediction tasks.

What carries the argument

Slot attention applied to texture-centric features, used to condition dense prediction heads and combined with model merging for unsupervised target adaptation.
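
How the dense head is conditioned on slots is not pinned down at this level of detail. One plausible sketch, an assumption rather than the paper's exact design: pixels cross-attend to the K slots, and the head decodes pixel features concatenated with the attended slot context.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class SlotConditionedHead(nn.Module):
        def __init__(self, dim: int, num_classes: int):
            super().__init__()
            self.to_q, self.to_k = nn.Linear(dim, dim), nn.Linear(dim, dim)
            self.classifier = nn.Conv2d(2 * dim, num_classes, kernel_size=1)

        def forward(self, feat_map: torch.Tensor, slots: torch.Tensor) -> torch.Tensor:
            # feat_map: (B, D, H, W) backbone features; slots: (B, K, D).
            b, d, h, w = feat_map.shape
            pixels = feat_map.flatten(2).transpose(1, 2)            # (B, HW, D)
            attn = F.softmax(self.to_q(pixels) @ self.to_k(slots).transpose(1, 2)
                             * d ** -0.5, dim=-1)                   # (B, HW, K)
            context = (attn @ slots).transpose(1, 2).view(b, d, h, w)
            return self.classifier(torch.cat([feat_map, context], dim=1))  # logits

Reconstruction and supervised dense prediction losses would then be applied on top of such a head, as the Figure 1 caption describes.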

Load-bearing premise

Slot attention on texture features will reliably extract visual structures that remain stable when surgical scenes change in appearance or equipment.

What would settle it

Running the method on a held-out surgical dataset with clear domain shift and observing no accuracy gain or a drop relative to the unadapted baseline would falsify the central claim.
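
The check is mechanical to run. A sketch of the comparison, with base_model, adapted_model, shifted_loader, and the class count as placeholders; models are assumed to return per-class logits at mask resolution:

    import torch

    @torch.no_grad()
    def mean_iou(model, loader, num_classes: int, device: str = "cuda") -> float:
        inter = torch.zeros(num_classes, device=device)
        union = torch.zeros(num_classes, device=device)
        model.eval().to(device)
        for images, masks in loader:
            preds = model(images.to(device)).argmax(dim=1)
            masks = masks.to(device)
            for c in range(num_classes):
                p, t = preds == c, masks == c
                inter[c] += (p & t).sum()
                union[c] += (p | t).sum()
        return (inter / union.clamp(min=1)).mean().item()

    # gain = mean_iou(adapted_model, shifted_loader, C) - mean_iou(base_model, shifted_loader, C)
    # gain <= 0 on a clearly domain-shifted held-out dataset would falsify the claim.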

Figures

Figures reproduced from arXiv: 2605.11265 by Daniel A. Hashimoto, Guiqiu Liao, Matjaž Jogan.

Figure 1
Figure 1. Architecture of the proposed network, where the dense head is conditioned on object-centric representations learned via Slot Attention and optimized using both reconstruction and supervised dense prediction objectives. view at source ↗
Figure 2
Figure 2. (a) Slot Attention (SA) adaptation via periodic model merging, which anchors the adapted model to the base representation during specialization; a minimal merging sketch follows the figure list below. (b) Example dense prediction tasks across the Thoracic, POEM, and Cholec surgical datasets. view at source ↗
Figure 3
Figure 3. Comparison with state-of-the-art methods across three surgical datasets (Thoracic, Per Oral Endoscopic Myotomy [POEM], and Cholec80) under increasing data-usage ratios from 1% to 5%, evaluated with IoU (%), Hausdorff Distance (HD, pixels), and DICE score; an HD sketch also follows below. view at source ↗
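
The periodic merging in Figure 2(a) reads most simply as weight interpolation toward the frozen base model; the sketch below assumes exactly that (the paper may merge differently), with merge_every and alpha as illustrative placeholders rather than the paper's values:

    import torch

    @torch.no_grad()
    def merge_toward_base(adapted: torch.nn.Module, base: torch.nn.Module, alpha: float = 0.3):
        # Pull adapted weights back toward the anchor: w <- alpha*w_base + (1-alpha)*w.
        base_params = dict(base.named_parameters())
        for name, p in adapted.named_parameters():
            p.lerp_(base_params[name], alpha)  # in-place interpolation toward base

    # During self-supervised adaptation on target data:
    # for step, batch in enumerate(target_loader):
    #     loss = reconstruction_loss(adapted_model, batch)
    #     loss.backward(); optimizer.step(); optimizer.zero_grad()
    #     if step % merge_every == 0:
    #         merge_toward_base(adapted_model, base_model, alpha=0.3)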
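Of Figure 3's metrics, HD is the least standard to implement; a minimal sketch of the symmetric Hausdorff distance over mask boundaries (the paper's exact boundary extraction is an assumption, and non-empty masks are assumed):

    import numpy as np
    from scipy.ndimage import binary_erosion
    from scipy.spatial.distance import directed_hausdorff

    def boundary_points(mask: np.ndarray) -> np.ndarray:
        edge = mask & ~binary_erosion(mask)  # 1-pixel-wide boundary ring
        return np.argwhere(edge)             # (N, 2) pixel coordinates

    def hausdorff(pred: np.ndarray, target: np.ndarray) -> float:
        p, t = boundary_points(pred.astype(bool)), boundary_points(target.astype(bool))
        return max(directed_hausdorff(p, t)[0], directed_hausdorff(t, p)[0])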
read the original abstract

Dense prediction tasks in surgical computer vision, such as segmentation and surgical zone prediction, can provide valuable guidance for laparoscopic and robotic surgery. However, these models often suffer from distribution shifts, as training datasets rarely cover the variability encountered during deployment, leading to poor generalization. We propose DenseTRF, a self-supervised representation adaptation framework based on texture-centric attention. Our method leverages slot attention to learn texture-aware representations that capture invariant visual structures. By adapting these representations to the target distribution without supervision, DenseTRF significantly improves robustness to domain shifts. The framework is implemented through conditioning dense prediction on slot attention and model merging strategies. Experiments across multiple surgical procedures demonstrate improved cross-distribution generalization in comparison to state-of-the-art segmentation models and test-distribution adaptation methods for dense prediction tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes DenseTRF, a self-supervised framework for adapting texture-centric representations via slot attention for dense prediction tasks (segmentation, zone prediction) in surgical scenes. It conditions predictions on learned slots plus model merging to improve robustness to distribution shifts across procedures, lighting, and cameras without target supervision, claiming superior cross-distribution generalization over SOTA segmentation and test-time adaptation baselines.

Significance. If the invariance of the slot embeddings and the reported gains hold under proper controls, the work would offer a practical route to unsupervised domain adaptation in surgical computer vision, where labeled target data is expensive and domain shifts are routine. The texture-aware slot mechanism and merging strategy could generalize to other dense prediction settings with limited supervision.

major comments (2)
  1. [Abstract, §3] Abstract and §3 (method): the central claim that slot attention on texture features produces representations whose statistics remain stable across domain shifts is unsupported; no domain-gap statistic (MMD, CORAL, or feature-space distance) on the slot embeddings before versus after adaptation is reported, so downstream task gains alone cannot confirm invariance rather than re-weighting of source-specific cues (a sketch of both statistics follows these comments).
  2. [§4] §4 (experiments): the cross-procedure generalization results lack an ablation isolating the contribution of the texture-centric slot attention versus the model-merging component; without this, it is unclear whether the reported improvements actually rest on the proposed invariance mechanism.
minor comments (2)
  1. [§3] Notation for the slot attention module and conditioning step should be introduced with explicit equations rather than prose descriptions only.
  2. [Figures 2-4] Figure captions should state the exact datasets, number of runs, and statistical test used for the reported improvements.
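
The domain-gap statistics named in major comment 1 are cheap to compute on pooled slot embeddings. A sketch of both, using a biased RBF-MMD estimator and the standard CORAL distance; the bandwidth sigma is an illustrative choice:

    import torch

    def rbf_mmd2(x: torch.Tensor, y: torch.Tensor, sigma: float = 1.0) -> torch.Tensor:
        # Biased squared MMD with an RBF kernel; x: (N, D), y: (M, D) slot embeddings.
        k = lambda a, b: torch.exp(-torch.cdist(a, b) ** 2 / (2 * sigma ** 2))
        return k(x, x).mean() + k(y, y).mean() - 2 * k(x, y).mean()

    def coral(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
        # CORAL distance: squared Frobenius gap between feature covariances / (4 d^2).
        d = x.shape[1]
        return ((torch.cov(x.T) - torch.cov(y.T)) ** 2).sum() / (4 * d * d)

Support for invariance would be both statistics shrinking between source and target slots after adaptation relative to before, which is what the rebuttal below reports adding.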

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and have made revisions to the manuscript to strengthen the presentation of our claims.

read point-by-point responses
  1. Referee: [Abstract, §3] Abstract and §3 (method): the central claim that slot attention on texture features produces representations whose statistics remain stable across domain shifts is unsupported; no domain-gap statistic (MMD, CORAL, or feature-space distance) on the slot embeddings before versus after adaptation is reported, so downstream task gains alone cannot confirm invariance rather than re-weighting of source-specific cues.

    Authors: We agree that explicit quantification of distribution shift on the slot embeddings would provide stronger support for the invariance claim. In the revised manuscript we have added MMD and CORAL distances computed between source and target slot embeddings both before and after adaptation. These statistics show a consistent reduction in domain gap for the adapted slots, which is not observed to the same degree in the baseline features. This evidence indicates that the performance gains arise from stabilized slot statistics rather than simple re-weighting of source cues. revision: yes

  2. Referee: [§4] §4 (experiments): the cross-procedure generalization results lack an ablation isolating the contribution of the texture-centric slot attention versus the model-merging component; without this, it is unclear whether the reported improvements actually rest on the proposed invariance mechanism.

    Authors: We acknowledge the value of isolating the two components. The revised §4 now includes an ablation study that disables the texture-centric slot attention while retaining model merging (and vice versa). The results demonstrate that removing slot attention leads to a larger drop in cross-procedure performance than removing merging alone, confirming that the invariance mechanism contributes substantially to the reported gains. revision: yes

Circularity Check

0 steps flagged

No circularity detected; derivation self-contained at abstract level

full rationale

The provided abstract and context describe a self-supervised slot-attention framework for texture-aware representation adaptation, claiming improved robustness to domain shifts via unsupervised adaptation to target distributions. No equations, parameter-fitting steps, self-citations, or uniqueness theorems appear in the text. The central claim rests on empirical cross-procedure experiments rather than any derivation that reduces by construction to fitted inputs or self-defined quantities. Without explicit mathematical reductions or load-bearing self-references, the chain is self-contained and does not exhibit circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

Only abstract available so ledger is minimal; central claim rests on the unverified assumption that texture structures learned via slot attention are domain-invariant.

axioms (1)
  • domain assumption Texture-centric features learned by slot attention capture visual structures invariant to surgical domain shifts
    Stated in abstract as the basis for unsupervised adaptation without target labels
invented entities (1)
  • DenseTRF framework no independent evidence
    purpose: Self-supervised representation adaptation for surgical dense prediction
    Introduced as the proposed method combining slot attention and model merging

pith-pipeline@v0.9.0 · 5437 in / 1196 out tokens · 49630 ms · 2026-05-13T06:38:44.151296+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 1 internal anchor

  1. [1]

    In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: 2021 IEEE/CVF International Conference on Computer Vision, ICCV 2021, Montreal, QC, Canada, October 10-17, 2021. pp. 9630–9640 (2021)

  2. [2]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

    Cheng, B., Misra, I., Schwing, A.G., Kirillov, A., Girdhar, R.: Masked-attention mask transformer for universal image segmentation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 1280–1289 (2022)

  3. [3]

    In: The Thirteenth International Conference on Learning Representations (2025)

    Didolkar, A., Zadaianchuk, A., Goyal, A., Mozer, M., Bengio, Y., Martius, G., Seitzer, M.: On the transfer of object-centric representation learning. In: The Thirteenth International Conference on Learning Representations (2025)

  4. [4]

    In: CVPR (2024)

    Fu, Y., Lou, M., Yu, Y.: Segman: Omni-scale contextual modeling for semantic segmentation. In: CVPR (2024)

  5. [5]

    Annals of Surgery 268(1), 70–76 (2018)

    Hashimoto, D.A., Rosman, G., Rus, D., Meireles, O.R.: Artificial intelligence in surgery: promises and perils. Annals of Surgery 268(1), 70–76 (2018)

  6. [6]

    In: MICCAI (2022)

    Hatamizadeh, A., et al.: Swin unetr: Swin transformers for semantic segmentation of brain tumors in mri images. In: MICCAI (2022)

  7. [7]

    In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016

    He, K., Zhang, X., Ren, S., Sun, J.: Deep residual learning for image recognition. In: 2016 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016. pp. 770–778 (2016)

  8. [8]

    CholecSeg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on Cholec80. arXiv preprint arXiv:2012.12453 (2020)

    Hong, W.Y., Kao, C.L., Kuo, Y.H., Wang, J.R., Chang, W.L., Shih, C.S.: CholecSeg8k: a semantic segmentation dataset for laparoscopic cholecystectomy based on Cholec80. arXiv preprint arXiv:2012.12453 (2020)

  9. [9]

    Journal of Pathology Informatics 7(1), 29 (2016)

    Janowczyk, A., Madabhushi, A.: Deep learning for digital pathology image analysis: A comprehensive tutorial with selected use cases. Journal of Pathology Informatics 7(1), 29 (2016)

  10. [10]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024

    Kakogeorgiou, I., Gidaris, S., Karantzalos, K., Komodakis, N.: SPOT: self-training with patch-order permutation for object-centric learning with autoregressive transformers. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024. pp. 22776–22786 (2024)

  11. [11]

    In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023

    Kirillov, A., Mintun, E., Ravi, N., Mao, H., Rolland, C., Gustafson, L., Xiao, T., Whitehead, S., Berg, A.C., Lo, W., Dollár, P., Girshick, R.B.: Segment anything. In: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023. pp. 3992–4003 (2023)

  12. [12]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

    Liao, G., Jogan, M., Eaton, E., Hashimoto, D.A.: Forla: Federated object-centric representation learning with slot attention. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

  13. [13]

    In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020)

    Locatello, F., Weissenborn, D., Unterthiner, T., Mahendran, A., Heigold, G., Uszkoreit, J., Dosovitskiy, A., Kipf, T.: Object-centric learning with slot attention. In: Advances in Neural Information Processing Systems 33: Annual Conference on Neural Information Processing Systems 2020, NeurIPS 2020, December 6-12, 2020, virtual (2020)

  14. [14]

    Annals of Surgery 276(2), 363–369 (2022)

    Madani, A., Namazi, B., Altieri, M.S., Hashimoto, D.A., Rivera, A.M., Pucher, P.H., Navarrete-Welton, A., Sankaranarayanan, G., Brunt, L.M., Okrainec, A., et al.: Artificial intelligence for intraoperative guidance: using semantic segmentation to identify surgical anatomy during laparoscopic cholecystectomy. Annals of Surgery 276(2), 363–369 (2022)

  15. [15]

    Nature Methods 21(2), 195–212 (2024)

    Maier-Hein, L., Reinke, A., Godau, P., Tizabi, M.D., Buettner, F., Christodoulou, E., Glocker, B., Isensee, F., Kleesiek, J., Kozubek, M., et al.: Metrics reloaded: recommendations for image analysis validation. Nature Methods 21(2), 195–212 (2024)

  16. [16]

    In: European Conference on Computer Vision

    Nguyen, M., Wang, A.Q., Kim, H., Sabuncu, M.R.: Adapting to shifting correlations with unlabeled data calibration. In: European Conference on Computer Vision. pp. 230–246. Springer (2024)

  17. [17]

    In: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA

    Prabhudesai, M., Goyal, A., Paul, S., van Steenkiste, S., Sajjadi, M.S.M., Aggarwal, G., Kipf, T., Pathak, D., Fragkiadaki, K.: Test-time adaptation with slot-centric models. In: International Conference on Machine Learning, ICML 2023, 23-29 July 2023, Honolulu, Hawaii, USA. vol. 202, pp. 28151–28166 (2023)

  18. [18]

    In: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., Krueger, G., Sutskever, I.: Learning transferable visual models from natural language supervision. In: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event. vol. 139, pp. 8748–8...

  19. [19]

    In: MICCAI (2015)

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: MICCAI (2015)

  20. [20]

    In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 (2023)

    Seitzer, M., Horn, M., Zadaianchuk, A., Zietlow, D., Xiao, T., Simon-Gabriel, C., He, T., Zhang, Z., Schölkopf, B., Brox, T., Locatello, F.: Bridging the gap to real-world object-centric learning. In: The Eleventh International Conference on Learning Representations, ICLR 2023, Kigali, Rwanda, May 1-5, 2023 (2023)

  21. [21]

    DINOv3

    Siméoni, O., Vo, H.V., Seitzer, M., Baldassarre, F., Oquab, M., Jose, C., Khalidov, V., Szafraniec, M., Yi, S., Ramamonjisoa, M., et al.: DINOv3. arXiv preprint arXiv:2508.10104 (2025)

  22. [22]

    IEEE Transactions on Medical Imaging 36(1), 86–97 (2016)

    Twinanda, A.P., Shehata, S., Mutter, D., Marescaux, J., De Mathelin, M., Padoy, N.: EndoNet: a deep architecture for recognition tasks on laparoscopic videos. IEEE Transactions on Medical Imaging 36(1), 86–97 (2016)

  23. [23]

    In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (2021)

    Wang, D., Shelhamer, E., Liu, S., Olshausen, B.A., Darrell, T.: Tent: Fully test-time adaptation by entropy minimization. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021 (2021)

  24. [24]

    In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022

    Wang, Q., Fink, O., Gool, L.V., Dai, D.: Continual test-time domain adaptation. In: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2022, New Orleans, LA, USA, June 18-24, 2022. pp. 7191–7201 (2022)

  25. [25]

    Xie, E., Wang, W., Yu, Z., Anandkumar, A., Alvarez, J.M., Luo, P.: SegFormer: Simple and efficient design for semantic segmentation with transformers. In: Advances in Neural Information Processing Systems 34: Annual Conference on Neural Information Processing Systems 2021, NeurIPS 2021, D...

  26. [26]

    IEEE Transactions on Medical Imaging 43(1), 4–14 (2024)

    Xie, Q., Li, Y., He, N., Ning, M., Ma, K., Wang, G., Lian, Y., Zheng, Y.: Unsupervised domain adaptation for medical image segmentation by disentanglement learning and self-training. IEEE Transactions on Medical Imaging 43(1), 4–14 (2024)

  27. [27]

    In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)

    Yang, E., Tang, A., Shen, L., Guo, G., Wang, X., Cao, X., Zhang, J.: Continual model merging without data: Dual projections for balancing stability and plasticity. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2025)