
arxiv: 2606.11032 · v1 · pith:DZUIZONFnew · submitted 2026-06-09 · 💻 cs.CV

U-TTT: Towards Generalizable PET Image Denoising via Test-Time Training

Pith reviewed 2026-06-27 13:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords PET image denoising · test-time training · generalization · distribution shift · U-shaped network · self-supervision · medical imaging

The pith

U-TTT adapts a U-shaped network at test time via self-supervision to denoise PET scans from unseen doses and scanners.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Deep learning PET denoisers typically degrade when test scans come from different radiation doses or scanner hardware than the training data. U-TTT embeds test-time training layers inside a U-shaped architecture so that the model can update its own parameters during inference using a self-supervised signal extracted from the current test scan. Separate spatial and frequency-domain training layers let it correct structural degradations and suppress noise spectra while restoring fine details. The approach is intended to remove the need for retraining when clinical conditions change.

Core claim

U-TTT is a U-shaped model that integrates Spatial Test-Time Training (S-TTT) and Frequency Test-Time Training (F-TTT) layers; these layers dynamically adjust network parameters at inference time through self-supervision on each individual test instance, allowing the model to adapt to variations in dose levels and scanner types and thereby improve denoising under distribution shift.

What carries the argument

Dual-domain test-time training mechanism consisting of an S-TTT layer that corrects spatial structural degradations and an F-TTT layer that suppresses global noise spectra while restoring high-frequency details, both performing per-scan self-supervised parameter updates.
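The update rules of S-TTT and F-TTT are not reproduced on this page, so the following is only a minimal sketch of the general idea behind a test-time training layer, loosely in the style of the TTT layers of Sun et al. (2024), which the paper cites. It assumes the layer's state is a weight matrix that takes one gradient step per token on a masked-reconstruction loss. The function name `ttt_linear_layer`, the masking corruption, the identity initialisation, and the values of `lr` and `mask_prob` are all hypothetical choices made for illustration, not details from the paper.

```python
import numpy as np

def ttt_linear_layer(tokens, lr=0.05, mask_prob=0.3, seed=0):
    """Toy test-time training layer (illustrative, NOT the paper's S-TTT/F-TTT).

    tokens: array of shape (T, d), feature vectors taken from ONE test scan.
    The layer's hidden state is a weight matrix W. For each token we
      1. build a corrupted view by randomly zeroing entries,
      2. take one gradient step on ||W @ corrupted - token||^2,
      3. emit W @ token using the freshly updated W.
    """
    rng = np.random.default_rng(seed)
    T, d = tokens.shape
    W = np.eye(d)                                # assumed initialisation
    outputs = np.empty_like(tokens, dtype=float)
    losses = []
    for t in range(T):
        x = tokens[t]
        keep = rng.random(d) > mask_prob         # True where the entry survives
        x_corrupt = x * keep                     # self-supervised input view
        err = W @ x_corrupt - x                  # reconstruction error
        losses.append(float(err @ err))
        grad = 2.0 * np.outer(err, x_corrupt)    # gradient of the loss w.r.t. W
        W = W - lr * grad                        # per-instance parameter update
        outputs[t] = W @ x
    return outputs, W, losses
```

With `lr=0.0` the sketch reduces to a fixed-parameter identity layer, which is a convenient way to see what the per-instance update contributes.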

If this is right

  • U-TTT reaches state-of-the-art denoising accuracy on standard PET benchmarks.
  • Performance remains high on test scans acquired at dose levels absent from the training distribution.
  • Performance remains high on test scans acquired on scanner types absent from the training distribution.
  • The combination of spatial and frequency adaptation handles the three-dimensional degradations present in PET volumes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same per-scan adaptation strategy could be tested on other modalities that suffer from scanner-to-scanner variation, such as CT or MRI.
  • If the self-supervision proves stable, hospitals could deploy a single trained model across multiple scanner models without site-specific retraining.
  • The frequency-domain layer may offer a route to preserve small lesions in low-dose protocols where spatial-only adaptation tends to blur detail.
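To make the spatial versus frequency distinction in the last point concrete, here is a small sketch that splits a 3D volume into low- and high-frequency components with an FFT and a radial mask. It is not the paper's F-TTT layer; the function name `split_frequency_bands` and the `cutoff` value are assumptions chosen for illustration.

```python
import numpy as np

def split_frequency_bands(volume, cutoff=0.25):
    """Split a 3D volume into low- and high-frequency parts.

    cutoff is a radial frequency in cycles per voxel (Nyquist is 0.5 per axis);
    the default is an arbitrary illustrative choice.
    """
    # Per-axis frequency grids, combined into a radial frequency per FFT bin.
    freqs = np.meshgrid(*[np.fft.fftfreq(n) for n in volume.shape], indexing="ij")
    radius = np.sqrt(sum(f ** 2 for f in freqs))
    low_mask = radius <= cutoff

    spectrum = np.fft.fftn(volume)
    low = np.fft.ifftn(spectrum * low_mask).real    # smooth structure, mean level
    high = np.fft.ifftn(spectrum * ~low_mask).real  # fine detail and most noise
    return low, high
```

The two parts sum back to the input, so any detail removed from one band is by construction present in the other. Small, sharp structures and voxel-level noise both end up mostly in the `high` band, which is why preserving one while suppressing the other is hard.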

Load-bearing premise

A self-supervised signal derived from each individual test scan is sufficient to drive meaningful parameter updates that improve denoising without introducing artifacts or overfitting to noise in that scan.
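One way to see what this premise demands is a toy adaptation loop on a single noisy signal. The sketch below swaps in a Noise2Self-style blind-spot filter (each sample is predicted from its neighbours only, so copying the noise cannot lower the loss) and stops early using samples held out from the same signal. This is an illustrative stand-in, not the paper's objective; the function names, the two-weight filter, and all hyperparameters are assumptions.

```python
import numpy as np

def adapt_blind_spot_filter(noisy, steps=200, lr=0.05, patience=10,
                            holdout_frac=0.2, seed=0):
    """Fit a 2-weight blind-spot filter to ONE noisy 1D signal with early stopping.

    pred[i] = w[0] * (x[i-1] + x[i+1]) + w[1] * (x[i-2] + x[i+2])
    The prediction never sees x[i] itself.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(noisy, dtype=float)
    # Neighbour features and targets for interior samples 2 .. n-3.
    feats = np.stack([x[1:-3] + x[3:-1], x[:-4] + x[4:]], axis=1)
    target = x[2:-2]
    held = rng.random(target.size) < holdout_frac   # held-out part of the same signal
    w = np.array([0.5, 0.0])                        # start from a plain neighbour average
    best_w, best_val, bad = w.copy(), np.inf, 0
    for _ in range(steps):
        val = np.mean((feats[held] @ w - target[held]) ** 2)
        if val < best_val - 1e-9:
            best_w, best_val, bad = w.copy(), val, 0
        else:
            bad += 1
            if bad >= patience:                     # held-out loss stopped improving
                break
        resid = feats[~held] @ w - target[~held]
        w = w - lr * 2.0 * feats[~held].T @ resid / resid.size
    return best_w, best_val

def apply_filter(noisy, w):
    """Apply the fitted filter; the two samples at each edge are left unchanged."""
    x = np.asarray(noisy, dtype=float)
    out = x.copy()
    out[2:-2] = w[0] * (x[1:-3] + x[3:-1]) + w[1] * (x[:-4] + x[4:])
    return out
```

A model with many adaptable parameters has far more room to fit the noise realization than this two-weight filter does, which is where the premise is most likely to be tested.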

What would settle it

On a test set of PET scans acquired on a scanner absent from training data, if the error metrics produced by U-TTT are no better than those of an otherwise identical fixed-parameter baseline model, the generalization claim is falsified.
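The decision rule above can be sketched as a paired comparison of per-scan errors. The helper below is hypothetical: the name `generalization_check`, the `margin` parameter, and the returned keys are illustrative, and the error metric itself (for example a per-scan normalized RMSE) is left to the caller.

```python
from statistics import mean

def generalization_check(errors_adaptive, errors_fixed, margin=0.0):
    """Compare per-scan errors of an adaptive model and a fixed-parameter baseline.

    Both lists must be aligned (entry i refers to the same unseen-scanner scan)
    and use a lower-is-better error metric.
    """
    if len(errors_adaptive) != len(errors_fixed) or not errors_adaptive:
        raise ValueError("need two non-empty, aligned lists of per-scan errors")
    # Positive difference means the adaptive model had the lower error on that scan.
    diffs = [fixed - adaptive for adaptive, fixed in zip(errors_adaptive, errors_fixed)]
    mean_gain = mean(diffs)
    return {
        "mean_gain": mean_gain,
        "frac_improved": sum(d > 0 for d in diffs) / len(diffs),
        "claim_survives": mean_gain > margin,
    }
```

For example, `generalization_check([0.10, 0.12, 0.09], [0.15, 0.14, 0.13])` reports that every scan improved, while identical error lists yield `claim_survives == False`.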

Figures

Figures reproduced from arXiv: 2606.11032 by Bingzheng Wei, Hao Lu, Hui Zhang, Jiayin Li, Yan Xu, Zhiwen Yang, Zihua Wang.

Figure 1
Figure 1: Overview of the proposed U-TTT.
… and Frequency Test-Time Training (F-TTT) blocks for feature extraction. These blocks enable the model to dynamically update its parameters at test time and adapt to the test data, thereby improving generalizability. Finally, a 3 × 3 × 3 output-projection convolutional layer transforms the deep feature I_d into a residual image I_r ∈ ℝ^(D×H×W×1), which is added to the original low-d…
Figure 2
Figure 2: Visualization comparison on the in-distribution base dataset at DRF=12.
3 Experiments and Results. 3.1 Dataset. To demonstrate the effectiveness and generalizability of the proposed U-TTT, we establish four distinct whole-body PET datasets (D1–D4) with diverse characteristics, as summarized in Tab. 1. For each dataset, we first collect full-dose PET data from patients in list-mode. Corresponding low-dose PE…
Original abstract

Existing deep learning models for Positron Emission Tomography (PET) image denoising often suffer from severe performance degradation under distribution shifts, fundamentally restricting their robust clinical deployment. This lack of generalization stems from the conventional paradigm of fixed-parameter models that cannot adapt to variations in test data (e.g., dose levels or scanner types) after training. To overcome this limitation and achieve robust generalization, we introduce U-TTT, a novel U-shaped model that integrates Test-Time Training (TTT) layers to dynamically adjust model parameters during inference through self-supervision, thereby adapting to the specific characteristics of each test instance. Furthermore, to comprehensively capture the complex degradations of 3D PET data, U-TTT features a dual-domain adaptation mechanism comprising a Spatial Test-Time Training (S-TTT) layer and a Frequency Test-Time Training (F-TTT) layer. The S-TTT layer captures and corrects spatial structural degradations, while the F-TTT layer suppresses global noise spectra and restores delicate high-frequency details. Extensive experiments demonstrate that U-TTT achieves state-of-the-art PET denoising performance and exhibits superior generalization under challenging distribution shifts, including both unseen dose levels and unseen scanners. Our code will be available at https://github.com/Yaziwel/U-TTT.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces U-TTT, a U-shaped network for PET denoising that augments a base model with Spatial Test-Time Training (S-TTT) and Frequency Test-Time Training (F-TTT) layers. These layers are adapted at inference time on each test scan via a self-supervised objective, with the goal of correcting spatial structural degradations and frequency-domain noise spectra. The central claim is that this per-instance adaptation yields state-of-the-art denoising performance and superior generalization to unseen dose levels and scanner types relative to fixed-parameter baselines.

Significance. If the self-supervised updates reliably target distribution shift rather than scan-specific noise, the approach would meaningfully advance robust clinical deployment of DL denoisers. The dual-domain design is a plausible way to address both local structure and global spectral characteristics of PET degradations, but its value hinges on empirical demonstration that the adaptation step improves rather than degrades output quality under the claimed shifts.

major comments (2)
  1. [Method (TTT layer description)] The self-supervised loss driving the S-TTT and F-TTT parameter updates is not described with sufficient detail (no equation or pseudocode is visible) to assess whether it supplies a gradient aligned with distribution shift or simply fits the noise realization in the single test volume. Without an explicit loss definition, regularization term, or early-stopping rule, the load-bearing claim that adaptation improves generalization cannot be evaluated.
  2. [Experiments] The experiments section asserts SOTA performance and superior generalization under dose and scanner shifts, yet no quantitative tables, error bars, ablation studies isolating S-TTT/F-TTT, or statistical tests are provided in the manuscript text. This absence prevents verification that the per-scan updates produce the claimed gains rather than artifacts, directly undermining the generalization result.
minor comments (2)
  1. [Abstract] The abstract promises a code release and gives a repository URL (https://github.com/Yaziwel/U-TTT), but it does not say whether the repository is populated at submission time; the availability status should be stated explicitly.
  2. [Method] Notation for the dual-domain layers (S-TTT vs. F-TTT) is introduced without a clear diagram or pseudocode showing how they are inserted into the U-Net backbone; a figure would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for clarification in our manuscript on U-TTT. We agree that additional details are needed on both the self-supervised loss and the experimental results to fully support the claims. We will revise the manuscript to incorporate these elements and respond point-by-point below.

  1. Referee: [Method (TTT layer description)] The self-supervised loss driving the S-TTT and F-TTT parameter updates is not described with sufficient detail (no equation or pseudocode is visible) to assess whether it supplies a gradient aligned with distribution shift or simply fits the noise realization in the single test volume. Without an explicit loss definition, regularization term, or early-stopping rule, the load-bearing claim that adaptation improves generalization cannot be evaluated.

    Authors: We agree that the self-supervised loss was not described with sufficient detail. In the revised manuscript, we will add the explicit loss equation for both S-TTT and F-TTT (a consistency-based self-supervised objective on augmented views of the test volume), the full pseudocode for the test-time adaptation procedure, the regularization terms applied to the layer updates, and the early-stopping rule based on validation loss on a held-out portion of the test scan. These additions will clarify how the gradients target structural and spectral distribution shifts rather than scan-specific noise. revision: yes

  2. Referee: [Experiments] The experiments section asserts SOTA performance and superior generalization under dose and scanner shifts, yet no quantitative tables, error bars, ablation studies isolating S-TTT/F-TTT, or statistical tests are provided in the manuscript text. This absence prevents verification that the per-scan updates produce the claimed gains rather than artifacts, directly undermining the generalization result.

    Authors: We acknowledge that the manuscript text currently lacks the requested quantitative elements. In the revision, we will include full result tables with PSNR/SSIM metrics and error bars across multiple subjects, ablation studies isolating S-TTT and F-TTT contributions, and statistical tests (e.g., paired t-tests with p-values) comparing U-TTT against baselines under the dose and scanner shifts. These will be added to the experiments section to substantiate the generalization claims. revision: yes
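The metrics and tests named in this response are standard and easy to pin down. The sketch below shows PSNR over flattened intensity lists and a paired t statistic over per-subject scores, using only the Python standard library. It is an illustrative helper, not the authors' evaluation code; the function names and the choice to pass `data_range` explicitly are assumptions, and SSIM and the p-value lookup are omitted.

```python
import math
from statistics import mean, stdev

def psnr(reference, estimate, data_range):
    """Peak signal-to-noise ratio in dB between two equal-length intensity lists."""
    mse = mean((r - e) ** 2 for r, e in zip(reference, estimate))
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(data_range ** 2 / mse)

def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for per-subject scores.

    Entry i of both lists must come from the same subject.
    """
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    if n < 2:
        raise ValueError("need at least two paired scores")
    sd = stdev(diffs)                  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    return mean(diffs) / (sd / math.sqrt(n)), n - 1
```

The t statistic and degrees of freedom can then be turned into a p-value with any t-distribution table or statistics package. Pairing matters here because per-subject image quality varies far more between patients than between methods.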

Circularity Check

0 steps flagged

No circularity: empirical method with no load-bearing derivations or self-referential fits


The paper describes an empirical architecture (U-shaped network with inserted S-TTT and F-TTT layers) whose performance claims rest on experimental results under distribution shifts rather than any closed-form derivation. No equations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or method description; the self-supervised update on each test scan is presented as an independent mechanism whose validity is asserted via measured denoising metrics, not by construction from the same inputs. This is the normal case of a self-contained empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract states no explicit free parameters, axioms, or invented entities; the method implicitly assumes that test-time self-supervision can be formulated without additional labeled data or external supervision signals.


