pith. machine review for the scientific record.

arxiv: 2604.04552 · v3 · submitted 2026-04-06 · 💻 cs.CV · cs.AI

Recognition: no theorem link

StableTTA: Improving Vision Model Performance by Training-free Test-Time Adaptation Methods

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 19:36 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords test-time adaptation · ensemble methods · logit aggregation · vision models · training-free · coherent-batch inference · ImageNet · prediction stability

The pith

A training-free method stabilizes logit aggregation to improve vision model accuracy at test time without retraining.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper identifies an instability in ensemble prediction aggregation caused by nonlinear projections and voting. It introduces StableTTA with two variants to address this while keeping methods training-free and low-cost. StableTTA-I applies variance-aware logit aggregation when inputs arrive in coherent batches where nearby samples tend to share a class, as in video or robotics. StableTTA-II adds feature-level cropping so aggregation needs only one forward pass on a single model. If these hold, they would let existing vision models gain accuracy in practical settings with far less memory and compute than full ensembles.

Core claim

StableTTA addresses both efficiency challenges and aggregation inconsistency by providing two training-free test-time adaptation variants. StableTTA-I targets coherent-batch inference settings and improves prediction consistency and accuracy through variance-aware logit aggregation. StableTTA-II establishes feature-level cropping that enables efficient logit aggregation with a single forward pass on a single model backbone. Experiments on ImageNet-1K across 71 models show StableTTA-I consistently improves accuracy under coherent-batch inference, while StableTTA-II delivers lightweight, architecture-agnostic gains with minimal overhead.

What carries the argument

Variance-aware logit aggregation combined with feature-level cropping, which together stabilize ensemble outputs and reduce the number of forward passes needed during inference.
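The abstract does not spell out the aggregation rule, but one minimal sketch of what "variance-aware logit aggregation" could look like — the function name and the deviation-based weighting are our assumptions, not the authors' formulation — is to down-weight batch members whose logits stray furthest from the batch mean:

```python
import numpy as np

def variance_aware_aggregate(logits, eps=1e-8):
    """Hypothetical sketch: aggregate logits from a coherent batch,
    down-weighting samples whose logits deviate most from the batch mean.
    logits: array of shape (batch, classes)."""
    mean = logits.mean(axis=0)
    dev = ((logits - mean) ** 2).mean(axis=1)  # per-sample squared deviation
    w = 1.0 / (dev + eps)                      # unstable samples get low weight
    w /= w.sum()
    return (w[:, None] * logits).sum(axis=0)   # weighted logit vector

# A coherent batch of class-0 logits plus one outlier: plain averaging
# flips to class 1, while the deviation-weighted aggregate does not.
batch = np.array([[2.0, 1.0], [2.1, 1.0], [1.9, 1.0], [0.0, 20.0]])
print(batch.mean(axis=0).argmax())               # 1 (outlier dominates the mean)
print(variance_aware_aggregate(batch).argmax())  # 0
```

The point of the sketch is only the mechanism: any weighting that shrinks the influence of high-variance members stabilizes the aggregated argmax relative to a plain mean.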

If this is right

  • StableTTA-I substantially improves prediction consistency and accuracy under coherent-batch inference such as video streams, burst photography, robotics perception, and industrial inspection.
  • StableTTA-II enables efficient logit aggregation with a single forward pass and minimal computational overhead while remaining architecture-agnostic.
  • Both variants operate without any model training or parameter updates and apply across a wide range of vision models.
  • Inference-time semantic coherence and aggregation stability offer practical perspectives for improving test-time adaptation systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If coherence between nearby inputs is common in many real deployments, these techniques could serve as a default lightweight post-processing step for deployed vision systems.
  • The same stability principle might extend to other multi-input aggregation tasks such as multi-view 3D reconstruction or temporal sensor fusion.
  • Combining StableTTA variants with existing domain-adaptation methods could be tested to measure additive gains when both coherence and distribution shift are present.

Load-bearing premise

Temporally or semantically adjacent observations are likely to belong to the same class in the target deployment settings.

What would settle it

A control evaluation on randomly ordered images, where adjacent samples come from unrelated classes, should show no accuracy gain, or an outright drop, for StableTTA-I; that would isolate how much of the reported improvement depends on the coherence assumption.
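That control is easy to simulate. The sketch below uses synthetic logits and hypothetical parameters (not the paper's setup): the simplest coherent-batch aggregator assigns the batch-mean prediction to every sample, and we compare per-sample accuracy on class-homogeneous versus randomly ordered batches:

```python
import numpy as np

rng = np.random.default_rng(0)
N_CLASSES, BATCH, N_BATCHES, SIGNAL, NOISE = 10, 8, 300, 2.5, 2.0

def run(coherent):
    """Return (per-sample accuracy, batch-aggregated accuracy)."""
    single = agg = total = 0
    for _ in range(N_BATCHES):
        if coherent:                  # all samples in the batch share one class
            labels = np.full(BATCH, rng.integers(N_CLASSES))
        else:                         # randomly ordered: unrelated classes
            labels = rng.integers(N_CLASSES, size=BATCH)
        logits = rng.normal(0.0, NOISE, (BATCH, N_CLASSES))
        logits[np.arange(BATCH), labels] += SIGNAL  # signal on the true class
        single += (logits.argmax(1) == labels).sum()
        agg += (logits.mean(0).argmax() == labels).sum()  # one prediction per batch
        total += BATCH
    return single / total, agg / total

s_c, a_c = run(coherent=True)
s_r, a_r = run(coherent=False)
print(f"coherent: single={s_c:.2f} aggregated={a_c:.2f}")  # aggregation helps
print(f"shuffled: single={s_r:.2f} aggregated={a_r:.2f}")  # aggregation hurts
```

Under the coherence assumption the batch mean cancels noise and lifts accuracy; on shuffled data the same operation smears unrelated classes together and accuracy falls, which is exactly the signature the proposed control would look for.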

Figures

Figures reproduced from arXiv: 2604.04552 by Huanying Helen Gu, Jerry Cheng, Zheng Li.

Figure 1
Figure 1. Top: Milestone comparison. We show that StableTTA+MobileNetV3 significantly outperforms the base ViT in terms of performance (+11.75% accuracy), memory usage (-97.1% parameters), and computational cost (-89.1% GFLOPs). Bottom: General comparison. StableTTA improves baseline models by 11%-33% in accuracy, with 34 models achieving more than 95% accuracy. Our method yields consistent and significant improvements…
Figure 2
Figure 2. (a) Conflict: Given branch logits {z^(i) | i = 1, 2, 3}, probabilities p^(i) = softmax(z^(i)), and predictions ŷ^(i) = argmax p^(i), different aggregation strategies may yield inconsistent results. Here, logit averaging predicts ŷ_logit = 1, soft voting predicts ŷ_soft = 2, and hard voting predicts ŷ_hard = 3. (b) Explanation: When logits (z^(1), z^(2), …) are sparsely distributed, the conflict ŷ…
Figure 3
Figure 3. Superior Efficiency and Accuracy of StableTTA. Comparison of the baseline (blue) and StableTTA (red) across the number of model parameters (left), peak GFLOPs in sequential aggregation mode (middle), and total GFLOPs in parallel aggregation mode (right). Panels plot Top-1 validation accuracy (%) against the number of experts N ∈ {4, 8, 16, 32} for models including AlexNet and ResNet50†…
Figure 4
Figure 4. (a) TTA (with our augmentation) vs. StableTTA. (b) StableTTA is robust to …
Figure 5
Figure 5. Three common inference strategies for model predictions. (a) The baseline approach, where a single model processes the input once to produce an output. (b) A multi-model ensemble, in which predictions from several independently trained models are combined to improve accuracy, at the cost of increased total model size and computational overhead. (c) TTA, where…
Figure 6
Figure 6. Empirical cumulative distribution functions (ECDFs) of …
Figure 7
Figure 7. Monte Carlo simulation. The conflict probability increases as Var(z) grows. In this simulation, we consider distributions {z ∼ N(µ, σI) | µ ∈ {(1, 0.9), (1, 0.7), (1, 0.5)}, σ ∈ [0.05, 0.25]}. The solid curves show Monte Carlo estimates of the relationship between σ and P(ŷ_logit ≠ ŷ_hard), while the dashed curves correspond to the theoretical (asymptotic) predictions. The empirical and theoretical results…
Figure 8
Figure 8. An intuitive visualization of how different data augmentation strategies influence the variance of logits under the Hölder continuity assumption. According to Eq. (1), the distance between logit vectors is bounded by the distance between augmented inputs. As illustrated in the left example, translation preserves the semantic structure of the image but introduces large pixel-wise differences between…
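The experiment in the Figure 7 caption is simple to reproduce in miniature. The sketch below (branch count and trial count are our assumptions) draws branch logits z ∼ N(µ, σI) and estimates how often the logit-averaged prediction disagrees with the hard vote:

```python
import numpy as np

rng = np.random.default_rng(1)

def conflict_prob(mu, sigma, n_branches=5, trials=5000):
    """Monte Carlo estimate of P(y_logit != y_hard) for branch
    logits z_i ~ N(mu, sigma * I)."""
    mu = np.asarray(mu, dtype=float)
    conflicts = 0
    for _ in range(trials):
        z = rng.normal(mu, sigma, size=(n_branches, mu.size))
        y_logit = z.mean(axis=0).argmax()        # logit averaging
        y_hard = np.bincount(z.argmax(axis=1),   # majority (hard) vote
                             minlength=mu.size).argmax()
        conflicts += int(y_logit != y_hard)
    return conflicts / trials

# Conflict probability grows with sigma, matching the trend in Figure 7.
for sigma in (0.05, 0.15, 0.25):
    print(sigma, conflict_prob((1.0, 0.9), sigma))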
read the original abstract

Ensemble methods improve predictive performance but often incur high memory and computational costs. We identify an aggregation instability induced by nonlinear projection and voting operations. To address both efficiency challenges and this inconsistency, we propose StableTTA, a training-free test-time adaptation method with two variants. StableTTA-I targets coherent-batch inference settings, where temporally or semantically adjacent observations are likely to belong to the same class. Examples include burst photography, video streams, robotics perception, and industrial inspection. Under coherent-batch inference, StableTTA-I substantially improves prediction consistency and accuracy through variance-aware logit aggregation. StableTTA-II establishes feature-level cropping, enabling efficient logit aggregation with a single forward pass on a single model backbone. Experiments on ImageNet-1K across 71 models demonstrate that StableTTA-I consistently improves prediction accuracy under coherent-batch inference, while StableTTA-II provides lightweight and architecture-agnostic accuracy improvements with minimal computational overhead. These results suggest that inference-time semantic coherence and aggregation stability provide useful perspectives for improving practical test-time adaptation systems.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper introduces StableTTA, a training-free test-time adaptation method with two variants for vision models. StableTTA-I applies variance-aware logit aggregation to improve prediction consistency and accuracy under coherent-batch inference (where temporally or semantically adjacent observations are assumed likely to share the same class, as in video streams or robotics). StableTTA-II uses feature-level cropping to enable efficient logit aggregation via a single forward pass on one backbone. Experiments on ImageNet-1K across 71 models are reported to show consistent accuracy gains for StableTTA-I under coherent-batch settings and lightweight improvements for StableTTA-II with minimal overhead.

Significance. If the empirical gains hold under realistic conditions, the work offers a practical, low-cost perspective on stabilizing ensemble-style aggregation at inference time without retraining or architecture changes. The training-free design and focus on semantic coherence in batches could be useful for deployment scenarios like video or burst photography, provided the coherence assumption transfers beyond idealized test conditions.

major comments (2)
  1. [Experiments] The abstract and experimental description do not specify how coherent batches are constructed on ImageNet-1K (e.g., whether they are formed by perfectly class-homogeneous blocks or by temporally adjacent samples with possible transitions). If the former, the variance reduction in StableTTA-I is maximized artificially and the reported gains may not generalize to the target settings (video streams, robotics) that exhibit gradual class changes and label noise.
  2. [Experiments] No details are provided on statistical significance testing, error bars, variance across runs, or exact baseline implementations (including how standard logit averaging or voting is performed). Without these, the claim of 'consistent' improvements across 71 models cannot be assessed for robustness.
minor comments (1)
  1. [Abstract] The abstract refers to 'nonlinear projection and voting operations' inducing instability but does not define these operations or the precise aggregation formula used in StableTTA-I.
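The disagreement the comment asks about is easy to exhibit concretely. A minimal sketch (toy logits, not the paper's notation) implements the three standard aggregators and shows two of them splitting on the same branch logits:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def y_logit(z):   # average logits, then argmax
    return z.mean(axis=0).argmax()

def y_soft(z):    # average softmax probabilities, then argmax
    return softmax(z).mean(axis=0).argmax()

def y_hard(z):    # majority vote over per-branch argmax predictions
    return np.bincount(z.argmax(axis=1), minlength=z.shape[1]).argmax()

# One very confident branch vs. two mildly confident branches: logit
# averaging follows the confident branch, hard voting follows the majority.
z = np.array([[5.0, 0.0, 0.0],
              [0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0]])
print(y_logit(z), y_soft(z), y_hard(z))   # 0 0 1 — the aggregators disagree
```

Because averaging happens before or after the nonlinear softmax/argmax, the three rules are genuinely different estimators, which is the instability the paper attributes to "nonlinear projection and voting operations".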

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address the two major comments on the experimental setup below and will incorporate clarifications and additional details in the revised version to improve transparency and robustness assessment.

read point-by-point responses
  1. Referee: [Experiments] The abstract and experimental description do not specify how coherent batches are constructed on ImageNet-1K (e.g., whether they are formed by perfectly class-homogeneous blocks or by temporally adjacent samples with possible transitions). If the former, the variance reduction in StableTTA-I is maximized artificially and the reported gains may not generalize to the target settings (video streams, robotics) that exhibit gradual class changes and label noise.

    Authors: We agree that the batch construction procedure requires explicit description. Coherent batches on ImageNet-1K were formed by sorting the validation set by ground-truth class labels and extracting contiguous blocks of same-class samples to simulate semantic adjacency under the coherence assumption stated in the paper. This design isolates the benefit of variance-aware logit aggregation without introducing label noise. We acknowledge that perfectly homogeneous blocks represent an idealized case and may overestimate gains relative to video streams with gradual transitions. In the revision we will add a dedicated subsection detailing the exact batch construction algorithm, discuss its relation to target applications, and include new experiments on partially coherent batches that incorporate controlled class transitions and label noise to better evaluate generalization. revision: yes

  2. Referee: [Experiments] No details are provided on statistical significance testing, error bars, variance across runs, or exact baseline implementations (including how standard logit averaging or voting is performed). Without these, the claim of 'consistent' improvements across 71 models cannot be assessed for robustness.

    Authors: We accept that additional statistical and implementation details are necessary for full assessment. The baselines were implemented as follows: standard logit averaging computes the mean logit vector over the batch before applying softmax; voting aggregates the argmax predictions via majority vote. All 71 models showed accuracy gains under StableTTA-I, but we did not report run-to-run variance or significance tests. In the revised manuscript we will (i) provide pseudocode for every baseline and our methods, (ii) report mean accuracy with standard deviation across three random seeds for the subset of models where stochasticity exists, (iii) include error bars on all bar plots, and (iv) add paired t-test p-values comparing StableTTA-I against the strongest baseline to quantify consistency. revision: yes
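As described in the two responses, both the coherent-batch construction and the baselines are a few lines each. The sketch below is our reading of the rebuttal, not released code:

```python
import numpy as np

def make_coherent_batches(labels, batch_size):
    """Sort sample indices by ground-truth label and cut contiguous blocks,
    so each batch is (near-)class-homogeneous, as the rebuttal describes."""
    order = np.argsort(labels, kind="stable")
    return [order[i:i + batch_size] for i in range(0, len(order), batch_size)]

def logit_averaging(logits):
    """Baseline: mean logit vector over the batch, softmax, then argmax."""
    z = logits.mean(axis=0)
    p = np.exp(z - z.max())
    p /= p.sum()
    return p.argmax()

def majority_vote(logits):
    """Baseline: majority vote over per-sample argmax predictions."""
    return np.bincount(logits.argmax(axis=1),
                       minlength=logits.shape[1]).argmax()

labels = np.array([2, 0, 1, 0, 2, 1, 0, 1, 2])
batches = make_coherent_batches(labels, batch_size=3)
print([labels[b].tolist() for b in batches])  # [[0, 0, 0], [1, 1, 1], [2, 2, 2]]
```

Note that sorting by ground-truth labels produces perfectly homogeneous blocks, which is exactly the idealization the referee flags: a video-like stream would instead yield blocks with gradual class transitions.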

Circularity Check

0 steps flagged

No significant circularity; empirical method with no self-referential reductions

full rationale

The paper introduces StableTTA as a training-free test-time adaptation approach based on variance-aware logit aggregation for coherent-batch settings and feature-level cropping for efficiency. No equations, derivations, or parameter-fitting steps are described that reduce by construction to the method's own inputs or outputs. The claims rest on empirical experiments across 71 models on ImageNet-1K rather than any self-citation chain, uniqueness theorem, or ansatz imported from prior author work. The derivation chain is self-contained as a set of heuristic aggregation rules validated externally through accuracy measurements.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claims rest on the existence of coherent-batch settings in practice and on the empirical observation of aggregation instability; no explicit free parameters, axioms, or invented entities are stated in the abstract.

pith-pipeline@v0.9.0 · 5480 in / 1129 out tokens · 35630 ms · 2026-05-10T19:36:51.035650+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

43 extracted references · 6 canonical work pages · 4 internal anchors

  1. [1]

    Bagging predictors

    Leo Breiman. Bagging predictors. Machine Learning, 24(2):123–140, 1996

  2. [2]

    Autoaugment: Learning augmentation strategies from data

    Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 113–123, 2019

  3. [3]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255. IEEE, 2009

  4. [4]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  5. [5]

    Learning both weights and connections for efficient neural network

    Song Han, Jeff Pool, John Tran, and William Dally. Learning both weights and connections for efficient neural network. Advances in Neural Information Processing Systems, 28, 2015

  6. [6]

    The elements of statistical learning: data mining, inference, and prediction, 2009

    Trevor Hastie. The elements of statistical learning: data mining, inference, and prediction, 2009

  7. [7]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  8. [8]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015

  9. [9]

    Searching for mobilenetv3

    Andrew Howard, Mark Sandler, Grace Chu, Liang-Chieh Chen, Bo Chen, Mingxing Tan, Weijun Wang, Yukun Zhu, Ruoming Pang, Vijay Vasudevan, et al. Searching for mobilenetv3. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1314–1324, 2019

  10. [10]

    Densely connected convolutional networks

    Gao Huang, Zhuang Liu, Laurens Van Der Maaten, and Kilian Q Weinberger. Densely connected convolutional networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4700–4708, 2017

  11. [11]

    SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size

    Forrest N Iandola, Song Han, Matthew W Moskewicz, Khalid Ashraf, William J Dally, and Kurt Keutzer. SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5 MB model size. arXiv preprint arXiv:1602.07360, 2016

  12. [12]

    Efficient tests for normality, homoscedasticity and serial independence of regression residuals

    Carlos M Jarque and Anil K Bera. Efficient tests for normality, homoscedasticity and serial independence of regression residuals. Economics Letters, 6(3):255–259, 1980

  13. [13]

    Learning loss for test-time augmentation

    Ildoo Kim, Younghoon Kim, and Sungwoong Kim. Learning loss for test-time augmentation. Advances in neural information processing systems, 33:4163–4174, 2020

  14. [14]

    Imagenet classification with deep convolutional neural networks

    Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 25, 2012

  15. [15]

    Losstransform: Reformulating the loss function for contrastive learning

    Zheng Li, Jerry Cheng, and Huanying Helen Gu. Losstransform: Reformulating the loss function for contrastive learning. Information, 16(12):1068, 2025

  16. [16]

    Swin transformer: Hierarchical vision transformer using shifted windows

    Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. In Proceedings of the IEEE/CVF international conference on computer vision, pages 10012–10022, 2021

  17. [17]

    Swin transformer v2: Scaling up capacity and resolution

    Ze Liu, Han Hu, Yutong Lin, Zhuliang Yao, Zhenda Xie, Yixuan Wei, Jia Ning, Yue Cao, Zheng Zhang, Li Dong, et al. Swin transformer v2: Scaling up capacity and resolution. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12009–12019, 2022

  18. [18]

    A convnet for the 2020s

    Zhuang Liu, Hanzi Mao, Chao-Yuan Wu, Christoph Feichtenhofer, Trevor Darrell, and Saining Xie. A convnet for the 2020s. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 11976–11986, 2022

  19. [19]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015

  20. [20]

    Greedy policy search: A simple baseline for learnable test-time augmentation

    Alexander Lyzhov, Yuliya Molchanova, Arsenii Ashukha, Dmitry Molchanov, and Dmitry Vetrov. Greedy policy search: A simple baseline for learnable test-time augmentation. In Conference on uncertainty in artificial intelligence, pages 1308–1317. PMLR, 2020

  21. [21]

    Shufflenet v2: Practical guidelines for efficient cnn architecture design

    Ningning Ma, Xiangyu Zhang, Hai-Tao Zheng, and Jian Sun. Shufflenet v2: Practical guidelines for efficient cnn architecture design. In Proceedings of the European conference on computer vision (ECCV), pages 116–131, 2018

  22. [22]

    Exploring the limits of weakly supervised pretraining

    Dhruv Mahajan, Ross Girshick, Vignesh Ramanathan, Kaiming He, Manohar Paluri, Yixuan Li, Ashwin Bharambe, and Laurens Van Der Maaten. Exploring the limits of weakly supervised pretraining. In Proceedings of the European conference on computer vision (ECCV), pages 181–196, 2018

  23. [23]

    Scikit-learn: Machine learning in Python

    Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, et al. Scikit-learn: Machine learning in Python. The Journal of Machine Learning Research, 12:2825–2830, 2011

  24. [24]

    How to train state-of-the-art models using torchvision’s latest primitives

    PyTorch Team. How to train state-of-the-art models using torchvision’s latest primitives. https://pytorch.org/blog/how-to-train-state-of-the-art-models-using-torchvision-latest-primitives/, accessed 2026-03-25

  26. [26]

    Faster r-cnn: Towards real-time object detection with region proposal networks

    Shaoqing Ren, Kaiming He, Ross Girshick, and Jian Sun. Faster r-cnn: Towards real-time object detection with region proposal networks. Advances in Neural Information Processing Systems, 28, 2015

  27. [27]

    Imagenet-21k pretraining for the masses

    Tal Ridnik, Emanuel Ben-Baruch, Asaf Noy, and Lihi Zelnik-Manor. Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972, 2021

  28. [28]

    Mobilenetv2: Inverted residuals and linear bottlenecks

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: Inverted residuals and linear bottlenecks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4510–4520, 2018

  29. [29]

    Better aggregation in test-time augmentation

    Divya Shanmugam, Davis Blalock, Guha Balakrishnan, and John Guttag. Better aggregation in test-time augmentation. In Proceedings of the IEEE/CVF international conference on computer vision, pages 1214–1223, 2021

  30. [30]

    Test-time augmentation improves efficiency in conformal prediction

    Divya Shanmugam, Helen Lu, Swami Sankaranarayanan, and John Guttag. Test-time augmentation improves efficiency in conformal prediction. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 20622–20631, 2025

  31. [31]

    Very Deep Convolutional Networks for Large-Scale Image Recognition

    Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

  32. [32]

    Going deeper with convolutions

    Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1–9, 2015

  33. [33]

    Rethinking the inception architecture for computer vision

    Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2818–2826, 2016

  34. [34]

    Efficientnet: Rethinking model scaling for convolutional neural networks

    Mingxing Tan and Quoc Le. Efficientnet: Rethinking model scaling for convolutional neural networks. In International conference on machine learning, pages 6105–6114. PMLR, 2019

  35. [35]

    Efficientnetv2: Smaller models and faster training

    Mingxing Tan and Quoc Le. Efficientnetv2: Smaller models and faster training. In International conference on machine learning, pages 10096–10106. PMLR, 2021

  36. [36]

    Mnasnet: Platform-aware neural architecture search for mobile

    Mingxing Tan, Bo Chen, Ruoming Pang, Vijay Vasudevan, Mark Sandler, Andrew Howard, and Quoc V Le. Mnasnet: Platform-aware neural architecture search for mobile. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 2820–2828, 2019

  37. [37]

    Torchvision: Pytorch’s computer vision library, 2016

    TorchVision Contributors. Torchvision: Pytorch’s computer vision library, 2016. https://github.com/pytorch/vision

  38. [38]

    Maxvit: Multi-axis vision transformer

    Zhengzhong Tu, Hossein Talebi, Han Zhang, Feng Yang, Peyman Milanfar, Alan Bovik, and Yinxiao Li. Maxvit: Multi-axis vision transformer. In European conference on computer vision, pages 459–479. Springer, 2022

  39. [39]

    Lipschitz regularity of deep neural networks: analysis and efficient estimation

    Aladin Virmaux and Kevin Scaman. Lipschitz regularity of deep neural networks: analysis and efficient estimation. Advances in Neural Information Processing Systems, 31, 2018

  40. [40]

    Aggregated residual transformations for deep neural networks

    Saining Xie, Ross Girshick, Piotr Dollár, Zhuowen Tu, and Kaiming He. Aggregated residual transformations for deep neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1492–1500, 2017

  41. [41]

    Cutmix: Regularization strategy to train strong classifiers with localizable features

    Sangdoo Yun, Dongyoon Han, Seong Joon Oh, Sanghyuk Chun, Junsuk Choe, and Youngjoon Yoo. Cutmix: Regularization strategy to train strong classifiers with localizable features. In Proceedings of the IEEE/CVF international conference on computer vision, pages 6023–6032, 2019

  42. [42]

    Wide Residual Networks

    Sergey Zagoruyko and Nikos Komodakis. Wide residual networks. arXiv preprint arXiv:1605.07146, 2016

  43. [43]

    mixup: Beyond empirical risk minimization

    Hongyi Zhang, Moustapha Cisse, Yann N Dauphin, and David Lopez-Paz. mixup: Beyond empirical risk minimization. In International Conference on Learning Representations, 2018