
arxiv: 2606.10905 · v1 · pith:XXHJRUEInew · submitted 2026-06-09 · 💻 cs.CV

Beyond Model Size: Probing the Gaps in Visual in-Context Learning by Training a Tiny Model

Pith reviewed 2026-06-27 13:29 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual in-context learning · model scaling · benchmarking · tiny models · adaptive vision · task encoding · evaluation metrics · distribution shift

The pith

A 1-million-parameter visual in-context model performs on par with models 7000 times larger on several adaptive tasks, showing that current benchmarks fail to isolate true adaptability.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper trains a 1-million-parameter model on only 70,000 images and pits it against far larger visual in-context learning systems. It evaluates both under small distribution shifts, unseen task encodings, and entirely new tasks that the field intends to solve. The results indicate that measured performance gaps arise in part from how tasks are presented to the model, which tasks appeared during pre-training, and which metrics are reported. A reader should care because the work questions whether scaling model size is necessary or sufficient for adaptive vision capabilities. It points instead to weaknesses in how those capabilities are currently tested.

Core claim

By training a severely capacity-capped 1M-parameter visual in-context learning model on a modest dataset and comparing it directly to 7000-times-larger counterparts, the authors establish that existing evaluation protocols do not adequately capture adaptive capabilities with respect to task encoding, pre-training task selection, and metric choice.

What carries the argument

The 1-million-parameter visual in-context learning model trained on 70,000 images, deployed as an extreme low-capacity counterexample to test whether large scale is required for adaptability.
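
To make the scale gap concrete, here is a quick back-of-the-envelope calculation. It assumes the "7000 times larger" figure is a ratio of parameter counts, as the abstract suggests. The exact sizes of the larger models are not given on this page, so the result is an order-of-magnitude estimate only.

```python
# Assumption: the 7000x ratio is a parameter-count ratio.
tiny_params = 1_000_000   # 1M parameters, as stated in the paper
scale_factor = 7_000      # "7000 times larger", as stated in the paper

# Implied size of the larger models under that assumption.
large_params = tiny_params * scale_factor
print(f"{large_params:,} parameters (~{large_params / 1e9:.0f}B)")
```

Under that reading, the comparison models sit at roughly the 7-billion-parameter scale.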

If this is right

  • VICL progress reported on current benchmarks may overstate actual adaptability gains.
  • Standardized task encodings and metric definitions become necessary before scaling claims can be trusted.
  • Pre-training task choice must be reported and controlled when comparing adaptive performance.
  • Small models can serve as useful probes for isolating benchmarking artifacts in adaptive vision.
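
The point about metric definitions can be illustrated with a toy case. The sketch below is not taken from the paper: the masks are invented, and pixel accuracy and intersection-over-union (IoU) are chosen here only as two familiar segmentation metrics. It shows that the same two predictions can be ranked in opposite order depending on which metric is reported.

```python
def pixel_accuracy(pred, target):
    """Fraction of pixels where prediction and target agree."""
    correct = sum(1 for p, t in zip(pred, target) if p == t)
    return correct / len(target)

def iou(pred, target):
    """Intersection over union of the foreground (value 1) pixels."""
    inter = sum(1 for p, t in zip(pred, target) if p == 1 and t == 1)
    union = sum(1 for p, t in zip(pred, target) if p == 1 or t == 1)
    return inter / union if union else 1.0

# Invented toy data: a flattened 10-pixel mask with a small foreground.
target = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
pred_a = [0] * 10                          # predicts background everywhere
pred_b = [1, 1, 1, 1, 1, 0, 0, 0, 0, 0]    # finds the object, over-segments

for name, pred in [("A", pred_a), ("B", pred_b)]:
    print(name, pixel_accuracy(pred, target), iou(pred, target))
# A 0.8 0.0
# B 0.7 0.4
```

Prediction A wins on pixel accuracy while never finding the object; prediction B wins on IoU. A benchmark that reports only one of the two would tell a different story about which model is better.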

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved evaluation protocols could allow researchers to test adaptability without requiring massive compute.
  • The same tiny-model probe could be applied to other modalities to check whether similar benchmarking gaps exist.
  • Future work might prioritize data curation and encoding design over raw parameter count for in-context adaptation.
  • The gap between reported and actual adaptability may slow progress until benchmarks are revised.

Load-bearing premise

Observed performance differences between the tiny model and much larger models can be attributed primarily to shortcomings in benchmarking rather than to the extreme difference in model capacity.

What would settle it

A re-evaluation in which the tiny model is given identical task encodings, pre-training tasks, and metrics as the large models and still shows large consistent deficits would falsify the central claim.
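
One way to phrase that falsification test in code is sketched below. Everything here is hypothetical: the function name, the setting labels, the `min_gap` threshold, and the scores are placeholders for illustration and are not results from the paper. The sketch assumes higher scores are better and that both models were scored under identical encodings and metrics.

```python
def deficit_is_consistent(tiny_scores, large_scores, min_gap):
    """Return True if the large model beats the tiny model by at least
    min_gap in every setting (higher score = better)."""
    if tiny_scores.keys() != large_scores.keys():
        raise ValueError("both models must be scored on the same settings")
    return all(large_scores[s] - tiny_scores[s] >= min_gap for s in tiny_scores)

# Placeholder scores, NOT taken from the paper.
tiny = {"small_shift": 0.60, "unseen_encoding": 0.30, "new_task": 0.10}
large = {"small_shift": 0.62, "unseen_encoding": 0.55, "new_task": 0.12}

print(deficit_is_consistent(tiny, large, min_gap=0.10))  # False
```

With these placeholder numbers the gap is large in only one of the three settings, so the deficit is not consistent. A result of `True` on real, controlled measurements would be the kind of evidence that counts against the central claim.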

Figures

Figures reproduced from arXiv: 2606.10905 by Markus Ulrich, Simon Reiß, Steven Landgraf, Sunil Khatri.

Figure 1: Example images of the task data used in multi-task pre-training. The first row shows the inputs and the second row shows the outputs; each column is a different task.
Figure 2: Qualitative results for TinyVICL with different losses. While results are imperfect, the low-capacity model with 1M parameters learns to address the seven tasks.
Figure 3: Qualitative results for all VICL models for out-of-domain prompting.
Original abstract

Visual in-Context Learning (VICL) aims at making progress towards adaptive vision models, that can -- based on a few examples -- adapt to a new task at test-time. With the history of in-context learning in natural language processing research, where large, parameter-heavy models are in use, one pathway that current VICL methods take is model- and data-scaling as key ingredients. Yet, it is not clear, whether these ingredients are the key for in-context learning to take shape in vision models. To stress-test such large models, we challenge them with an extreme counterexample: we train a tiny visual in-context model with merely $1$ million parameters and a modest amount of $70,000$ images. We compare the results of this severely capacity capped tiny model to $7,000\times$ larger VICL models in different adaptive settings, (1) on image data with small distribution shifts, (2) on unseen task encodings and (3) on a completely new task, i.e., the setting VICL envisions. With the chasm of training resources between the tiny- and large models, our experiments showcase a lack in how adaptive capabilities are measured, with respect to how tasks are encoded, which tasks were used in pre-training and the choice of metrics. These gaps in current VICL benchmarking underscore a need for innovation in evaluation of adaptive capabilities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper trains a 1M-parameter visual in-context learning model on 70k images and compares its performance to 7000× larger VICL models across three adaptive settings (small distribution shifts, unseen task encodings, and completely new tasks). It concludes that the results expose deficiencies in current VICL benchmarking with respect to task encoding, pre-training task selection, and metrics.

Significance. If the tiny model's results can be shown to isolate benchmarking deficiencies from capacity limitations, the work would usefully redirect attention from model scaling toward improved evaluation protocols for adaptive vision capabilities.

major comments (2)
  1. [Abstract] The central claim that the experiments 'showcase a lack in how adaptive capabilities are measured' rests on an empirical comparison, yet the abstract supplies no quantitative results, error analysis, or construction details for the three adaptive settings, preventing verification that performance differences can be attributed to benchmarking gaps rather than the 7000× capacity disparity.
  2. [Experimental results] Experimental comparison (new-task setting): the attribution of gaps to measurement practices rather than insufficient capacity for in-context adaptation requires evidence that the 1M model reaches non-trivial performance on the completely new task under controlled encodings; without such data the load-bearing assumption remains untested.
minor comments (1)
  1. [Abstract] Clarify whether the 70,000 images constitute the total training set or are allocated across tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

  1. Referee: [Abstract] The central claim that the experiments 'showcase a lack in how adaptive capabilities are measured' rests on an empirical comparison, yet the abstract supplies no quantitative results, error analysis, or construction details for the three adaptive settings, preventing verification that performance differences can be attributed to benchmarking gaps rather than the 7000× capacity disparity.

    Authors: We agree that the abstract would benefit from additional quantitative details to support the central claim and facilitate verification. In the revised version, we will incorporate key performance metrics from the three adaptive settings, along with brief descriptions of the experimental constructions, while maintaining conciseness. revision: yes

  2. Referee: [Experimental results] Experimental comparison (new-task setting): the attribution of gaps to measurement practices rather than insufficient capacity for in-context adaptation requires evidence that the 1M model reaches non-trivial performance on the completely new task under controlled encodings; without such data the load-bearing assumption remains untested.

    Authors: The manuscript reports that the 1M model achieves non-trivial performance on the new task (exceeding random baselines under controlled encodings) in Section 4.3. This forms the basis for attributing gaps to benchmarking practices. To make this evidence more prominent, we will add explicit statements highlighting the non-trivial results relative to baselines. revision: partial
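
What "exceeding random baselines" could mean operationally is sketched below. This is an assumption-laden illustration, not the paper's protocol: it assumes an image-to-image task with pixel values in [0, 1], uses mean absolute error as the metric, and introduces a hypothetical `margin` to decide when a model counts as non-trivial. The baseline is estimated by scoring uniformly random outputs against the same target.

```python
import random

def mean_abs_error(pred, target):
    """Mean absolute difference per pixel (lower is better)."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(target)

def random_baseline_mae(target, trials=500, seed=0):
    """Average error of uniformly random outputs in [0, 1]."""
    rng = random.Random(seed)  # seeded so the estimate is repeatable
    total = 0.0
    for _ in range(trials):
        guess = [rng.random() for _ in target]
        total += mean_abs_error(guess, target)
    return total / trials

def is_nontrivial(model_error, baseline_error, margin=0.05):
    """Hypothetical rule: beat the random baseline by at least `margin`."""
    return model_error <= baseline_error - margin

# Invented toy target: a flat mid-gray image with 16 pixels.
target = [0.5] * 16
baseline = random_baseline_mae(target)
print(is_nontrivial(0.10, baseline))
```

For this flat target the expected error of a uniform random guess is 0.25, so a model error of 0.10 clears the baseline with room to spare, while an error of 0.24 would not. The choice of random baseline (uniform noise, a constant output, or copying the prompt) is itself a design decision that should be reported.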

Circularity Check

0 steps flagged

No circularity: empirical comparison is self-contained


The paper reports training a 1M-parameter model on 70k images and comparing its performance to 7000× larger VICL models across three settings. No equations, parameter fits, or derivations are present. The central claim, that observed gaps indicate deficiencies in task encoding, pre-training tasks, and metrics rather than capacity, is an interpretation of experimental outcomes, not a reduction to self-definition or self-citation. No load-bearing self-citations or ansatzes are invoked; the work is a direct empirical stress-test.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the tiny model constitutes a fair stress test whose results can be interpreted as evidence of benchmarking deficiencies; no free parameters or invented entities are described.

axioms (1)
  • domain assumption: Performance of a severely capacity-capped model can be used to diagnose deficiencies in how adaptive capabilities are measured in much larger models.
    This premise is invoked when the abstract states that the chasm in training resources between tiny and large models reveals gaps in measurement.

pith-pipeline@v0.9.1-grok · 5792 in / 1242 out tokens · 20924 ms · 2026-06-27T13:29:59.483407+00:00 · methodology

