pith. sign in

arxiv: 2606.10967 · v1 · pith:ZV3P6ULFnew · submitted 2026-06-09 · 💻 cs.CV

Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

Pith reviewed 2026-06-27 13:23 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual in-context learningbenchmarkone-shot evaluationdomain shifttask generalizationmodel adaptationfailure modescomputer vision
0
0 comments X

The pith

A broad benchmark shows visual in-context models have limited ability to adapt to new image domains and tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a benchmark called VIBE to test how visual in-context learning models perform when given new image distributions and task types that differ from their pre-training data. It runs six models through 14 datasets and 12 tasks, covering 106 combinations, all under a single one-shot evaluation protocol. This reveals consistent limitations and failure patterns that narrower tests had missed. A sympathetic reader would care because the core promise of in-context learning is test-time adaptation without further training, and the results question how far current models reach that goal.

Core claim

The paper establishes that visual in-context learning models exhibit limited adaptation capabilities when evaluated on diverse new imaging domains and a wide range of tasks under a unified one-shot protocol, uncovering systematic failure modes not visible in prior narrow setups that reused pre-training distributions.

What carries the argument

The VIBE benchmark, which organizes 106 dataset-task combinations under a reproducible one-shot evaluation protocol to measure adaptation to new distributions.

If this is right

  • Current visual in-context models require targeted improvements to handle shifts in both image appearance and task structure.
  • Evaluation of these models must use out-of-distribution datasets to avoid overestimating adaptation ability.
  • Observed failure modes point to concrete areas where model design can be refined for better generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Wider use of cross-domain benchmarks like this one could guide development of more robust test-time adaptation methods.
  • The same stress-testing approach could be applied to in-context methods in other data modalities to check for parallel limits.
  • Experiments that vary the number of context examples beyond one shot could test whether the identified failures are fixed by additional demonstrations.

Load-bearing premise

The chosen datasets and tasks are new enough relative to the models' pre-training data that performance differences reflect genuine adaptation rather than recall.

What would settle it

If the six models achieved similar accuracy on the VIBE tasks as they do on tasks drawn from their original pre-training distributions, the claim of limited adaptation to new distributions would be weakened.

Figures

Figures reproduced from arXiv: 2606.10967 by Alexander Jaus, Jiale Wei, Pradnya Halady, Simon Rei{\ss}, Zdravko Marinov.

Figure 1
Figure 1. Figure 1: Overview of our Visual In-Context BEnchmark (VIBE) with diverse tasks and domains. Abstract Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and thereby can adapt to new vision tasks at test-time. Yet, the evaluation of the adaptation capabilities of these models has been limited to narrow setups that mainly mirror ta… view at source ↗
Figure 2
Figure 2. Figure 2: Visualization of task encodings in the VIBE toolkit. Perception & Localization, Restora￾tion & Enhancement, and Manipulation & Transformation tasks tinted in their respective colors. 4 [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Quantitative results for semantic segmentation in color-aware mIoU (for DUKE foreground [PITH_FULL_IMAGE:figures/full_fig_p006_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Quantitative results for image restoration tasks. We report PSNR for inpainting, denoising [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Quantitative results for modality transfer in PSNR, and rotation and flipping in SSIM. [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative results for subset of 14 task-dataset pairs in [PITH_FULL_IMAGE:figures/full_fig_p009_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative examples for Painter on NYU-Depth V2, prompted with the colorization task. [PITH_FULL_IMAGE:figures/full_fig_p017_7.png] view at source ↗
Figure 8
Figure 8. Figure 8: Qualitative samples for tasks segmentation and depth estimation. [PITH_FULL_IMAGE:figures/full_fig_p036_8.png] view at source ↗
Figure 9
Figure 9. Figure 9: Qualitative samples for edge detection, object detection and keypoint detection. [PITH_FULL_IMAGE:figures/full_fig_p037_9.png] view at source ↗
Figure 10
Figure 10. Figure 10: Qualitative samples for the colorization task covering different domains. [PITH_FULL_IMAGE:figures/full_fig_p038_10.png] view at source ↗
Figure 11
Figure 11. Figure 11: Qualitative samples for the denoising task covering different domains. [PITH_FULL_IMAGE:figures/full_fig_p039_11.png] view at source ↗
Figure 12
Figure 12. Figure 12: Qualitative samples for the inpainting task covering different domains. [PITH_FULL_IMAGE:figures/full_fig_p040_12.png] view at source ↗
Figure 13
Figure 13. Figure 13: Qualitative samples for different artifact removal modes in CDD [36]. [PITH_FULL_IMAGE:figures/full_fig_p041_13.png] view at source ↗
Figure 14
Figure 14. Figure 14: Qualitative samples for modality transfer, rotation, flip and object removal. [PITH_FULL_IMAGE:figures/full_fig_p042_14.png] view at source ↗
read the original abstract

Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and thereby can adapt to new vision tasks at test-time. Yet, the evaluation of the adaptation capabilities of these models has been limited to narrow setups that mainly mirror tasks or image domains from pre-training for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we are able to get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image- and task distributions. We stress test six models on $14$ datasets and $12$ tasks (in total, we explore $106$ dataset-task combinations) and compare them under a unified, reproducible evaluation protocol, in an one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Visual In-Context BEnchmark (VIBE) to address limitations in prior evaluations of visual in-context learning, which often overlap with pre-training distributions. It constructs a benchmark spanning 14 datasets and 12 tasks (106 dataset-task combinations), evaluates six models in a unified one-shot protocol, and reports insights on limitations, systematic failure modes, and promising directions, with plans to release the VIBE toolkit.

Significance. If the datasets satisfy the OOD condition, the work provides a more rigorous assessment of genuine adaptation in visual in-context models than prior narrow setups. The unified reproducible protocol, broad coverage across domains and tasks, and commitment to open-sourcing the toolkit are clear strengths that would support future research.

major comments (2)
  1. [§3 and §4] §3 (Dataset Curation) and §4 (Evaluation Protocol): The central claim that measured one-shot performance reflects genuine adaptation (rather than recall) requires the 14 datasets to lie outside the pre-training distributions of the six models. The manuscript motivates this via diverse domains but supplies no overlap statistics, pre-training corpus comparisons, or explicit OOD verification for the chosen datasets and tasks; this verification is load-bearing for interpreting the 106 combinations as testing adaptation.
  2. [§5] §5 (Results and Analysis): The reported insights on systematic failure modes and promising directions are presented without quantitative breakdowns (e.g., per-model error patterns or statistical significance tests across the 106 combinations) that would allow readers to assess whether the observed limitations are robust or driven by a subset of tasks.
minor comments (2)
  1. [Abstract and §4] The abstract and introduction use 'one-shot setting' inconsistently with the later description of the protocol; clarify whether support examples are strictly one or few-shot.
  2. [Table 1] Table 1 (or equivalent dataset summary) should include explicit columns for domain diversity metrics or pre-training overlap indicators to make the selection criteria transparent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the claims regarding genuine adaptation and the robustness of the reported insights. We address each major comment below.

read point-by-point responses
  1. Referee: [§3 and §4] §3 (Dataset Curation) and §4 (Evaluation Protocol): The central claim that measured one-shot performance reflects genuine adaptation (rather than recall) requires the 14 datasets to lie outside the pre-training distributions of the six models. The manuscript motivates this via diverse domains but supplies no overlap statistics, pre-training corpus comparisons, or explicit OOD verification for the chosen datasets and tasks; this verification is load-bearing for interpreting the 106 combinations as testing adaptation.

    Authors: We agree that explicit OOD verification is important for interpreting the results as genuine adaptation. The 14 datasets were deliberately chosen from domains (medical imaging, remote sensing, microscopy, and other specialized scientific imaging) that are known to have minimal overlap with common pre-training corpora such as ImageNet, LAION, or COCO. However, the manuscript does not include quantitative overlap statistics or direct corpus comparisons. In revision, we will add an appendix providing dataset provenance details, references to known pre-training data distributions, and any available overlap analyses or citations that support the OOD status of these tasks and domains. revision: yes

  2. Referee: [§5] §5 (Results and Analysis): The reported insights on systematic failure modes and promising directions are presented without quantitative breakdowns (e.g., per-model error patterns or statistical significance tests across the 106 combinations) that would allow readers to assess whether the observed limitations are robust or driven by a subset of tasks.

    Authors: The current manuscript already reports full performance tables and figures covering all 106 dataset-task combinations, which form the basis for the discussed patterns and failure modes. To increase rigor, we will augment Section 5 with additional quantitative elements: per-model aggregated statistics across task categories, breakdowns of error patterns (e.g., by failure type), and statistical significance tests (such as paired comparisons or non-parametric tests) across the combinations to confirm that the observed limitations are not driven by outliers. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential reductions

full rationale

The paper constructs and applies the VIBE benchmark to evaluate six models across 106 dataset-task combinations in a one-shot setting. No equations, fitted parameters, or derivations are present. The central claims rest on direct experimental comparisons under a unified protocol rather than any quantity that reduces to its own inputs by construction. Dataset selection is motivated by the goal of testing adaptation outside pre-training distributions, but this is an empirical assumption (not a derived result) and does not create circularity. No self-citation chains or ansatzes are invoked to justify core results. The work is self-contained as a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The paper introduces no new mathematical entities or fitted parameters; the central claim rests on empirical evaluation under standard machine-learning assumptions about one-shot protocols and dataset diversity.

axioms (1)
  • domain assumption One-shot evaluation is a valid protocol for measuring adaptation to new distributions.
    The entire evaluation is performed in the one-shot setting as stated in the abstract.

pith-pipeline@v0.9.1-grok · 5738 in / 1165 out tokens · 26281 ms · 2026-06-27T13:23:18.528914+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

111 extracted references · 6 canonical work pages

  1. [1]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Akkaynak, D., Treibitz, T.: Sea-thru: A method for removing water from underwater images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1682–1691 (2019)

  2. [2]

    In: ACM SIGGRAPH 2024 conference papers

    Alaluf, Y ., Garibi, D., Patashnik, O., Averbuch-Elor, H., Cohen-Or, D.: Cross-image attention for zero-shot appearance transfer. In: ACM SIGGRAPH 2024 conference papers. pp. 1–12 (2024)

  3. [3]

    arXiv preprint arXiv:2503.09837 (2025)

    Anis, A.M., Ali, H., Sarfraz, S.: On the limitations of vision-language models in understanding image transforms. arXiv preprint arXiv:2503.09837 (2025)

  4. [4]

    In: 2020 International joint conference on neural networks (IJCNN)

    Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K.: Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: 2020 International joint conference on neural networks (IJCNN). pp. 1–8. IEEE (2020)

  5. [5]

    In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

    Bai, Y ., Geng, X., Mangalam, K., Bar, A., Yuille, A.L., Darrell, T., Malik, J., Efros, A.A.: Sequential modeling enables scalable learning for large vision models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22861–22872 (2024)

  6. [6]

    arXiv preprint arXiv:2106.08254 (2021)

    Bao, H., Dong, L., Piao, S., Wei, F.: Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)

  7. [7]

    Advances in neural information processing systems35, 25005–25017 (2022)

    Bar, A., Gandelsman, Y ., Darrell, T., Globerson, A., Efros, A.: Visual prompting via image inpainting. Advances in neural information processing systems35, 25005–25017 (2022)

  8. [8]

    In: International conference on machine learning

    Barrett, D., Hill, F., Santoro, A., Morcos, A., Lillicrap, T.: Measuring abstract reasoning in neural networks. In: International conference on machine learning. pp. 511–520. PMLR (2018)

  9. [9]

    Journal of machine learning research3(Feb), 1137–1155 (2003)

    Bengio, Y ., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. Journal of machine learning research3(Feb), 1137–1155 (2003)

  10. [10]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Bitton, Y ., Yosef, R., Strugo, E., Shahaf, D., Schwartz, R., Stanovsky, G.: Vasr: Visual analogies of situation recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 241–249 (2023)

  11. [11]

    Advances in Neural Information Processing Systems36, 73299–73311 (2023)

    Blumenstiel, B., Jakubik, J., Kühne, H., Vössing, M.: What a mess: Multi-domain evaluation of zero-shot semantic segmentation. Advances in Neural Information Processing Systems36, 73299–73311 (2023)

  12. [12]

    arXiv preprint arXiv:2604.13883 (2026)

    Born, F., Neuhäuser, T., Muttenthaler, L., Roads, B.D., Spitzer, B., Lampinen, A.K., Jones, M., Müller, K.R., Mozer, M.C.: Context sensitivity improves human-machine visual alignment. arXiv preprint arXiv:2604.13883 (2026)

  13. [13]

    Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000) 10

  14. [14]

    In: DAGM German Conference on Pattern Recognition

    Bratuli´c, J., Mittal, S., Hoffmann, D.T., Böhm, S., Schirrmeister, R.T., Ball, T., Rupprecht, C., Brox, T.: Unlocking in-context learning for natural datasets beyond language modelling. In: DAGM German Conference on Pattern Recognition. pp. 303–319. Springer (2025)

  15. [15]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023)

  16. [16]

    Advances in neural information processing systems33, 1877–1901 (2020)

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020)

  17. [17]

    Machine learning28(1), 41–75 (1997)

    Caruana, R.: Multitask learning. Machine learning28(1), 41–75 (1997)

  18. [18]

    Advances in neural information processing systems35, 18878–18891 (2022)

    Chan, S., Santoro, A., Lampinen, A., Wang, J., Singh, A., Richemond, P., McClelland, J., Hill, F.: Data distributional properties drive emergent in-context learning in transformers. Advances in neural information processing systems35, 18878–18891 (2022)

  19. [19]

    In: International conference on machine learning

    Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PmLR (2020)

  20. [20]

    In: Computer graphics forum

    Cohen, N., Newman, Y ., Shamir, A.: Semantic segmentation in art paintings. In: Computer graphics forum. vol. 41, pp. 261–275. Wiley Online Library (2022)

  21. [21]

    O’Reilly (2013)

    Collette, A.: Python and HDF5. O’Reilly (2013)

  22. [22]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)

  23. [23]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Czolbe, S., Dalca, A.V .: Neuralizer: General neuroimage analysis without re-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6217–6230 (2023)

  24. [24]

    In: 2009 IEEE conference on computer vision and pattern recognition

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. Ieee (2009)

  25. [25]

    In: International conference on machine learning

    Donahue, J., Jia, Y ., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. In: International conference on machine learning. pp. 647–655. PMLR (2014)

  26. [26]

    In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., De- hghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021....

  27. [27]

    Environmental Modelling & Software149, 105333 (2022)

    Erfani, S.M.H., Wu, Z., Wu, X., Wang, S., Goharian, E.: Atlantis: A benchmark for semantic segmentation of waterbody images. Environmental Modelling & Software149, 105333 (2022)

  28. [28]

    In: Forty-first international conference on machine learning (2024)

    Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y ., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)

  29. [29]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021)

  30. [30]

    International journal of computer vision88(2), 303–338 (2010)

    Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision88(2), 303–338 (2010)

  31. [31]

    In: International conference on machine learning

    Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International conference on machine learning. pp. 1126–1135. PMLR (2017) 11

  32. [32]

    https://www.iwf.org.uk/about-us/why-we-exist/our-research/ how-ai-is-being-abused-to-create-child-sexual-abuse-imagery/ (2026), ac- cessed: 2026-05-05

    Foundation, I.W.: Ai-generated child sexual abuse: 2026 report on trends, data & human impact. https://www.iwf.org.uk/about-us/why-we-exist/our-research/ how-ai-is-being-abused-to-create-child-sexual-abuse-imagery/ (2026), ac- cessed: 2026-05-05

  33. [33]

    arXiv preprint arXiv:2306.09344 (2023)

    Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344 (2023)

  34. [34]

    In: Proceedings of the IEEE international conference on computer vision

    Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 1440–1448 (2015)

  35. [35]

    ACM Transactions on Graphics (TOG)43(4), 1–15 (2024)

    Gu, Z., Yang, S., Liao, J., Huo, J., Gao, Y .: Analogist: Out-of-the-box visual in-context learning with image diffusion model. ACM Transactions on Graphics (TOG)43(4), 1–15 (2024)

  36. [36]

    In: European conference on computer vision

    Guo, Y ., Gao, Y ., Lu, Y ., Zhu, H., Liu, R.W., He, S.: Onerestore: A universal restoration framework for composite degradation. In: European conference on computer vision. pp. 255–272. Springer (2024)

  37. [37]

    In: European conference on computer vision

    Guo, Y ., Codella, N.C., Karlinsky, L., Codella, J.V ., Smith, J.R., Saenko, K., Rosing, T., Feris, R.: A broader study of cross-domain few-shot learning. In: European conference on computer vision. pp. 124–141. Springer (2020)

  38. [38]

    In: Salakhutdinov, R., Kolter, Z., Heller, K.A., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F

    Hao, Z., Guo, J., Wang, C., Tang, Y ., Wu, H., Hu, H., Han, K., Xu, C.: Data-efficient large vision models through sequential autoregression. In: Salakhutdinov, R., Kolter, Z., Heller, K.A., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F. (eds.) Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Pr...

  39. [39]

    Harris and K

    Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M.H., Brett, M., Haldane, A., del Río, J.F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., Oliphant, T.E.: A...

  40. [40]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    He, K., Chen, X., Xie, S., Li, Y ., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

  41. [41]

    In: Pro- ceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques

    Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Pro- ceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. p. 327–340. SIGGRAPH ’01, Association for Computing Machinery, New York, NY , USA (2001). https://doi.org/10.1145/383259.383295,https://doi.org/10.1145/383259.383295

  42. [42]

    arXiv preprint arXiv:1902.00120 (2019)

    Hill, F., Santoro, A., Barrett, D.G., Morcos, A.S., Lillicrap, T.: Learning to make analogies by contrasting abstract relational structure. arXiv preprint arXiv:1902.00120 (2019)

  43. [43]

    CoRR (2015),http://arxiv.org/abs/1503.02531

    Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. CoRR (2015),http://arxiv.org/abs/1503.02531

  44. [44]

    Advances in neural information processing systems33, 6840–6851 (2020)

    Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems33, 6840–6851 (2020)

  45. [45]

    The analogical mind: Perspectives from cognitive science pp

    Hofstadter, D.R., et al.: Analogy as the core of cognition. The analogical mind: Perspectives from cognitive science pp. 499–538 (2001)

  46. [46]

    In: International Conference on Learning Representations (2022),https://openreview.net/forum?id=nZeVKeeFYf9 12

    Hu, E.J., Shen, Y ., Wallis, P., Allen-Zhu, Z., Li, Y ., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022),https://openreview.net/forum?id=nZeVKeeFYf9 12

  47. [47]

    In: Proceedings of the AAAI Conference on Artificial Intelligence

    Hu, S., Ma, Y ., Liu, X., Wei, Y ., Bai, S.: Stratified rule-aware network for abstract visual reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 1567–1574 (2021)

  48. [48]

    In: Proceedings of the IEEE international conference on computer vision

    Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normaliza- tion. In: Proceedings of the IEEE international conference on computer vision. pp. 1501–1510 (2017)

  49. [49]

    Computing in Science and Engineering , keywords =

    Hunter, J.D.: Matplotlib: A 2d graphics environment. Computing in Science & Engineering 9(3), 90–95 (2007). https://doi.org/10.1109/MCSE.2007.55

  50. [50]

    In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS)

    Islam, M.J., Edge, C., Xiao, Y ., Luo, P., Mehtaz, M., Morse, C., Enan, S.S., Sattar, J.: Semantic segmentation of underwater imagery: Dataset and benchmark. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS). pp. 1769–1776. IEEE (2020)

  51. [51]

    org/ixi-dataset/,https://brain-development.org/ixi-dataset/

    IXI Project: Ixi dataset: Information extraction from images.https://brain-development. org/ixi-dataset/,https://brain-development.org/ixi-dataset/

  52. [52]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Jia, X., Zhu, C., Li, M., Tang, W., Zhou, W.: Llvip: A visible-infrared paired dataset for low-light vision. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3496–3504 (2021)

  53. [53]

    arXiv preprint arXiv:2401.04088 (2024)

    Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

  54. [54]

    arXiv preprint arXiv:2303.14969 (2023)

    Kim, D., Kim, J., Cho, S., Luo, C., Hong, S.: Universal few-shot learning of dense prediction tasks with visual token matching. arXiv preprint arXiv:2303.14969 (2023)

  55. [55]

    Advances in neural information processing systems25(2012)

    Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems25(2012)

  56. [56]

    In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)

    Kuzdeuov, A., Taratynova, D., Tleuliyev, A., Varol, H.A.: Openthermalpose: An open-source annotated thermal human pose dataset and initial yolov8-pose baselines. In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). pp. 1–8. IEEE (2024)

  57. [57]

    IEEE Transactions on Biometrics, Behavior, and Identity Science (2025)

    Kuzdeuov, A., Zakaryanov, M., Tleuliyev, A., Varol, H.A.: Openthermalpose2: Extending the open-source annotated thermal human pose dataset with more data, subjects, and poses. IEEE Transactions on Biometrics, Behavior, and Identity Science (2025)

  58. [58]

    Current Opinion in Behavioral Sciences29, 97–104 (2019)

    Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: The omniglot challenge: a 3-year progress report. Current Opinion in Behavioral Sciences29, 97–104 (2019)

  59. [59]

    In: Workshop on challenges in representation learning, ICML

    Lee, D.H., et al.: Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on challenges in representation learning, ICML. vol. 3, p. 896. Atlanta (2013)

  60. [60]

    arXiv preprint arXiv:2602.03210 (2026)

    Li, Z., Duan, Z., Ye, J., Chen, C., Chen, D., Li, Y ., Chen, Y .: Viral: Visual in-context reasoning via analogy in diffusion transformers. arXiv preprint arXiv:2602.03210 (2026)

  61. [61]

    In: European conference on computer vision

    Lin, T.Y ., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

  62. [62]

    Neurocomputing338, 139–153 (2019)

    Liu, Y ., Yao, J., Lu, X., Xie, R., Li, L.: Deepcrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing338, 139–153 (2019)

  63. [63]

    In: Proceedings of the IEEE conference on computer vision and pattern recognition

    Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)

  64. [64]

    In: Proceedings of the Annual Meeting of the Cognitive Science Society

    Lu, H., Liu, Q., Ichien, N., Yuille, A.L., Holyoak, K.J.: Seeing the meaning: Vision meets semanticsin solving pictorial analogy problems. In: Proceedings of the Annual Meeting of the Cognitive Science Society. vol. 41 (2019) 13

  65. [65]

    https:// github.com/pytorch/vision(2016)

    maintainers, T., contributors: Torchvision: Pytorch’s computer vision library. https:// github.com/pytorch/vision(2016)

  66. [66]

    https://doi.org/10.5281/zenodo.18944560, https://doi.org/10.5281/zenodo.18944560

    Markiewicz, C.J., Brett, M., Hanke, M., Côté, M.A., Cipollini, B., McCarthy, P., Papadopou- los Orfanos, D., Jarecka, D., Cheng, C.P., Larson, E., Halchenko, Y .O., Cottaar, M., Ghosh, S., Wassermann, D., Gerhard, S., Lee, G.R., Baratz, Z., Moloney, B., Wang, H.T., Kastman, E., Kaczmarzyk, J., Guidotti, R., Daniel, J., Duek, O., Rokem, A., Scheltienne, M....

  67. [67]

    In: van der Walt, S., Millman, J

    McKinney, W.: Data structures for statistical computing in python. In: van der Walt, S., Millman, J. (eds.) Proceedings of the 9th Python in Science Conference. pp. 56–61. Proceedings of the Python in Science Conference, SciPy (2010). https://doi.org/10.25080/Majora-92bf1922- 00a

  68. [68]

    Annals of the New York Academy of Sciences1505(1), 79–101 (2021)

    Mitchell, M.: Abstraction and analogy-making in artificial intelligence. Annals of the New York Academy of Sciences1505(1), 79–101 (2021)

  69. [69]

    In: International Workshop on Efficient Medical Artificial Intelligence

    Negrini, A., Reiß, S.: Conquering the retina: Bringing visual in-context learning to oct. In: International Workshop on Efficient Medical Artificial Intelligence. pp. 21–30. Springer (2025)

  70. [70]

    In: Proceedings of the IEEE/CVF International Conference on Computer Vision

    Oorloff, T., Sindagi, V ., Bandara, W.G.C., Shafahi, A., Ghiasi, A., Prakash, C., Ardekani, R.: Stable diffusion models are secretly good at visual in-context learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23604–23613 (2025)

  71. [71]

    In: Wallach, H., Larochelle, H., Beygelzimer, A., d'Alché-Buc, F., Fox, E., Garnett, R

    Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H....

  72. [72]

    Journal of Machine Learning Research12, 2825–2830 (2011)

    Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V ., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V ., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research12, 2825–2830 (2011)

  73. [73]

    In: Proceedings of the IEEE/CVF international conference on computer vision

    Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

  74. [74]

    arXiv preprint arXiv:2604.11998 (2026)

    Qiu, X., Fu, Y ., Geng, J., Ren, B., Pan, J., Wu, Z., Tang, H., Fu, Y ., Timofte, R., Sebe, N., et al.: The second challenge on cross-domain few-shot object detection at ntire 2026: Methods and results. arXiv preprint arXiv:2604.11998 (2026)

  75. [75]

    OpenAI blog1(8), 9 (2019)

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog1(8), 9 (2019)

  76. [76]

    British Journal of Medical Psy- chology (1941) 14

    Raven, J.C.: Standardization of progressive matrices, 1938. British Journal of Medical Psy- chology (1941) 14

  77. [77]

    In: International conference on learning representations (2017)

    Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: International conference on learning representations (2017)

  78. [78]

    Reiß, S., Marinov, Z., Jaus, A., Seibold, C., Sarfraz, M.S., Rodner, E., Stiefelhagen, R.: Is visual in-context learning for compositional medical tasks within reach? In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2642–2652 (2025)

  79. [79]

    In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

  80. [80]

    In: International Conference on Medical image computing and computer- assisted intervention

    Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer- assisted intervention. pp. 234–241. Springer (2015)

Showing first 80 references.