Quo Vadis, Visual In-Context Learning? A Unified Benchmark Across Domains and Tasks

Alexander Jaus; Jiale Wei; Pradnya Halady; Simon Reiß; Zdravko Marinov

arxiv: 2606.10967 · v1 · pith:ZV3P6ULFnew · submitted 2026-06-09 · 💻 cs.CV


Pith reviewed 2026-06-27 13:23 UTC · model grok-4.3

classification 💻 cs.CV

keywords: visual in-context learning · benchmark · one-shot evaluation · domain shift · task generalization · model adaptation · failure modes · computer vision


The pith

A broad benchmark shows visual in-context models have limited ability to adapt to new image domains and tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper constructs a benchmark called VIBE to test how visual in-context learning models perform when given new image distributions and task types that differ from their pre-training data. It runs six models through 14 datasets and 12 tasks, covering 106 combinations, all under a single one-shot evaluation protocol. This reveals consistent limitations and failure patterns that narrower tests had missed. A sympathetic reader would care because the core promise of in-context learning is test-time adaptation without further training, and the results question how far current models reach that goal.

Core claim

The paper establishes that visual in-context learning models exhibit limited adaptation capabilities when evaluated on diverse new imaging domains and a wide range of tasks under a unified one-shot protocol, uncovering systematic failure modes not visible in prior narrow setups that reused pre-training distributions.

What carries the argument

The VIBE benchmark, which organizes 106 dataset-task combinations under a reproducible one-shot evaluation protocol to measure adaptation to new distributions.
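
A minimal sketch of what a one-shot evaluation loop over dataset-task combinations could look like. This is not the VIBE toolkit's API, which this page does not describe; `Example`, `evaluate_one_shot`, `offset_copy_model`, and `neg_mae` are hypothetical names chosen for illustration, and plain lists of floats stand in for images. The assumption is that every query is scored after the model has seen exactly one (input, target) demonstration drawn from the same dataset-task pair.

```python
import random
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    """One (input, target) pair; lists of floats stand in for images."""
    image: Sequence[float]
    target: Sequence[float]


# Hypothetical interfaces, assumed for this sketch only.
# Model: (support image, support target, query image) -> prediction.
Model = Callable[[Sequence[float], Sequence[float], Sequence[float]], Sequence[float]]
Metric = Callable[[Sequence[float], Sequence[float]], float]


def evaluate_one_shot(
    model: Model,
    combos: Dict[Tuple[str, str], List[Example]],
    metric: Metric,
    seed: int = 0,
) -> Dict[Tuple[str, str], float]:
    """Return the mean metric per (dataset, task) key under a one-shot protocol."""
    rng = random.Random(seed)  # fixed seed keeps the support choice reproducible
    scores: Dict[Tuple[str, str], float] = {}
    for key, examples in combos.items():
        if len(examples) < 2:
            raise ValueError(f"{key} needs at least one support and one query example")
        per_query = []
        for i, query in enumerate(examples):
            # Exactly one demonstration, never the query itself.
            j = rng.choice([k for k in range(len(examples)) if k != i])
            support = examples[j]
            prediction = model(support.image, support.target, query.image)
            per_query.append(metric(prediction, query.target))
        scores[key] = mean(per_query)
    return scores


def neg_mae(pred: Sequence[float], target: Sequence[float]) -> float:
    """Negative mean absolute error, so that higher is better."""
    return -mean(abs(p - t) for p, t in zip(pred, target))


def offset_copy_model(s_img, s_tgt, q_img):
    """Toy stand-in model: infer a constant offset from the single demonstration."""
    offset = mean(t - s for s, t in zip(s_img, s_tgt))
    return [q + offset for q in q_img]
```

Keeping the model behind a single callable and keying results by (dataset, task) means all models are scored with the same support sampling, which is one plausible reading of a "unified" protocol.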

If this is right

Current visual in-context models require targeted improvements to handle shifts in both image appearance and task structure.
Evaluation of these models must use out-of-distribution datasets to avoid overestimating adaptation ability.
Observed failure modes point to concrete areas where model design can be refined for better generalization.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Wider use of cross-domain benchmarks like this one could guide development of more robust test-time adaptation methods.
The same stress-testing approach could be applied to in-context methods in other data modalities to check for parallel limits.
Experiments that vary the number of context examples beyond one shot could test whether the identified failures are fixed by additional demonstrations.

Load-bearing premise

The chosen datasets and tasks are new enough relative to the models' pre-training data that performance differences reflect genuine adaptation rather than recall.

What would settle it

If the six models achieved accuracy on the VIBE tasks comparable to what they achieve on tasks drawn from their original pre-training distributions, the claim of limited adaptation to new distributions would be weakened.

Figures

Figures reproduced from arXiv: 2606.10967 by Alexander Jaus, Jiale Wei, Pradnya Halady, Simon Reiß, Zdravko Marinov.

**Figure 1.** Overview of our Visual In-Context BEnchmark (VIBE) with diverse tasks and domains.

**Figure 2.** Visualization of task encodings in the VIBE toolkit. Perception & Localization, Restoration & Enhancement, and Manipulation & Transformation tasks tinted in their respective colors.

**Figure 3.** Quantitative results for semantic segmentation in color-aware mIoU (for DUKE foreground …

**Figure 4.** Quantitative results for image restoration tasks. We report PSNR for inpainting, denoising …

**Figure 5.** Quantitative results for modality transfer in PSNR, and rotation and flipping in SSIM.

**Figure 6.** Qualitative results for subset of 14 task-dataset pairs in …

**Figure 7.** Qualitative examples for Painter on NYU-Depth V2, prompted with the colorization task.

**Figure 8.** Qualitative samples for tasks segmentation and depth estimation.

**Figure 9.** Qualitative samples for edge detection, object detection and keypoint detection.

**Figure 10.** Qualitative samples for the colorization task covering different domains.

**Figure 11.** Qualitative samples for the denoising task covering different domains.

**Figure 12.** Qualitative samples for the inpainting task covering different domains.

**Figure 13.** Qualitative samples for different artifact removal modes in CDD [36].

**Figure 14.** Qualitative samples for modality transfer, rotation, flip and object removal.
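
The figure captions above report results in PSNR and mIoU. As a reading aid, here is a minimal sketch of the textbook definitions of both metrics on flattened images. It is an illustration under stated assumptions, not the benchmark's implementation: the function names `psnr` and `mean_iou` are hypothetical, pixel values are assumed to lie in [0, `max_value`], and the "color-aware" mIoU variant named in the Figure 3 caption is not reproduced because this page does not define it. SSIM is omitted for brevity.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak signal-to-noise ratio in dB: 10 * log10(max_value^2 / MSE)."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_iou(pred: Sequence[int], target: Sequence[int], num_classes: int) -> float:
    """Mean intersection-over-union over classes present in either label map."""
    if len(pred) != len(target) or not pred:
        raise ValueError("inputs must be non-empty and of equal length")
    ious = []
    for c in range(num_classes):
        inter = sum(1 for p, t in zip(pred, target) if p == c and t == c)
        union = sum(1 for p, t in zip(pred, target) if p == c or t == c)
        if union > 0:  # skip classes absent from both prediction and target
            ious.append(inter / union)
    if not ious:
        raise ValueError("no class in range(num_classes) appears in the inputs")
    return sum(ious) / len(ious)
```

For example, `mean_iou([0, 0, 1, 1], [0, 1, 1, 1], num_classes=2)` averages an IoU of 1/2 for class 0 and 2/3 for class 1. Skipping classes that appear in neither map is a common convention, but it is a design choice that changes the reported number, so comparisons across papers should check which convention was used.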

Original abstract

Visual in-context learning has been proposed as a pathway towards dynamic models that can generate predictions based on a provided context and thereby can adapt to new vision tasks at test-time. Yet, the evaluation of the adaptation capabilities of these models has been limited to narrow setups that mainly mirror tasks or image domains from pre-training for which real adaptation is not required. We address this gap by constructing a broad Visual In-Context BEnchmark (VIBE) with a focus on diverse imaging domains and a wide range of tasks. With this, we are able to get a much clearer picture of the adaptive capabilities of visual in-context models when faced with new image and task distributions. We stress test six models on 14 datasets and 12 tasks (in total, we explore 106 dataset-task combinations) and compare them under a unified, reproducible evaluation protocol, in a one-shot setting. Our evaluation uncovers key insights on the state of visual in-context learning, including limitations, systematic failure modes and promising directions. To foster broader evaluation, we will openly release our VIBE toolkit.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

VIBE gives a wider test bed for visual in-context learning than prior narrow setups, but the results on adaptation are hard to interpret without checks that the 14 datasets sit outside the models' pre-training distributions.


The paper's main contribution is the VIBE benchmark: 14 datasets, 12 tasks, 106 one-shot combinations, run on six models under one protocol. That scale and the explicit push beyond the usual narrow task/image overlap with pre-training is the new part. The abstract makes clear they want to measure real adaptation rather than recall, and they plan to release the toolkit, which is useful for anyone working on dynamic vision models.

What stands out is the coverage across domains and the reproducible setup. Prior work stayed close to pre-training distributions, so this broader sweep can surface systematic failure modes that narrower tests miss. If the full paper delivers clean code and clear selection rules, that alone makes it worth having around.

The soft spot is the missing verification on the central assumption. The stress-test note flags that we need evidence the chosen datasets and tasks are genuinely new distributions relative to the six models' training data. The abstract states the motivation but gives no overlap stats, corpus comparisons, or explicit OOD justification. If the methods section does the same, the headline results on adaptation limits and promising directions rest on an untested premise. That does not kill the benchmark, but it does limit how strongly we can read the performance numbers as evidence of true in-context ability.

This is a benchmark paper aimed at the visual in-context learning community. Readers who care about evaluation standards will get value from the protocol and the released toolkit even if they later add their own distribution checks. It is coherent on its own terms and shows honest engagement with the gap in prior evaluations, so it deserves a serious referee rather than a desk reject. I would send it out, with a request that the authors add the pre-training overlap analysis if it is not already there.

Referee Report

2 major / 2 minor

Summary. The paper introduces the Visual In-Context BEnchmark (VIBE) to address limitations in prior evaluations of visual in-context learning, which often overlap with pre-training distributions. It constructs a benchmark spanning 14 datasets and 12 tasks (106 dataset-task combinations), evaluates six models in a unified one-shot protocol, and reports insights on limitations, systematic failure modes, and promising directions, with plans to release the VIBE toolkit.

Significance. If the datasets satisfy the OOD condition, the work provides a more rigorous assessment of genuine adaptation in visual in-context models than prior narrow setups. The unified reproducible protocol, broad coverage across domains and tasks, and commitment to open-sourcing the toolkit are clear strengths that would support future research.

major comments (2)

[§3 and §4] §3 (Dataset Curation) and §4 (Evaluation Protocol): The central claim that measured one-shot performance reflects genuine adaptation (rather than recall) requires the 14 datasets to lie outside the pre-training distributions of the six models. The manuscript motivates this via diverse domains but supplies no overlap statistics, pre-training corpus comparisons, or explicit OOD verification for the chosen datasets and tasks; this verification is load-bearing for interpreting the 106 combinations as testing adaptation.
[§5] §5 (Results and Analysis): The reported insights on systematic failure modes and promising directions are presented without quantitative breakdowns (e.g., per-model error patterns or statistical significance tests across the 106 combinations) that would allow readers to assess whether the observed limitations are robust or driven by a subset of tasks.

minor comments (2)

[Abstract and §4] The abstract and introduction use 'one-shot setting' inconsistently with the later description of the protocol; clarify whether the protocol is strictly one-shot or few-shot.
[Table 1] Table 1 (or equivalent dataset summary) should include explicit columns for domain diversity metrics or pre-training overlap indicators to make the selection criteria transparent.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which helps strengthen the claims regarding genuine adaptation and the robustness of the reported insights. We address each major comment below.


Referee: [§3 and §4] §3 (Dataset Curation) and §4 (Evaluation Protocol): The central claim that measured one-shot performance reflects genuine adaptation (rather than recall) requires the 14 datasets to lie outside the pre-training distributions of the six models. The manuscript motivates this via diverse domains but supplies no overlap statistics, pre-training corpus comparisons, or explicit OOD verification for the chosen datasets and tasks; this verification is load-bearing for interpreting the 106 combinations as testing adaptation.

Authors: We agree that explicit OOD verification is important for interpreting the results as genuine adaptation. The 14 datasets were deliberately chosen from domains (medical imaging, remote sensing, microscopy, and other specialized scientific imaging) that are known to have minimal overlap with common pre-training corpora such as ImageNet, LAION, or COCO. However, the manuscript does not include quantitative overlap statistics or direct corpus comparisons. In revision, we will add an appendix providing dataset provenance details, references to known pre-training data distributions, and any available overlap analyses or citations that support the OOD status of these tasks and domains.

Revision: yes

Referee: [§5] §5 (Results and Analysis): The reported insights on systematic failure modes and promising directions are presented without quantitative breakdowns (e.g., per-model error patterns or statistical significance tests across the 106 combinations) that would allow readers to assess whether the observed limitations are robust or driven by a subset of tasks.

Authors: The current manuscript already reports full performance tables and figures covering all 106 dataset-task combinations, which form the basis for the discussed patterns and failure modes. To increase rigor, we will augment Section 5 with additional quantitative elements: per-model aggregated statistics across task categories, breakdowns of error patterns (e.g., by failure type), and statistical significance tests (such as paired comparisons or non-parametric tests) across the combinations to confirm that the observed limitations are not driven by outliers.

Revision: yes
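
The rebuttal mentions paired comparisons and non-parametric tests across the dataset-task combinations without naming a specific procedure. One option that fits that description is a paired sign-flip permutation test on per-combination score differences between two models. The sketch below is an illustration of that generic technique, not the authors' planned analysis; `paired_sign_flip_test` and its parameters are hypothetical names, and it assumes both models were scored on the same combinations in the same order with a metric where the sign of the difference is meaningful.

```python
import random
from typing import Sequence


def paired_sign_flip_test(
    scores_a: Sequence[float],
    scores_b: Sequence[float],
    n_resamples: int = 10000,
    seed: int = 0,
) -> float:
    """Two-sided Monte Carlo p-value for a mean paired difference of zero.

    Under the null hypothesis the sign of each paired difference is
    exchangeable, so signs are flipped at random and the resulting mean
    differences are compared with the observed one.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("score lists must be non-empty and of equal length")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    observed = abs(sum(diffs) / n)
    rng = random.Random(seed)  # fixed seed for a reproducible p-value
    hits = 0
    for _ in range(n_resamples):
        flipped = sum(d if rng.random() < 0.5 else -d for d in diffs)
        # Small tolerance guards against floating-point ties.
        if abs(flipped / n) >= observed - 1e-12:
            hits += 1
    # Add-one correction keeps the estimate strictly positive.
    return (hits + 1) / (n_resamples + 1)
```

Because metrics such as PSNR and mIoU live on different scales, a test like this would most naturally be run within one metric family at a time, or on rank-transformed scores, rather than pooled across all combinations.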

Circularity Check

0 steps flagged

No circularity: empirical benchmark with no derivations or self-referential reductions


The paper constructs and applies the VIBE benchmark to evaluate six models across 106 dataset-task combinations in a one-shot setting. No equations, fitted parameters, or derivations are present. The central claims rest on direct experimental comparisons under a unified protocol rather than any quantity that reduces to its own inputs by construction. Dataset selection is motivated by the goal of testing adaptation outside pre-training distributions, but this is an empirical assumption (not a derived result) and does not create circularity. No self-citation chains or ansatzes are invoked to justify core results. The work is self-contained as a benchmark study.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper introduces no new mathematical entities or fitted parameters; the central claim rests on empirical evaluation under standard machine-learning assumptions about one-shot protocols and dataset diversity.

axioms (1)

Domain assumption: One-shot evaluation is a valid protocol for measuring adaptation to new distributions.
The entire evaluation is performed in the one-shot setting as stated in the abstract.



Reference graph

Works this paper leans on

111 extracted references · 6 canonical work pages

[1] Akkaynak, D., Treibitz, T.: Sea-thru: A method for removing water from underwater images. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1682–1691 (2019)
[2] Alaluf, Y., Garibi, D., Patashnik, O., Averbuch-Elor, H., Cohen-Or, D.: Cross-image attention for zero-shot appearance transfer. In: ACM SIGGRAPH 2024 conference papers. pp. 1–12 (2024)
[3] Anis, A.M., Ali, H., Sarfraz, S.: On the limitations of vision-language models in understanding image transforms. arXiv preprint arXiv:2503.09837 (2025)
[4] Arazo, E., Ortego, D., Albert, P., O’Connor, N.E., McGuinness, K.: Pseudo-labeling and confirmation bias in deep semi-supervised learning. In: 2020 International joint conference on neural networks (IJCNN). pp. 1–8. IEEE (2020)
[5] Bai, Y., Geng, X., Mangalam, K., Bar, A., Yuille, A.L., Darrell, T., Malik, J., Efros, A.A.: Sequential modeling enables scalable learning for large vision models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22861–22872 (2024)
[6] Bao, H., Dong, L., Piao, S., Wei, F.: Beit: Bert pre-training of image transformers. arXiv preprint arXiv:2106.08254 (2021)
[7] Bar, A., Gandelsman, Y., Darrell, T., Globerson, A., Efros, A.: Visual prompting via image inpainting. Advances in neural information processing systems 35, 25005–25017 (2022)
[8] Barrett, D., Hill, F., Santoro, A., Morcos, A., Lillicrap, T.: Measuring abstract reasoning in neural networks. In: International conference on machine learning. pp. 511–520. PMLR (2018)
[9] Bengio, Y., Ducharme, R., Vincent, P., Jauvin, C.: A neural probabilistic language model. Journal of machine learning research 3(Feb), 1137–1155 (2003)
[10] Bitton, Y., Yosef, R., Strugo, E., Shahaf, D., Schwartz, R., Stanovsky, G.: Vasr: Visual analogies of situation recognition. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 37, pp. 241–249 (2023)
[11] Blumenstiel, B., Jakubik, J., Kühne, H., Vössing, M.: What a mess: Multi-domain evaluation of zero-shot semantic segmentation. Advances in Neural Information Processing Systems 36, 73299–73311 (2023)
[12] Born, F., Neuhäuser, T., Muttenthaler, L., Roads, B.D., Spitzer, B., Lampinen, A.K., Jones, M., Müller, K.R., Mozer, M.C.: Context sensitivity improves human-machine visual alignment. arXiv preprint arXiv:2604.13883 (2026)
[13] Bradski, G.: The OpenCV Library. Dr. Dobb’s Journal of Software Tools (2000)
[14] Bratulić, J., Mittal, S., Hoffmann, D.T., Böhm, S., Schirrmeister, R.T., Ball, T., Rupprecht, C., Brox, T.: Unlocking in-context learning for natural datasets beyond language modelling. In: DAGM German Conference on Pattern Recognition. pp. 303–319. Springer (2025)
[15] Brooks, T., Holynski, A., Efros, A.A.: Instructpix2pix: Learning to follow image editing instructions. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18392–18402 (2023)
[16] Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems 33, 1877–1901 (2020)
[17] Caruana, R.: Multitask learning. Machine learning 28(1), 41–75 (1997)
[18] Chan, S., Santoro, A., Lampinen, A., Wang, J., Singh, A., Richemond, P., McClelland, J., Hill, F.: Data distributional properties drive emergent in-context learning in transformers. Advances in neural information processing systems 35, 18878–18891 (2022)
[19] Chen, T., Kornblith, S., Norouzi, M., Hinton, G.: A simple framework for contrastive learning of visual representations. In: International conference on machine learning. pp. 1597–1607. PMLR (2020)
[20] Cohen, N., Newman, Y., Shamir, A.: Semantic segmentation in art paintings. In: Computer graphics forum. vol. 41, pp. 261–275. Wiley Online Library (2022)
[21] Collette, A.: Python and HDF5. O’Reilly (2013)
[22] Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., Schiele, B.: The cityscapes dataset for semantic urban scene understanding. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3213–3223 (2016)
[23] Czolbe, S., Dalca, A.V.: Neuralizer: General neuroimage analysis without re-training. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 6217–6230 (2023)
[24] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE conference on computer vision and pattern recognition. pp. 248–255. IEEE (2009)
[25] Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: Decaf: A deep convolutional activation feature for generic visual recognition. In: International conference on machine learning. pp. 647–655. PMLR (2014)
[26] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021....
[27] Erfani, S.M.H., Wu, Z., Wu, X., Wang, S., Goharian, E.: Atlantis: A benchmark for semantic segmentation of waterbody images. Environmental Modelling & Software 149, 105333 (2022)
[28] Esser, P., Kulal, S., Blattmann, A., Entezari, R., Müller, J., Saini, H., Levi, Y., Lorenz, D., Sauer, A., Boesel, F., et al.: Scaling rectified flow transformers for high-resolution image synthesis. In: Forty-first international conference on machine learning (2024)
[29] Esser, P., Rombach, R., Ommer, B.: Taming transformers for high-resolution image synthesis. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 12873–12883 (2021)
[30] Everingham, M., Van Gool, L., Williams, C.K., Winn, J., Zisserman, A.: The pascal visual object classes (voc) challenge. International journal of computer vision 88(2), 303–338 (2010)
[31] Finn, C., Abbeel, P., Levine, S.: Model-agnostic meta-learning for fast adaptation of deep networks. In: International conference on machine learning. pp. 1126–1135. PMLR (2017)
[32] Foundation, I.W.: Ai-generated child sexual abuse: 2026 report on trends, data & human impact. https://www.iwf.org.uk/about-us/why-we-exist/our-research/how-ai-is-being-abused-to-create-child-sexual-abuse-imagery/ (2026), accessed: 2026-05-05
[33] Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dreamsim: Learning new dimensions of human visual similarity using synthetic data. arXiv preprint arXiv:2306.09344 (2023)
[34] Girshick, R.: Fast r-cnn. In: Proceedings of the IEEE international conference on computer vision. pp. 1440–1448 (2015)
[35] Gu, Z., Yang, S., Liao, J., Huo, J., Gao, Y.: Analogist: Out-of-the-box visual in-context learning with image diffusion model. ACM Transactions on Graphics (TOG) 43(4), 1–15 (2024)
[36] Guo, Y., Gao, Y., Lu, Y., Zhu, H., Liu, R.W., He, S.: Onerestore: A universal restoration framework for composite degradation. In: European conference on computer vision. pp. 255–272. Springer (2024)
[37] Guo, Y., Codella, N.C., Karlinsky, L., Codella, J.V., Smith, J.R., Saenko, K., Rosing, T., Feris, R.: A broader study of cross-domain few-shot learning. In: European conference on computer vision. pp. 124–141. Springer (2020)
[38] Hao, Z., Guo, J., Wang, C., Tang, Y., Wu, H., Hu, H., Han, K., Xu, C.: Data-efficient large vision models through sequential autoregression. In: Salakhutdinov, R., Kolter, Z., Heller, K.A., Weller, A., Oliver, N., Scarlett, J., Berkenkamp, F. (eds.) Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. Pr...
[39] Harris, C.R., Millman, K.J., van der Walt, S.J., Gommers, R., Virtanen, P., Cournapeau, D., Wieser, E., Taylor, J., Berg, S., Smith, N.J., Kern, R., Picus, M., Hoyer, S., van Kerkwijk, M.H., Brett, M., Haldane, A., del Río, J.F., Wiebe, M., Peterson, P., Gérard-Marchant, P., Sheppard, K., Reddy, T., Weckesser, W., Abbasi, H., Gohlke, C., Oliphant, T.E.: A... doi:10.1038/s41586-020-2649-2 (2020)
[40] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)
[41] Hertzmann, A., Jacobs, C.E., Oliver, N., Curless, B., Salesin, D.H.: Image analogies. In: Proceedings of the 28th Annual Conference on Computer Graphics and Interactive Techniques. p. 327–340. SIGGRAPH ’01, Association for Computing Machinery, New York, NY, USA (2001). https://doi.org/10.1145/383259.383295
[42] Hill, F., Santoro, A., Barrett, D.G., Morcos, A.S., Lillicrap, T.: Learning to make analogies by contrasting abstract relational structure. arXiv preprint arXiv:1902.00120 (2019)
[43] Hinton, G.E., Vinyals, O., Dean, J.: Distilling the knowledge in a neural network. CoRR (2015), http://arxiv.org/abs/1503.02531
[44] Ho, J., Jain, A., Abbeel, P.: Denoising diffusion probabilistic models. Advances in neural information processing systems 33, 6840–6851 (2020)
[45] Hofstadter, D.R., et al.: Analogy as the core of cognition. The analogical mind: Perspectives from cognitive science pp. 499–538 (2001)
[46] Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W.: LoRA: Low-rank adaptation of large language models. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=nZeVKeeFYf9
[47] Hu, S., Ma, Y., Liu, X., Wei, Y., Bai, S.: Stratified rule-aware network for abstract visual reasoning. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 35, pp. 1567–1574 (2021)
[48] Huang, X., Belongie, S.: Arbitrary style transfer in real-time with adaptive instance normalization. In: Proceedings of the IEEE international conference on computer vision. pp. 1501–1510 (2017)
[49] Hunter, J.D.: Matplotlib: A 2d graphics environment. Computing in Science & Engineering 9(3), 90–95 (2007). https://doi.org/10.1109/MCSE.2007.55
[50] Islam, M.J., Edge, C., Xiao, Y., Luo, P., Mehtaz, M., Morse, C., Enan, S.S., Sattar, J.: Semantic segmentation of underwater imagery: Dataset and benchmark. In: 2020 IEEE/RSJ international conference on intelligent robots and systems (IROS). pp. 1769–1776. IEEE (2020)
[51]

org/ixi-dataset/,https://brain-development.org/ixi-dataset/

IXI Project: Ixi dataset: Information extraction from images.https://brain-development. org/ixi-dataset/,https://brain-development.org/ixi-dataset/
[52]

In: Proceedings of the IEEE/CVF international conference on computer vision

Jia, X., Zhu, C., Li, M., Tang, W., Zhou, W.: Llvip: A visible-infrared paired dataset for low-light vision. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 3496–3504 (2021)

2021
[53]

arXiv preprint arXiv:2401.04088 (2024)

Jiang, A.Q., Sablayrolles, A., Roux, A., Mensch, A., Savary, B., Bamford, C., Chaplot, D.S., Casas, D.d.l., Hanna, E.B., Bressand, F., et al.: Mixtral of experts. arXiv preprint arXiv:2401.04088 (2024)

Pith/arXiv arXiv 2024
[54]

arXiv preprint arXiv:2303.14969 (2023)

Kim, D., Kim, J., Cho, S., Luo, C., Hong, S.: Universal few-shot learning of dense prediction tasks with visual token matching. arXiv preprint arXiv:2303.14969 (2023)

arXiv 2023
[55]

Advances in neural information processing systems25(2012)

Krizhevsky, A., Sutskever, I., Hinton, G.E.: Imagenet classification with deep convolutional neural networks. Advances in neural information processing systems25(2012)

2012
[56]

In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG)

Kuzdeuov, A., Taratynova, D., Tleuliyev, A., Varol, H.A.: Openthermalpose: An open-source annotated thermal human pose dataset and initial yolov8-pose baselines. In: 2024 IEEE 18th International Conference on Automatic Face and Gesture Recognition (FG). pp. 1–8. IEEE (2024)

2024
[57]

IEEE Transactions on Biometrics, Behavior, and Identity Science (2025)

Kuzdeuov, A., Zakaryanov, M., Tleuliyev, A., Varol, H.A.: Openthermalpose2: Extending the open-source annotated thermal human pose dataset with more data, subjects, and poses. IEEE Transactions on Biometrics, Behavior, and Identity Science (2025)

[58] Lake, B.M., Salakhutdinov, R., Tenenbaum, J.B.: The omniglot challenge: a 3-year progress report. Current Opinion in Behavioral Sciences 29, 97–104 (2019)

[59] Lee, D.H., et al.: Pseudo-label: The simple and efficient semi-supervised learning method for deep neural networks. In: Workshop on challenges in representation learning, ICML. vol. 3, p. 896. Atlanta (2013)

[60] Li, Z., Duan, Z., Ye, J., Chen, C., Chen, D., Li, Y., Chen, Y.: Viral: Visual in-context reasoning via analogy in diffusion transformers. arXiv preprint arXiv:2602.03210 (2026)

[61] Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

[62] Liu, Y., Yao, J., Lu, X., Xie, R., Li, L.: Deepcrack: A deep hierarchical feature learning architecture for crack segmentation. Neurocomputing 338, 139–153 (2019)

[63] Long, J., Shelhamer, E., Darrell, T.: Fully convolutional networks for semantic segmentation. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 3431–3440 (2015)

[64] Lu, H., Liu, Q., Ichien, N., Yuille, A.L., Holyoak, K.J.: Seeing the meaning: Vision meets semantics in solving pictorial analogy problems. In: Proceedings of the Annual Meeting of the Cognitive Science Society. vol. 41 (2019)

[65] TorchVision maintainers and contributors: Torchvision: Pytorch’s computer vision library. https://github.com/pytorch/vision (2016)

[66] Markiewicz, C.J., Brett, M., Hanke, M., Côté, M.A., Cipollini, B., McCarthy, P., Papadopoulos Orfanos, D., Jarecka, D., Cheng, C.P., Larson, E., Halchenko, Y.O., Cottaar, M., Ghosh, S., Wassermann, D., Gerhard, S., Lee, G.R., Baratz, Z., Moloney, B., Wang, H.T., Kastman, E., Kaczmarzyk, J., Guidotti, R., Daniel, J., Duek, O., Rokem, A., Scheltienne, M.... https://doi.org/10.5281/zenodo.18944560

[67] McKinney, W.: Data structures for statistical computing in python. In: van der Walt, S., Millman, J. (eds.) Proceedings of the 9th Python in Science Conference. pp. 56–61. Proceedings of the Python in Science Conference, SciPy (2010). https://doi.org/10.25080/Majora-92bf1922-00a

[68] Mitchell, M.: Abstraction and analogy-making in artificial intelligence. Annals of the New York Academy of Sciences 1505(1), 79–101 (2021)

[69] Negrini, A., Reiß, S.: Conquering the retina: Bringing visual in-context learning to oct. In: International Workshop on Efficient Medical Artificial Intelligence. pp. 21–30. Springer (2025)

[70] Oorloff, T., Sindagi, V., Bandara, W.G.C., Shafahi, A., Ghiasi, A., Prakash, C., Ardekani, R.: Stable diffusion models are secretly good at visual in-context learning. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 23604–23613 (2025)

[71] Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J., Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga, L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L., Bai, J., Chintala, S.: Pytorch: An imperative style, high-performance deep learning library. In: Wallach, H., Larochelle, H....

[72] Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., Duchesnay, E.: Scikit-learn: Machine learning in Python. Journal of Machine Learning Research 12, 2825–2830 (2011)

[73] Peebles, W., Xie, S.: Scalable diffusion models with transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 4195–4205 (2023)

[74] Qiu, X., Fu, Y., Geng, J., Ren, B., Pan, J., Wu, Z., Tang, H., Fu, Y., Timofte, R., Sebe, N., et al.: The second challenge on cross-domain few-shot object detection at ntire 2026: Methods and results. arXiv preprint arXiv:2604.11998 (2026)

[75] Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al.: Language models are unsupervised multitask learners. OpenAI blog 1(8), 9 (2019)

[76] Raven, J.C.: Standardization of progressive matrices, 1938. British Journal of Medical Psychology (1941)

[77] Ravi, S., Larochelle, H.: Optimization as a model for few-shot learning. In: International conference on learning representations (2017)

[78] Reiß, S., Marinov, Z., Jaus, A., Seibold, C., Sarfraz, M.S., Rodner, E., Stiefelhagen, R.: Is visual in-context learning for compositional medical tasks within reach? In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 2642–2652 (2025)

[79] Rombach, R., Blattmann, A., Lorenz, D., Esser, P., Ommer, B.: High-resolution image synthesis with latent diffusion models. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10684–10695 (2022)

[80] Ronneberger, O., Fischer, P., Brox, T.: U-net: Convolutional networks for biomedical image segmentation. In: International Conference on Medical image computing and computer-assisted intervention. pp. 234–241. Springer (2015)
