Show Me Examples: Inferring Visual Concepts from Image Sets

Björn Ommer; Josh Susskind; Kolja Bauer; Miguel Angel Bautista; Nick Stracke; Stefan Andreas Baumann

arxiv: 2607.02402 · v1 · pith:WK3MFAJFnew · submitted 2026-07-02 · 💻 cs.CV

Nick Stracke, Kolja Bauer, Stefan Andreas Baumann, Miguel Angel Bautista, Josh Susskind, Björn Ommer

Pith reviewed 2026-07-03 15:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords visual concept inference · image sets · vision-language models · concept embeddings · image generation · VICIS task

The pith

A training framework lets models infer shared visual concepts from small image sets and generate new images that apply the concept to a query.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision-language models follow text instructions but fail to infer concepts from visual examples alone. The paper defines the VICIS task: given a context set of images sharing a concept and one query image, generate outputs that keep the concept while matching the query. It proposes a training framework and architecture that learn to extract concept-specific embeddings from image sets. Experiments on synthetic data and large-scale ImageNet collections show the model produces more accurate and diverse generations than current VLMs and extends to unseen concepts and sketch inputs.

Core claim

The paper claims that state-of-the-art vision-language models perform poorly at inferring visual concepts from image sets, often ignoring the visual context or defaulting to biased generations. It further claims that a proposed training framework and architecture, which learn to infer visual concepts from image sets and to extract concept-specific embeddings from queries, generate more accurate and diverse outputs and generalize to unseen concepts and to modalities such as sketches.

What carries the argument

The VICIS task, which requires generating images that preserve a context-defined concept from a set of examples while remaining consistent with a query image, supported by an architecture that extracts concept-specific embeddings.

If this is right

Vision-language models could perform visual-only reasoning tasks without needing textual descriptions of the concept.
Generation quality improves in both accuracy to the inferred concept and output diversity.
The same embedding extraction supports generalization to concepts and input modalities absent from training.
Models become less prone to ignoring visual context or falling back on training biases.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Interactive systems could let users supply image examples to teach new visual concepts on the fly.
The embedding approach might transfer to video sequences or 3D shapes by treating them as extended image sets.
Few-shot visual classification could improve by treating each class as a concept inferred from an example set.

Load-bearing premise

The training framework and architecture can reliably extract and apply concept-specific embeddings from image sets.

What would settle it

Run the model on held-out image sets that define a single clear visual concept and check whether its generated outputs preserve that concept at a higher rate than standard vision-language models while still matching the query image.
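The check described above can be sketched as a simple rate computation. This is a minimal illustration, not the paper's evaluation protocol: it assumes some external classifier (not shown) has already assigned a concept label to each generated image, and the names `concept_preservation_rate`, `compare_models`, `model_a`, and `model_b` are hypothetical. The labels are made-up toy inputs, not results from the paper.

```python
def concept_preservation_rate(query_label, generated_labels):
    """Fraction of generated images whose concept label matches the query's label."""
    if not generated_labels:
        raise ValueError("need at least one generated label")
    hits = sum(1 for label in generated_labels if label == query_label)
    return hits / len(generated_labels)


def compare_models(query_label, labels_by_model):
    """Compute the preservation rate separately for each model's generations."""
    return {
        name: concept_preservation_rate(query_label, labels)
        for name, labels in labels_by_model.items()
    }


# Toy labels (assumption): what a classifier might report for four
# generations per model when the query instantiates the concept as "circle".
labels_by_model = {
    "model_a": ["circle", "circle", "circle", "square"],
    "model_b": ["square", "circle", "triangle", "square"],
}
rates = compare_models("circle", labels_by_model)
print(rates)  # {'model_a': 0.75, 'model_b': 0.25}
```

A complete evaluation would also need a separate measure of consistency with the query and of output diversity, which this sketch does not attempt.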

Figures

Figures reproduced from arXiv: 2607.02402 by Björn Ommer, Josh Susskind, Kolja Bauer, Miguel Angel Bautista, Nick Stracke, Stefan Andreas Baumann.

**Figure 2.** Typical VLM failure cases on our VICIS task.

**Figure 3.** Model Overview. Given a set of images that share a concept (here, shape), the set learner predicts a concept space that captures all possible instantiations of that visual concept. We project the embedded query image in that concept space to obtain its concrete instantiation (here, circle) and remove all other concepts that cannot be represented in that space. Finally, the diffusion model is conditioned on…

**Figure 4.** Instantiation Illustration. Given a query image $\mathbf{x}^{\mathrm{query}}$, we encode it individually using the same pretrained encoder $E$, but utilize only the CLS token embedding $\mathbf{e}^{\mathrm{query}}$. To isolate the concept-specific information identified by the Set Learner, we project this embedding onto the embedding space spanned by the concept directi…
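The projection step named in the Figure 4 caption can be illustrated with plain linear algebra. The sketch below assumes the projection is an ordinary orthogonal projection onto the linear span of the concept directions; the truncated caption does not confirm the exact form, so treat this as one plausible reading. The helper names (`dot`, `orthonormalize`, `project_onto_span`) and the 3-dimensional toy vectors are hypothetical.

```python
import math


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def orthonormalize(directions, eps=1e-12):
    """Modified Gram-Schmidt: turn concept directions into an orthonormal basis."""
    basis = []
    for d in directions:
        r = list(d)
        for b in basis:
            c = dot(r, b)
            r = [ri - c * bi for ri, bi in zip(r, b)]
        norm = math.sqrt(dot(r, r))
        if norm > eps:  # skip directions that are already in the span
            basis.append([ri / norm for ri in r])
    return basis


def project_onto_span(embedding, directions):
    """Orthogonal projection of an embedding onto span(directions)."""
    projected = [0.0] * len(embedding)
    for b in orthonormalize(directions):
        c = dot(embedding, b)
        projected = [p + c * bi for p, bi in zip(projected, b)]
    return projected


# Toy example (assumption): two concept directions spanning the x-y plane.
directions = [[1.0, 0.0, 0.0], [1.0, 1.0, 0.0]]
e_query = [2.0, 3.0, 5.0]
e_concept = project_onto_span(e_query, directions)
print(e_concept)  # [2.0, 3.0, 0.0]
```

In this toy case the third coordinate, which the concept directions cannot represent, is discarded. That mirrors the Figure 3 caption's description of removing concepts that cannot be represented in the concept space.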

**Figure 5.** Visualization of our toy dataset and results.

**Figure 6.** Hierarchical Concept Spaces. A simplified visualization of our ImageNet hierarchy, which is loosely based on the WordNet hierarchy. We define a concept space with respect to the next hierarchy level. The Dc(animals) concept space differentiates roughly between different types of animals whereas Dc(felines) is more granular and operates on a lower level of the hierarchy.
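The idea of defining a concept space "with respect to the next hierarchy level" can be sketched on a toy tree. This is only an illustration under stated assumptions: the `PARENT` table is a hypothetical hand-written hierarchy, not the paper's ImageNet/WordNet tree, and `instantiate` is a hypothetical helper that returns discrete labels, whereas the paper works with learned embeddings.

```python
# Hypothetical toy hierarchy (child -> parent); not the paper's actual tree.
PARENT = {
    "animal": None,
    "feline": "animal",
    "canine": "animal",
    "tiger": "feline",
    "lion": "feline",
    "wolf": "canine",
}


def ancestors(node):
    """Path from a node up to the root, starting with the node itself."""
    path = []
    while node is not None:
        path.append(node)
        node = PARENT[node]
    return path


def instantiate(leaf, concept_root):
    """Child of concept_root on the leaf's path to the root.

    Returns None when the leaf does not lie strictly below concept_root.
    """
    path = ancestors(leaf)
    if concept_root not in path or path[0] == concept_root:
        return None
    return path[path.index(concept_root) - 1]


print(instantiate("tiger", "animal"))  # feline
print(instantiate("tiger", "feline"))  # tiger
print(instantiate("wolf", "feline"))   # None
```

The same query resolves to a coarser label under the animal-level space and to a finer one under the feline-level space. Unlike this sketch, which returns `None` for a query outside the concept space, the Figure 7 caption reports that the model aligns such a query with the closest point on the learned manifold.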

**Figure 7.** When a query does not fully match the concept space defined by the context set, it aligns with the closest point on the learned manifold. Although trained only on ImageNet, the model generalizes to sketch inputs by extrapolating semantically. This applies to both query and context images. In the last two rows, we show how different hierarchy levels affect generation: projecting a tiger onto the feline spac…

**Figure 8.** We show how the VICIS task can be used for more applications, such as learning transformations. Here, the context set gives multiple examples of the desired transformation, which can then be applied to the query image. We show generations for held-out query images.

[Plot, apparently belonging to Figure 9: Avg. Fisher's Discriminant Ratio (score) versus number of samples on a log-scale axis, with series LDA, PCA, and Ours.]

**Figure 9.** Class Separation (↑). Our predicted directions better separate concepts with significantly fewer samples. Since our model predicts an embedding of the instantiated query concept, we can analyze the structure and expressiveness of that embedding space to better understand how the model solves the task. As described in Sec. 3.2, our method predicts concept directions inferred by the Set Learner. This approach…
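The class-separation score that this page associates with Figure 9 is an averaged Fisher's discriminant ratio. The excerpt does not give the paper's exact definition or averaging scheme, so the sketch below uses one common two-class form, (difference of means)² / (sum of variances) along a single direction, averaged over all class pairs. The function names and the 2-D toy embeddings are hypothetical.

```python
from itertools import combinations
from statistics import fmean, pvariance


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def fisher_ratio(a, b):
    """Two-class Fisher ratio for 1-D samples: (mean gap)^2 / (sum of variances)."""
    spread = pvariance(a) + pvariance(b)
    if spread == 0:
        raise ValueError("both classes have zero variance along this direction")
    gap = fmean(a) - fmean(b)
    return gap * gap / spread


def avg_fisher_ratio(embeddings_by_class, direction):
    """Project every embedding onto `direction`, then average over class pairs."""
    projected = {
        label: [dot(e, direction) for e in embeddings]
        for label, embeddings in embeddings_by_class.items()
    }
    ratios = [
        fisher_ratio(projected[c1], projected[c2])
        for c1, c2 in combinations(projected, 2)
    ]
    return fmean(ratios)


# Toy embeddings (assumption): the classes separate along x but not along y.
data = {
    "circle": [[0.0, 0.0], [1.0, 4.0]],
    "square": [[4.0, 1.0], [5.0, 3.0]],
}
print(avg_fisher_ratio(data, [1.0, 0.0]))  # 32.0
print(avg_fisher_ratio(data, [0.0, 1.0]))  # 0.0
```

A direction that separates the classes well scores high, while one along which the class means coincide scores zero. This is the sense in which candidate directions, for example from LDA, PCA, or a learned predictor, could be compared.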

Original abstract

Vision-language models (VLMs) can follow complex textual instructions, yet they struggle to reason from purely visual context. In particular, current models fail to infer shared concepts from sets of example images and apply them to new inputs. We introduce Visual Concept Inference from Sets (VICIS), a task that evaluates this capability. Given a small context set of images sharing a concept and a query image, the model must generate new images that preserve the context-defined concept while remaining consistent with the query. We show that state-of-the-art VLMs perform poorly on this task, often ignoring the visual context or defaulting to biased generations. To address this gap, we propose a training framework and architecture that learn to infer visual concepts from image sets and extract concept-specific embeddings from queries. Experiments on synthetic data and large-scale ImageNet/WordNet data show that our model generates more accurate and diverse outputs and generalizes to unseen concepts and modalities such as sketches.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper defines a new task, VICIS, for inferring visual concepts from small image sets and proposes a training setup that, by the authors' account, beats standard VLMs at generating concept-consistent outputs.

The main takeaway is that current VLMs are weak at extracting a shared visual concept from a handful of example images and then applying it to a new query image. The authors turn this into an explicit task called VICIS and show that off-the-shelf models often ignore the examples or default to biased generations.

What is new is the task framing itself plus an architecture that learns concept-specific embeddings from the set and conditions generation on both the concept and the query. They train on synthetic data and large ImageNet/WordNet collections, and report that the resulting model produces more accurate and diverse images while generalizing to unseen concepts and to sketches.

The work does a clean job of identifying a concrete failure mode and giving it a measurable form. The generalization claims to new concepts and modalities are the part that could matter for downstream use.

The soft spot is that the abstract supplies no numbers, no baseline details, and no description of the architecture or loss, so the size of the improvement and the fairness of the comparison stay unclear. If the full paper has solid quantitative results, ablations, and controls on data splits, that would tighten the case.

This is aimed at people working on visual reasoning and few-shot adaptation in VLMs. It has enough of a concrete task and reported gains to merit a serious referee, even if the numbers need close checking.

Referee Report

2 major / 1 minor

Summary. The paper introduces the Visual Concept Inference from Sets (VICIS) task, in which a model is given a small set of images sharing an implicit visual concept plus a query image and must generate new images that preserve the concept while remaining consistent with the query. It reports that current VLMs perform poorly on VICIS, often ignoring the visual context, and proposes a training framework together with an architecture that learns to extract concept-specific embeddings from image sets. Experiments on synthetic data and large-scale ImageNet/WordNet data are claimed to show that the proposed model produces more accurate and diverse outputs and generalizes to unseen concepts and to other modalities such as sketches.

Significance. If the quantitative claims hold, the work would be significant for vision-language modeling because it directly targets the gap between text-instruction following and visual-context reasoning. The introduction of the VICIS benchmark and a method for learning concept-specific embeddings from sets could stimulate further research on few-shot visual concept acquisition and cross-modal generalization. The use of both controlled synthetic data and large-scale ImageNet/WordNet hierarchies strengthens the potential impact provided the reported gains are substantial, reproducible, and accompanied by appropriate baselines.

major comments (2)

[Abstract] The central claim that the proposed model 'generates more accurate and diverse outputs' is stated without any quantitative metrics, baselines, tables, or experimental protocol, so the data-to-claim link cannot be evaluated. This is load-bearing for the paper's main empirical contribution.
[Abstract / Experiments] The statement that the model 'generalizes to unseen concepts and modalities such as sketches' is presented without any description of the held-out splits, evaluation protocol, or quantitative results that would allow verification of the generalization claim.

minor comments (1)

[Abstract] The abstract would benefit from a one-sentence description of the key architectural component that enables concept-specific embedding extraction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback. We agree that the abstract requires strengthening to make the empirical claims verifiable and will revise it accordingly in the next version.

Referee: [Abstract] The central claim that the proposed model 'generates more accurate and diverse outputs' is stated without any quantitative metrics, baselines, tables, or experimental protocol, so the data-to-claim link cannot be evaluated. This is load-bearing for the paper's main empirical contribution.

Authors: We agree the abstract should link claims to evidence. The revised abstract will include specific quantitative results (e.g., accuracy and diversity gains on synthetic and ImageNet/WordNet benchmarks), name the main baselines, and reference the relevant tables and evaluation protocol.

Revision: yes

Referee: [Abstract / Experiments] The statement that the model 'generalizes to unseen concepts and modalities such as sketches' is presented without any description of the held-out splits, evaluation protocol, or quantitative results that would allow verification of the generalization claim.

Authors: We will revise the abstract to briefly specify the held-out concept splits, the cross-modal evaluation protocol (including sketches), and report the corresponding quantitative generalization results.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity in derivation chain

The paper introduces the VICIS task and a training framework/architecture for concept inference from image sets. No equations, parameter fits, or self-citations appear in the abstract or described content that would reduce any claimed prediction or result to its own inputs by construction. Experimental claims on synthetic and ImageNet/WordNet data are presented as independent evaluations rather than tautological outputs. The derivation chain is self-contained with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.1-grok · 5707 in / 951 out tokens · 27450 ms · 2026-07-03T15:09:33.880322+00:00 · methodology


[1] [1]

fal.ai.https://fal.ai/(2025), accessed: 2025-09-25 12, 2

2025

[2] [2]

Advances in neural information processing systems35, 23716– 23736 (2022) 1

Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., et al.: Flamingo: a visual language model for few-shot learning. Advances in neural information processing systems35, 23716– 23736 (2022) 1

2022

[3] [3]

Stochastic Interpolants: A Unifying Framework for Flows and Diffusions

Albergo, M.S., Boffi, N.M., Vanden-Eijnden, E.: Stochastic interpolants: A unifying framework for flows and diffusions. arXiv preprint arXiv:2303.08797 (2023) 7, 2, 5

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Bai, Y., Geng, X., Mangalam, K., Bar, A., Yuille, A.L., Darrell, T., Malik, J., Efros, A.A.: Sequential modeling enables scalable learning for large vision models. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 22861–22872 (2024) 3

2024

[5] [5]

Advances in Neural Information Processing Systems 36, 63758–63778 (2023) 3

Balazevic, I., Steiner, D., Parthasarathy, N., Arandjelović, R., Henaff, O.: Towards in-context scene understanding. Advances in Neural Information Processing Systems 36, 63758–63778 (2023) 3

2023

[6] [6]

Advances in Neural Information Processing Systems35, 25005–25017 (2022) 3, 8, 10, 5

Bar, A., Gandelsman, Y., Darrell, T., Globerson, A., Efros, A.: Visual prompting via image inpainting. Advances in Neural Information Processing Systems35, 25005–25017 (2022) 3, 8, 10, 5

2022

[7] [7]

1 kontext: Flow matching for in- context image generation and editing in latent space

Batifol, S., Blattmann, A., Boesel, F., Consul, S., Diagne, C., Dockhorn, T., English, J., English, Z., Esser, P., Kulal, S., et al.: Flux. 1 kontext: Flow matching for in- context image generation and editing in latent space. arXiv e-prints pp. arXiv–2506 (2025) 1, 8, 13

2025

[8] [8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Baumann, S.A., Krause, F., Neumayr, M., Stracke, N., Sevi, M., Hu, V.T., Ommer, B.: Continuous, subject-specific attribute control in t2i models by identifying semantic directions. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13231–13241 (June 2025) 5

2025

[9] [9]

Advances in neural information processing systems33, 1877–1901 (2020) 3

Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J.D., Dhariwal, P., Nee- lakantan, A., Shyam, P., Sastry, G., Askell, A., et al.: Language models are few-shot learners. Advances in neural information processing systems33, 1877–1901 (2020) 3

1901

[10] [10]

In: Proceedings of the IEEE/CVF international conference on computer vision

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 9650–9660 (2021) 6, 2

2021

[11] [11]

Advances in neural information processing systems31(2018) 4

Chen, R.T., Li, X., Grosse, R.B., Duvenaud, D.K.: Isolating sources of disentan- glement in variational autoencoders. Advances in neural information processing systems31(2018) 4

2018

[12] [12]

Janus-Pro: Unified Multimodal Understanding and Generation with Data and Model Scaling

Chen, X., Wu, Z., Liu, X., Pan, Z., Liu, W., Xie, Z., Yu, X., Ruan, C.: Janus-pro: Unified multimodal understanding and generation with data and model scaling. arXiv preprint arXiv:2501.17811 (2025) 1

work page internal anchor Pith review Pith/arXiv arXiv 2025

[13] [13]

On the Measure of Intelligence

Chollet, F.: On the measure of intelligence. arXiv preprint arXiv:1911.01547 (2019) 5

work page internal anchor Pith review Pith/arXiv arXiv 1911

[14] Darcet, T., Oquab, M., Mairal, J., Bojanowski, P.: Vision transformers need registers. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=2dnO3LLiJ1 2, 6

[15] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 3, 8, 10

[16] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009). https://doi.org/10.1109/CVPR.2009.5206848 10

[17] Dong, Q., Li, L., Dai, D., Zheng, C., Ma, J., Li, R., Xia, H., Xu, J., Wu, Z., Chang, B., Sun, X., Li, L., Sui, Z.: A survey on in-context learning. In: Al-Onaizan, Y., Bansal, M., Chen, Y.N. (eds.) Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing. pp. 1107–1128. Association for Computational Linguistics, Miami, Flori...

[18] Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021), https://openreview.net/forum?id=YicbFdNTTy 2

[19] Fortin, A., Vernade, G., Kampf, K., Reshi, A.: Introducing gemini 2.5 flash image, our state-of-the-art image model (Aug 2025), https://developers.googleblog.com/en/introducing-gemini-2-5-flash-image/, accessed: 2025-09-25 1, 8, 13, 3

[20] Gandikota, R., Materzyńska, J., Zhou, T., Torralba, A., Bau, D.: Concept sliders: Lora adaptors for precise control in diffusion models. In: European Conference on Computer Vision. pp. 172–188. Springer (2024) 5

[21] Gandikota, R., Wu, Z., Zhang, R., Bau, D., Shechtman, E., Kolkin, N.: Sliderspace: Decomposing the visual capabilities of diffusion models. arXiv preprint arXiv:2502.01639 (2025) 5

[22] Goetschalckx, L., Andonian, A., Oliva, A., Isola, P.: Ganalyze: Toward visual definitions of cognitive image properties. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 5744–5753 (2019) 4

[23] Gong, Y., Song, Y., Li, Y., Li, C., Zhang, Y.: Relationadapter: Learning and transferring visual relation with diffusion transformers. In: The Thirty-ninth Annual Conference on Neural Information Processing Systems (2026), https://openreview.net/forum?id=DOb47fj0cl 13

[24] Higgins, I., Matthey, L., Pal, A., Burgess, C., Glorot, X., Botvinick, M., Mohamed, S., Lerchner, A.: beta-VAE: Learning basic visual concepts with a constrained variational framework. In: International Conference on Learning Representations (2017), https://openreview.net/forum?id=Sy2fzU9gl 4

[25] Huang, R., Wang, C., Yang, J., Lu, G., Yuan, Y., Han, J., Hou, L., Zhang, W., Hong, L., Zhao, H., et al.: Illume+: Illuminating unified mllm with dual visual tokenization and diffusion refinement. arXiv preprint arXiv:2504.01934 (2025) 3, 8, 10

[26] Johnson, J., Hariharan, B., Van Der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., Girshick, R.: Clevr: A diagnostic dataset for compositional language and elementary visual reasoning. In: Proceedings of the IEEE conference on computer vision and pattern recognition. pp. 2901–2910 (2017) 5

[27] Karras, T., Laine, S., Aila, T.: A style-based generator architecture for generative adversarial networks. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 4401–4410 (2019) 4

[28] Khemakhem, I., Kingma, D., Monti, R., Hyvarinen, A.: Variational autoencoders and nonlinear ica: A unifying framework. In: International conference on artificial intelligence and statistics. pp. 2207–2217. PMLR (2020) 4

[29] Kingma, D.P., Ba, J.: Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 (2014) 2

[30] Lipman, Y., Chen, R.T.Q., Ben-Hamu, H., Nickel, M., Le, M.: Flow matching for generative modeling. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=PqvMRDCJT9t 7

[31] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. Advances in neural information processing systems 36, 34892–34916 (2023) 1

[32] Liu, X., Gong, C., Liu, Q.: Flow straight and fast: Learning to generate and transfer data with rectified flow. In: The Eleventh International Conference on Learning Representations (2023), https://openreview.net/forum?id=XVjTT1nw5z 7

[33] Locatello, F., Bauer, S., Lucic, M., Raetsch, G., Gelly, S., Schölkopf, B., Bachem, O.: Challenging common assumptions in the unsupervised learning of disentangled representations. In: International conference on machine learning. pp. 4114–4124. PMLR (2019) 4

[34] Miller, G.A.: WordNet: A lexical database for English. In: Human Language Technology: Proceedings of a Workshop held at Plainsboro, New Jersey, March 8-11, 1994 (1994), https://aclanthology.org/H94-1111/ 8, 9

[35] Motamed, S., Culp, L., Swersky, K., Jaini, P., Geirhos, R.: Do generative video models understand physical principles? arXiv preprint arXiv:2501.09038 (2025) 5

[36] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193 (2023) 6

[37] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021) 6

[38] Wah, C., Branson, S., Welinder, P., Perona, P., Belongie, S.: Caltech-ucsd birds-200-2011. Tech. Rep. CNS-TR-2011-001, California Institute of Technology (2011) 5

[39] Wang, H., Ge, S., Lipton, Z., Xing, E.P.: Learning robust global representations by penalizing local predictive power. Advances in neural information processing systems 32 (2019) 10, 12

[40] Wu, C., Li, J., Zhou, J., Lin, J., Gao, K., Yan, K., Yin, S.m., Bai, S., Xu, X., Chen, Y., et al.: Qwen-image technical report. arXiv preprint arXiv:2508.02324 (2025) 1, 13

[41] Xu, J., Gandelsman, Y., Bar, A., Yang, J., Gao, J., Darrell, T., Wang, X.: Improv: Inpainting-based multimodal prompting for computer vision tasks. arXiv preprint arXiv:2312.01771 (2023) 3

[42] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025) 8

[43] Yang, M., Liu, F., Chen, Z., Shen, X., Hao, J., Wang, J.: Causalvae: Structured causal disentanglement in variational autoencoder. arXiv preprint arXiv:2004.08697 (2020) 4

Show Me Examples: Inferring Visual Concepts from Image Sets

Supplementary Material

A Multiple Shared Concepts

Real-world scenarios often involve context sets with mul...