arxiv: 2605.02892 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.IR

Recognition: 2 theorem links

· Lean Theorem

AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

Yu-Ju Tsai , Brian Price , Qing Liu , Luis Figueroa , Daniil Pakhomov , Zhihong Ding , Scott Cohen , Ming-Hsuan Yang

Authors on Pith no claims yet

Pith reviewed 2026-05-08 18:09 UTC · model grok-4.3

classification 💻 cs.CV cs.IR

keywords personalized image completionimage inpaintingreference-based generationvision-language modelsalbum retrievalidentity preservationoccluded image restoration

0 comments

The pith

AlbumFill retrieves identity-consistent references from personal albums using vision-language model inferences to complete occluded personal photos without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AlbumFill is a training-free framework for personalized image completion that restores occluded regions in personal photos while preserving the person's identity and appearance. It addresses the shortcomings of generic inpainting models, which often change identities, and methods that assume suitable reference images are already provided by automatically searching the user's personal photo album. A vision-language model first infers the missing semantic cues from the occluded image to guide retrieval of matching references, which are then passed to reference-based completion models for the final output. The authors introduce a new dataset of 54,000 human-centric samples with associated albums to evaluate this approach. Experiments show that identity-consistent reference retrieval is essential for effective personalized completion.

Core claim

The central claim is that a vision-language model can infer missing semantic cues from an occluded image to retrieve identity-consistent references from a personal album, enabling reference-based models to perform personalized image completion without any training or explicit reference provision.

What carries the argument

The AlbumFill pipeline: VLM-based inference of semantic cues from the occluded image to perform composed retrieval of matching references from the personal album, followed by reference-based inpainting.

If this is right

Suitable references for completion can be located automatically within personal albums instead of requiring explicit user provision.
Identity consistency in the completed images improves because the retrieved references depict the same individual.
The new 54K-sample dataset enables standardized testing of album-guided personalized completion methods.
Generic inpainting baselines struggle with identity preservation, underscoring the value of targeted reference retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could extend to completing sequences in personal videos by treating album frames as references.
Advances in vision-language model accuracy would directly boost the reliability of cue inference and reference selection.
Personal photo collections function as user-specific data sources for consistency without needing model retraining.

Load-bearing premise

A vision-language model can accurately infer the missing semantic cues from the occluded image to guide effective retrieval of suitable references from the album.

What would settle it

An evaluation where VLM-inferred cues are replaced with incorrect ones, and the resulting completions show no improvement in identity preservation metrics over generic inpainting on the 54K dataset.

Figures

Figures reproduced from arXiv: 2605.02892 by Brian Price, Daniil Pakhomov, Luis Figueroa, Ming-Hsuan Yang, Qing Liu, Scott Cohen, Yu-Ju Tsai, Zhihong Ding.

**Figure 1.** Figure 1: Comparison between previous reference-based image completion and view at source ↗

**Figure 2.** Figure 2: Data generation pipeline for constructing our Album Dataset. view at source ↗

**Figure 3.** Figure 3: AlbumFill system overview. (a) Given a masked target image, a VisionLanguage Model (VLM) performs masked visual reasoning to generate a textual hypothesis describing the likely content behind the masked region. (b) The reasoning text and visible context are used to compose a multimodal query that retrieves the most semantically aligned and identity-consistent reference image from the user’s personal albu… view at source ↗

**Figure 4.** Figure 4: Visual comparison with different categories of completion methods view at source ↗

read the original abstract

Personalized image completion aims to restore occluded regions in personal photos while preserving identity and appearance. Existing methods either rely on generic inpainting models that often fail to maintain identity consistency, or assume that suitable reference images are explicitly provided. In practice, suitable references are often not explicitly provided, requiring the system to search for identity-consistent images within personal photo collections. We present AlbumFill, a training-free framework that retrieves identity-consistent references from personal albums for personalized completion. Given an occluded image and a personal album, a vision-language model infers missing semantic cues to guide composed image retrieval, and the retrieved references are used by reference-based completion models. To facilitate this task, we introduce a dataset containing 54K human-centric samples with associated album images. Experiments across multiple baselines demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval. Project Page: https://liagm.github.io/AlbumFill/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AlbumFill sets up a training-free pipeline for album-guided personalized image completion with a new dataset, but leaves the VLM cue inference step without direct validation.

read the letter

The main thing to know is that this paper describes a training-free way to do personalized image completion: a VLM infers semantic cues from an occluded photo, those cues drive retrieval of matching images from a personal album, and the retrieved references then feed into a reference-based completion model. They also release a 54K-sample dataset of human-centric occluded images paired with album photos. This is new in framing the problem as retrieval from an existing album rather than assuming a reference is already given or falling back to generic inpainting. The pipeline chains off-the-shelf VLMs and completion models without new training, which keeps it practical for consumer photo tools. The dataset is a useful concrete contribution for benchmarking identity-preserving edits. The work does a reasonable job highlighting why identity consistency matters and why standard approaches fall short in real album scenarios. Experiments are said to show the task is difficult and that consistent retrieval helps, which matches the motivation. The soft spot is the untested assumption that the VLM produces accurate enough cues. The stress-test note is on point here: if the model misses clothing, pose, or background details, retrieval will not deliver useful references and the completion step cannot preserve identity. The abstract gives no isolated metrics on cue quality against human annotations or oracle comparisons, so end-to-end results alone do not confirm the inference step works reliably. Quantitative details and baseline numbers are also thin in the provided summary. This paper is for researchers working on image editing, retrieval-augmented vision, or personalized AI tools. Someone building or evaluating consumer photo features would get value from the task setup and dataset even if the method needs tightening. I would send it to peer review. The motivation and dataset are solid enough to warrant referee time, though revisions should add direct tests of the cue inference component.

Referee Report

3 major / 2 minor

Summary. The paper presents AlbumFill, a training-free framework for personalized image completion that, given an occluded personal photo and a user's album, employs a vision-language model to infer missing semantic cues from the occluded image, performs composed image retrieval to select identity-consistent reference images from the album, and feeds those references into reference-based completion models to restore the occluded regions while preserving identity and appearance. The authors introduce a new dataset of 54K human-centric samples with associated album images and report experiments across baselines that illustrate the task's difficulty and the importance of identity-consistent retrieval.

Significance. If the core retrieval step proves reliable, the work addresses a practical limitation in existing inpainting and reference-based completion methods by automating the discovery of suitable personal references without requiring explicit user-provided images or task-specific training. The training-free design that chains off-the-shelf VLMs and completion models is a notable strength, and the introduced dataset could become a useful benchmark for future personalized completion research.

major comments (3)

[Abstract and §4] Abstract and §4 (Experiments): the claim that experiments 'demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval' is not supported by any reported quantitative metrics, baselines, or ablation numbers in the provided abstract or high-level description; without these, the central empirical claim cannot be evaluated.
[§3] §3 (Method): the framework's training-free claim rests on the untested assumption that a VLM, given only the occluded image, produces semantic cues accurate enough for effective composed retrieval of identity-consistent album images. No direct metrics (cue precision/recall vs. human annotations, oracle-cue ablations, or failure-case analysis) isolate this inference step, which is load-bearing for downstream identity preservation.
[§4] §4 (Experiments): the manuscript does not report end-to-end quantitative results (e.g., identity similarity scores, perceptual metrics, or comparisons against generic inpainting and non-album baselines) that would show whether the retrieved references actually improve completion quality over alternatives.

minor comments (2)

[Dataset] The dataset description should include more detail on how album images were paired with occluded queries and any filtering criteria applied to the 54K samples.
[Figures] Ensure all figures clearly label the occluded input, inferred cues, retrieved references, and final output for each example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline targeted revisions to strengthen the empirical grounding of the claims.

read point-by-point responses

Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that experiments 'demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval' is not supported by any reported quantitative metrics, baselines, or ablation numbers in the provided abstract or high-level description; without these, the central empirical claim cannot be evaluated.

Authors: We agree that the abstract and the high-level summary of §4 do not include specific quantitative numbers or ablation details. Although the full experiments section reports comparisons across multiple baselines, we will revise the abstract to incorporate key quantitative results (e.g., identity similarity improvements) and expand §4 with explicit ablation tables and baseline numbers to directly support the claims regarding task difficulty and the value of identity-consistent retrieval. revision: yes
Referee: [§3] §3 (Method): the framework's training-free claim rests on the untested assumption that a VLM, given only the occluded image, produces semantic cues accurate enough for effective composed retrieval of identity-consistent album images. No direct metrics (cue precision/recall vs. human annotations, oracle-cue ablations, or failure-case analysis) isolate this inference step, which is load-bearing for downstream identity preservation.

Authors: The referee is correct that we have not isolated the VLM cue inference step with direct metrics. We will add an oracle-cue ablation study in the revised §4, comparing end-to-end retrieval and completion performance when using VLM-inferred cues versus ground-truth cues. We will also include qualitative failure-case analysis for the cue inference process. Direct precision/recall against human annotations is not currently available and would require new labeling effort; we will note this as a limitation and potential future direction. revision: partial
Referee: [§4] §4 (Experiments): the manuscript does not report end-to-end quantitative results (e.g., identity similarity scores, perceptual metrics, or comparisons against generic inpainting and non-album baselines) that would show whether the retrieved references actually improve completion quality over alternatives.

Authors: We thank the referee for highlighting this gap. While baseline comparisons are present, we will substantially expand §4 to report full end-to-end quantitative results. These will include identity similarity scores (using standard face recognition embeddings), perceptual metrics (e.g., LPIPS and FID), and direct comparisons against generic inpainting models as well as non-album reference baselines. This will clearly quantify the benefit of the album-guided, identity-consistent retrieval. revision: yes

Circularity Check

0 steps flagged

No circularity: framework chains external VLMs and completion models without self-referential reductions

full rationale

The paper describes a training-free pipeline: VLM infers semantic cues from occluded images, guides composed retrieval from the album, then applies reference-based completion. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. The new dataset (54K samples) and baseline experiments serve as external validation rather than tautological definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The method relies on off-the-shelf components whose behavior is independent of the present work, satisfying the criteria for a self-contained non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on pre-trained vision-language models and reference-based completion models; no new free parameters, axioms beyond standard domain assumptions, or invented entities are introduced.

axioms (1)

domain assumption Vision-language models can infer missing semantic cues from occluded personal images to guide retrieval
This inference step is presented as the mechanism for finding suitable album references.

pith-pipeline@v0.9.0 · 5479 in / 1189 out tokens · 46456 ms · 2026-05-08T18:09:15.389736+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

62 extracted references · 16 canonical work pages · 11 internal anchors

[1]

GPT-4 Technical Report

Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 4

work page internal anchor Pith review arXiv 2023
[2]

TPAMI (2025) 9, 11

Agnolucci, L., Baldrati, A., Del Bimbo, A., Bertini, M.: isearle: Improving textual inversion for zero-shot composed image retrieval. TPAMI (2025) 9, 11

2025
[3]

Qwen2.5-VL Technical Report

Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 2, 3

work page Pith review arXiv 2025
[4]

In: ICCV (2023) 4

Baldrati, A., Agnolucci, L., Bertini, M., Del Bimbo, A.: Zero-shot composed image retrieval with textual inversion. In: ICCV (2023) 4

2023
[5]

In: CVPR (2024) 4

Cao,C.,Cai,Y.,Dong,Q.,Wang,Y.,Fu,Y.:Leftrefill:Fillingrightcanvasbasedon left reference through generalized text-to-image diffusion model. In: CVPR (2024) 4

2024
[6]

In: ICCV (2021) 9

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021) 9

2021
[7]

In: NeurIPS (2024) 2, 4, 9, 11, 12, 13

Chen, X., Feng, Y., Chen, M., Wang, Y., Zhang, S., Liu, Y., Shen, Y., Zhao, H.: Zero-shot image editing with reference imitation. In: NeurIPS (2024) 2, 4, 9, 11, 12, 13

2024
[8]

In: CVPR (2024) 4

Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., Zhao, H.: Anydoor: Zero-shot object-level image customization. In: CVPR (2024) 4

2024
[9]

In: CVPR (2025) 5, 9, 11, 14

Chen, X., Zhang, Z., Zhang, H., Zhou, Y., Kim, S.Y., Liu, Q., Li, Y., Zhang, J., Zhao, N., Wang, Y., et al.: Unireal: Universal image generation and editing via learning real-world dynamics. In: CVPR (2025) 5, 9, 11, 14

2025
[10]

In: CVPR (2024) 2, 3

Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR (2024) 2, 3

2024
[11]

Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025) 4, 10

work page Pith review arXiv 2025
[12]

In: NeurIPS (2023) 3

Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023) 3

2023
[13]

DeepMind, G.: Gemini 3 pro image model card (November 2025),https:// storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Image- Model-Card.pdf3, 5, 11, 12, 13

2025
[14]

Emerging Properties in Unified Multimodal Pretraining

Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 3, 5, 11, 12, 13

work page internal anchor Pith review arXiv 2025
[15]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recogni- tion at scale. arXiv preprint arXiv:2010.11929 (2020) 9

work page Pith review arXiv 2010
[16]

In: NeurIPS (2024) 9

Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dream- sim: Learning new dimensions of human visual similarity using synthetic data. In: NeurIPS (2024) 9

2024
[17]

The Llama 3 Herd of Models

Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Let- man, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024) 2, 3 16 Y.-J. Tsai et al

work page internal anchor Pith review arXiv 2024
[18]

In: CVPR (2024) 3, 4, 9, 11, 13

Gu, G., Chun, S., Kim, W., Kang, Y., Yun, S.: Language-only training of zero-shot composed image retrieval. In: CVPR (2024) 3, 4, 9, 11, 13

2024
[19]

In: ECCV (2024) 1

Gu, J., Zhao, N., Xiong, W., Liu, Q., Zhang, Z., Zhang, H., Zhang, J., Jung, H., Wang, Y., Wang, X.E.: Swapanything: Enabling arbitrary object swapping in personalized image editing. In: ECCV (2024) 1

2024
[20]

Guo, J., Deng, J., et al.: Insightface: State-of-the-art 2d and 3d face analysis project.https://github.com/deepinsight/insightface(2025) 5

2025
[21]

Hore, A., Ziou, D.: Image quality metrics: Psnr vs. ssim. In: ICPR (2010) 9

2010
[22]

In: ECCV (2024) 3, 4

Jang, Y.K., Huynh, D., Shah, A., Chen, W.K., Lim, S.N.: Spherical linear interpo- lation and text-anchoring for zero-shot composed image retrieval. In: ECCV (2024) 3, 4

2024
[23]

Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics yolov8 (2023) 5

2023
[24]

In: ICCV (2015) 1

Joon Oh, S., Benenson, R., Fritz, M., Schiele, B.: Person recognition in personal photo collections. In: ICCV (2015) 1

2015
[25]

In: ECCV (2024) 2, 11, 12, 13

Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: ECCV (2024) 2, 11, 12, 13

2024
[26]

In: ICLR (2024) 4, 9, 11

Karthik, S., Roth, K., Mancini, M., Akata, Z.: Vision-by-language for training-free compositional image retrieval. In: ICLR (2024) 4, 9, 11

2024
[27]

In: ICML (2023) 2, 3

Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: ICML (2023) 2, 3

2023
[28]

In: ACM SIGIR (2024) 3, 4

Lin, H., Wen, H., Song, X., Liu, M., Hu, Y., Nie, L.: Fine-grained textual inversion network for zero-shot composed image retrieval. In: ACM SIGIR (2024) 3, 4

2024
[29]

In: CVPR (2024) 3, 9, 10

Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024) 3, 9, 10

2024
[30]

In: NeurIPS (2023) 3

Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023) 3

2023
[31]

DeepSeek-VL: Towards Real-World Vision-Language Understanding

Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., et al.: Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525 (2024) 4

work page internal anchor Pith review arXiv 2024
[32]

In: CVPR (2022) 2

Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: Re- paint: Inpainting using denoising diffusion probabilistic models. In: CVPR (2022) 2

2022
[33]

In: CVPR (2024) 3

Mitra, C., Huang, B., Darrell, T., Herzig, R.: Compositional chain-of-thought prompting for large multimodal models. In: CVPR (2024) 3

2024
[34]

Kosmos-2: Grounding Multimodal Large Language Models to the World

Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos- 2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023) 3

work page internal anchor Pith review arXiv 2023
[35]

In: ICML (2021) 4, 9

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021) 4, 9

2021
[36]

In: CVPR (2023) 3, 4

Saito, K., Sohn, K., Zhang, X., Li, C.L., Lee, C.Y., Saenko, K., Pfister, T.: Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In: CVPR (2023) 3, 4

2023
[37]

In: NeurIPS (2024) 3

Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In: NeurIPS (2024) 3

2024
[38]

In: CVPR (2023) 4 AlbumFill17

Song, Y., Zhang, Z., Lin, Z., Cohen, S., Price, B., Zhang, J., Kim, S.Y., Aliaga, D.: Objectstitch: Object compositing with diffusion model. In: CVPR (2023) 4 AlbumFill17

2023
[39]

In: CVPR (2024) 4

Song, Y., Zhang, Z., Lin, Z., Cohen, S., Price, B., Zhang, J., Kim, S.Y., Zhang, H., Xiong, W., Aliaga, D.: Imprint: Generative object compositing by learning identity-preserving representation. In: CVPR (2024) 4

2024
[40]

arXiv preprint arXiv:2312.08924 (2023) 4

Sun, S., Ye, F., Gong, S.: Training-free zero-shot composed image retrieval with local concept reranking. arXiv preprint arXiv:2312.08924 (2023) 4

work page arXiv 2023
[41]

In: CVPR (2024) 3, 4

Suo, Y., Ma, F., Zhu, L., Yang, Y.: Knowledge-enhanced dual-stream zero-shot composed image retrieval. In: CVPR (2024) 3, 4

2024
[42]

In: WACV (2022) 2

Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V.: Resolution-robust large mask inpainting with fourier convolutions. In: WACV (2022) 2

2022
[43]

In: AAAI (2024) 3, 4

Tang, Y., Yu, J., Gai, K., Zhuang, J., Xiong, G., Hu, Y., Wu, Q.: Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed im- age retrieval. In: AAAI (2024) 3, 4

2024
[44]

In: CVPR (2025) 4

Tang, Y., Zhang, J., Qin, X., Yu, J., Gou, G., Xiong, G., Lin, Q., Rajmohan, S., Zhang, D., Wu, Q.: Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval. In: CVPR (2025) 4

2025
[45]

Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vin- cent,D.,Pan,Z.,Wang,S.,etal.:Gemini1.5:Unlockingmultimodalunderstanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024) 4

work page internal anchor Pith review arXiv 2024
[46]

Gemma 3 Technical Report

Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al.: Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025) 2, 3

work page internal anchor Pith review arXiv 2025
[47]

Communications of the ACM (2016) 1, 5

Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: Yfcc100m: The new data in multimedia research. Communications of the ACM (2016) 1, 5

2016
[48]

In: ACM MM (2025) 5

Tian, X., Li, W., Xu, B., Yuan, Y., Wang, Y., Shen, H.: Mige: Mutually enhanced multimodal instruction-based image generation and editing. In: ACM MM (2025) 5

2025
[49]

In: ICCV (2025) 1, 2, 4, 9, 11, 12, 13

Tsai, Y.J., Price, B., Liu, Q., Figueroa, L., Pakhomov, D., Ding, Z., Cohen, S., Yang, M.H.: Completeme: Reference-based human image completion. In: ICCV (2025) 1, 2, 4, 9, 11, 12, 13

2025
[50]

Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024) 2, 3

work page Pith review arXiv 2024
[51]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 2, 3, 9, 10

work page internal anchor Pith review arXiv 2025
[52]

In: CVPR (2016) 4, 5

Wang, Y., Lin, Z., Shen, X., Mech, R., Miller, G., Cottrell, G.W.: Event-specific image importance. In: CVPR (2016) 4, 5

2016
[53]

TIP (2004) 9

Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. TIP (2004) 9

2004
[54]

In: NeurIPS (2022) 3

Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: NeurIPS (2022) 3

2022
[55]

Qwen3 Technical Report

Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025) 9, 10

work page internal anchor Pith review arXiv 2025
[56]

In: CVPR (2023) 2, 4 18 Y.-J

Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D., Wen, F.: Paint by example: Exemplar-based image editing with diffusion models. In: CVPR (2023) 2, 4 18 Y.-J. Tsai et al

2023
[57]

MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., Wang, L.: Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023) 3

work page internal anchor Pith review arXiv 2023
[58]

GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., et al.: Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471 (2025) 2, 3

work page internal anchor Pith review arXiv 2025
[59]

In: ACM MM (2025) 5

Zhang, J., Li, M., Tang, J., Deng, J., Tian, S., Liu, X., Zhang, M., Ye, G., Jiang, Y.G.: Editmaster: Bridging text instruction and visual example for multimodal guided image editing. In: ACM MM (2025) 5

2025
[60]

In: CVPR (2018) 9

Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018) 9

2018
[61]

TMLR (2024) 3

Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain- of-thought reasoning in language models. TMLR (2024) 3

2024
[62]

In: CVPR (2021) 2, 4

Zhou, Y., Barnes, C., Shechtman, E., Amirghodsi, S.: Transfill: Reference-guided image inpainting by merging multiple color and spatial transformations. In: CVPR (2021) 2, 4

2021