pith. machine review for the scientific record. sign in

arxiv: 2605.02892 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.IR

Recognition: 2 theorem links

· Lean Theorem

AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion

Authors on Pith no claims yet

Pith reviewed 2026-05-08 18:09 UTC · model grok-4.3

classification 💻 cs.CV cs.IR
keywords personalized image completionimage inpaintingreference-based generationvision-language modelsalbum retrievalidentity preservationoccluded image restoration
0
0 comments X

The pith

AlbumFill retrieves identity-consistent references from personal albums using vision-language model inferences to complete occluded personal photos without training.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

AlbumFill is a training-free framework for personalized image completion that restores occluded regions in personal photos while preserving the person's identity and appearance. It addresses the shortcomings of generic inpainting models, which often change identities, and methods that assume suitable reference images are already provided by automatically searching the user's personal photo album. A vision-language model first infers the missing semantic cues from the occluded image to guide retrieval of matching references, which are then passed to reference-based completion models for the final output. The authors introduce a new dataset of 54,000 human-centric samples with associated albums to evaluate this approach. Experiments show that identity-consistent reference retrieval is essential for effective personalized completion.

Core claim

The central claim is that a vision-language model can infer missing semantic cues from an occluded image to retrieve identity-consistent references from a personal album, enabling reference-based models to perform personalized image completion without any training or explicit reference provision.

What carries the argument

The AlbumFill pipeline: VLM-based inference of semantic cues from the occluded image to perform composed retrieval of matching references from the personal album, followed by reference-based inpainting.

If this is right

  • Suitable references for completion can be located automatically within personal albums instead of requiring explicit user provision.
  • Identity consistency in the completed images improves because the retrieved references depict the same individual.
  • The new 54K-sample dataset enables standardized testing of album-guided personalized completion methods.
  • Generic inpainting baselines struggle with identity preservation, underscoring the value of targeted reference retrieval.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could extend to completing sequences in personal videos by treating album frames as references.
  • Advances in vision-language model accuracy would directly boost the reliability of cue inference and reference selection.
  • Personal photo collections function as user-specific data sources for consistency without needing model retraining.

Load-bearing premise

A vision-language model can accurately infer the missing semantic cues from the occluded image to guide effective retrieval of suitable references from the album.

What would settle it

An evaluation where VLM-inferred cues are replaced with incorrect ones, and the resulting completions show no improvement in identity preservation metrics over generic inpainting on the 54K dataset.

Figures

Figures reproduced from arXiv: 2605.02892 by Brian Price, Daniil Pakhomov, Luis Figueroa, Ming-Hsuan Yang, Qing Liu, Scott Cohen, Yu-Ju Tsai, Zhihong Ding.

Figure 1
Figure 1. Figure 1: Comparison between previous reference-based image completion and view at source ↗
Figure 2
Figure 2. Figure 2: Data generation pipeline for constructing our Album Dataset. view at source ↗
Figure 3
Figure 3. Figure 3: AlbumFill system overview. (a) Given a masked target image, a Vision￾Language Model (VLM) performs masked visual reasoning to generate a textual hy￾pothesis describing the likely content behind the masked region. (b) The reasoning text and visible context are used to compose a multimodal query that retrieves the most semantically aligned and identity-consistent reference image from the user’s personal albu… view at source ↗
Figure 4
Figure 4. Figure 4: Visual comparison with different categories of completion methods view at source ↗
read the original abstract

Personalized image completion aims to restore occluded regions in personal photos while preserving identity and appearance. Existing methods either rely on generic inpainting models that often fail to maintain identity consistency, or assume that suitable reference images are explicitly provided. In practice, suitable references are often not explicitly provided, requiring the system to search for identity-consistent images within personal photo collections. We present AlbumFill, a training-free framework that retrieves identity-consistent references from personal albums for personalized completion. Given an occluded image and a personal album, a vision-language model infers missing semantic cues to guide composed image retrieval, and the retrieved references are used by reference-based completion models. To facilitate this task, we introduce a dataset containing 54K human-centric samples with associated album images. Experiments across multiple baselines demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval. Project Page: https://liagm.github.io/AlbumFill/

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper presents AlbumFill, a training-free framework for personalized image completion that, given an occluded personal photo and a user's album, employs a vision-language model to infer missing semantic cues from the occluded image, performs composed image retrieval to select identity-consistent reference images from the album, and feeds those references into reference-based completion models to restore the occluded regions while preserving identity and appearance. The authors introduce a new dataset of 54K human-centric samples with associated album images and report experiments across baselines that illustrate the task's difficulty and the importance of identity-consistent retrieval.

Significance. If the core retrieval step proves reliable, the work addresses a practical limitation in existing inpainting and reference-based completion methods by automating the discovery of suitable personal references without requiring explicit user-provided images or task-specific training. The training-free design that chains off-the-shelf VLMs and completion models is a notable strength, and the introduced dataset could become a useful benchmark for future personalized completion research.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (Experiments): the claim that experiments 'demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval' is not supported by any reported quantitative metrics, baselines, or ablation numbers in the provided abstract or high-level description; without these, the central empirical claim cannot be evaluated.
  2. [§3] §3 (Method): the framework's training-free claim rests on the untested assumption that a VLM, given only the occluded image, produces semantic cues accurate enough for effective composed retrieval of identity-consistent album images. No direct metrics (cue precision/recall vs. human annotations, oracle-cue ablations, or failure-case analysis) isolate this inference step, which is load-bearing for downstream identity preservation.
  3. [§4] §4 (Experiments): the manuscript does not report end-to-end quantitative results (e.g., identity similarity scores, perceptual metrics, or comparisons against generic inpainting and non-album baselines) that would show whether the retrieved references actually improve completion quality over alternatives.
minor comments (2)
  1. [Dataset] The dataset description should include more detail on how album images were paired with occluded queries and any filtering criteria applied to the 54K samples.
  2. [Figures] Ensure all figures clearly label the occluded input, inferred cues, retrieved references, and final output for each example.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and outline targeted revisions to strengthen the empirical grounding of the claims.

read point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that experiments 'demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval' is not supported by any reported quantitative metrics, baselines, or ablation numbers in the provided abstract or high-level description; without these, the central empirical claim cannot be evaluated.

    Authors: We agree that the abstract and the high-level summary of §4 do not include specific quantitative numbers or ablation details. Although the full experiments section reports comparisons across multiple baselines, we will revise the abstract to incorporate key quantitative results (e.g., identity similarity improvements) and expand §4 with explicit ablation tables and baseline numbers to directly support the claims regarding task difficulty and the value of identity-consistent retrieval. revision: yes

  2. Referee: [§3] §3 (Method): the framework's training-free claim rests on the untested assumption that a VLM, given only the occluded image, produces semantic cues accurate enough for effective composed retrieval of identity-consistent album images. No direct metrics (cue precision/recall vs. human annotations, oracle-cue ablations, or failure-case analysis) isolate this inference step, which is load-bearing for downstream identity preservation.

    Authors: The referee is correct that we have not isolated the VLM cue inference step with direct metrics. We will add an oracle-cue ablation study in the revised §4, comparing end-to-end retrieval and completion performance when using VLM-inferred cues versus ground-truth cues. We will also include qualitative failure-case analysis for the cue inference process. Direct precision/recall against human annotations is not currently available and would require new labeling effort; we will note this as a limitation and potential future direction. revision: partial

  3. Referee: [§4] §4 (Experiments): the manuscript does not report end-to-end quantitative results (e.g., identity similarity scores, perceptual metrics, or comparisons against generic inpainting and non-album baselines) that would show whether the retrieved references actually improve completion quality over alternatives.

    Authors: We thank the referee for highlighting this gap. While baseline comparisons are present, we will substantially expand §4 to report full end-to-end quantitative results. These will include identity similarity scores (using standard face recognition embeddings), perceptual metrics (e.g., LPIPS and FID), and direct comparisons against generic inpainting models as well as non-album reference baselines. This will clearly quantify the benefit of the album-guided, identity-consistent retrieval. revision: yes

Circularity Check

0 steps flagged

No circularity: framework chains external VLMs and completion models without self-referential reductions

full rationale

The paper describes a training-free pipeline: VLM infers semantic cues from occluded images, guides composed retrieval from the album, then applies reference-based completion. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. The new dataset (54K samples) and baseline experiments serve as external validation rather than tautological definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The method relies on off-the-shelf components whose behavior is independent of the present work, satisfying the criteria for a self-contained non-circular contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework depends on pre-trained vision-language models and reference-based completion models; no new free parameters, axioms beyond standard domain assumptions, or invented entities are introduced.

axioms (1)
  • domain assumption Vision-language models can infer missing semantic cues from occluded personal images to guide retrieval
    This inference step is presented as the mechanism for finding suitable album references.

pith-pipeline@v0.9.0 · 5479 in / 1189 out tokens · 46456 ms · 2026-05-08T18:09:15.389736+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 16 canonical work pages · 11 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: Gpt-4 technical report. arXiv preprint arXiv:2303.08774 (2023) 4

  2. [2]

    TPAMI (2025) 9, 11

    Agnolucci, L., Baldrati, A., Del Bimbo, A., Bertini, M.: isearle: Improving textual inversion for zero-shot composed image retrieval. TPAMI (2025) 9, 11

  3. [3]

    Qwen2.5-VL Technical Report

    Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2. 5-vl technical report. arXiv preprint arXiv:2502.13923 (2025) 2, 3

  4. [4]

    In: ICCV (2023) 4

    Baldrati, A., Agnolucci, L., Bertini, M., Del Bimbo, A.: Zero-shot composed image retrieval with textual inversion. In: ICCV (2023) 4

  5. [5]

    In: CVPR (2024) 4

    Cao,C.,Cai,Y.,Dong,Q.,Wang,Y.,Fu,Y.:Leftrefill:Fillingrightcanvasbasedon left reference through generalized text-to-image diffusion model. In: CVPR (2024) 4

  6. [6]

    In: ICCV (2021) 9

    Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021) 9

  7. [7]

    In: NeurIPS (2024) 2, 4, 9, 11, 12, 13

    Chen, X., Feng, Y., Chen, M., Wang, Y., Zhang, S., Liu, Y., Shen, Y., Zhao, H.: Zero-shot image editing with reference imitation. In: NeurIPS (2024) 2, 4, 9, 11, 12, 13

  8. [8]

    In: CVPR (2024) 4

    Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., Zhao, H.: Anydoor: Zero-shot object-level image customization. In: CVPR (2024) 4

  9. [9]

    In: CVPR (2025) 5, 9, 11, 14

    Chen, X., Zhang, Z., Zhang, H., Zhou, Y., Kim, S.Y., Liu, Q., Li, Y., Zhang, J., Zhao, N., Wang, Y., et al.: Unireal: Universal image generation and editing via learning real-world dynamics. In: CVPR (2025) 5, 9, 11, 14

  10. [10]

    In: CVPR (2024) 2, 3

    Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: Internvl: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR (2024) 2, 3

  11. [11]

    Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities

    Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025) 4, 10

  12. [12]

    In: NeurIPS (2023) 3

    Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: Instructblip: Towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023) 3

  13. [13]

    DeepMind, G.: Gemini 3 pro image model card (November 2025),https:// storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Image- Model-Card.pdf3, 5, 11, 12, 13

  14. [14]

    Emerging Properties in Unified Multimodal Pretraining

    Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025) 3, 5, 11, 12, 13

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recogni- tion at scale. arXiv preprint arXiv:2010.11929 (2020) 9

  16. [16]

    In: NeurIPS (2024) 9

    Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: Dream- sim: Learning new dimensions of human visual similarity using synthetic data. In: NeurIPS (2024) 9

  17. [17]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Let- man, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024) 2, 3 16 Y.-J. Tsai et al

  18. [18]

    In: CVPR (2024) 3, 4, 9, 11, 13

    Gu, G., Chun, S., Kim, W., Kang, Y., Yun, S.: Language-only training of zero-shot composed image retrieval. In: CVPR (2024) 3, 4, 9, 11, 13

  19. [19]

    In: ECCV (2024) 1

    Gu, J., Zhao, N., Xiong, W., Liu, Q., Zhang, Z., Zhang, H., Zhang, J., Jung, H., Wang, Y., Wang, X.E.: Swapanything: Enabling arbitrary object swapping in personalized image editing. In: ECCV (2024) 1

  20. [20]

    Guo, J., Deng, J., et al.: Insightface: State-of-the-art 2d and 3d face analysis project.https://github.com/deepinsight/insightface(2025) 5

  21. [21]

    Hore, A., Ziou, D.: Image quality metrics: Psnr vs. ssim. In: ICPR (2010) 9

  22. [22]

    In: ECCV (2024) 3, 4

    Jang, Y.K., Huynh, D., Shah, A., Chen, W.K., Lim, S.N.: Spherical linear interpo- lation and text-anchoring for zero-shot composed image retrieval. In: ECCV (2024) 3, 4

  23. [23]

    Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics yolov8 (2023) 5

  24. [24]

    In: ICCV (2015) 1

    Joon Oh, S., Benenson, R., Fritz, M., Schiele, B.: Person recognition in personal photo collections. In: ICCV (2015) 1

  25. [25]

    In: ECCV (2024) 2, 11, 12, 13

    Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: Brushnet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: ECCV (2024) 2, 11, 12, 13

  26. [26]

    In: ICLR (2024) 4, 9, 11

    Karthik, S., Roth, K., Mancini, M., Akata, Z.: Vision-by-language for training-free compositional image retrieval. In: ICLR (2024) 4, 9, 11

  27. [27]

    In: ICML (2023) 2, 3

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: Bootstrapping language-image pre- training with frozen image encoders and large language models. In: ICML (2023) 2, 3

  28. [28]

    In: ACM SIGIR (2024) 3, 4

    Lin, H., Wen, H., Song, X., Liu, M., Hu, Y., Nie, L.: Fine-grained textual inversion network for zero-shot composed image retrieval. In: ACM SIGIR (2024) 3, 4

  29. [29]

    In: CVPR (2024) 3, 9, 10

    Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024) 3, 9, 10

  30. [30]

    In: NeurIPS (2023) 3

    Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023) 3

  31. [31]

    DeepSeek-VL: Towards Real-World Vision-Language Understanding

    Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., et al.: Deepseek-vl: towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525 (2024) 4

  32. [32]

    In: CVPR (2022) 2

    Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: Re- paint: Inpainting using denoising diffusion probabilistic models. In: CVPR (2022) 2

  33. [33]

    In: CVPR (2024) 3

    Mitra, C., Huang, B., Darrell, T., Herzig, R.: Compositional chain-of-thought prompting for large multimodal models. In: CVPR (2024) 3

  34. [34]

    Kosmos-2: Grounding Multimodal Large Language Models to the World

    Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos- 2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023) 3

  35. [35]

    In: ICML (2021) 4, 9

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021) 4, 9

  36. [36]

    In: CVPR (2023) 3, 4

    Saito, K., Sohn, K., Zhang, X., Li, C.L., Lee, C.Y., Saenko, K., Pfister, T.: Pic2word: Mapping pictures to words for zero-shot composed image retrieval. In: CVPR (2023) 3, 4

  37. [37]

    In: NeurIPS (2024) 3

    Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual cot: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In: NeurIPS (2024) 3

  38. [38]

    In: CVPR (2023) 4 AlbumFill17

    Song, Y., Zhang, Z., Lin, Z., Cohen, S., Price, B., Zhang, J., Kim, S.Y., Aliaga, D.: Objectstitch: Object compositing with diffusion model. In: CVPR (2023) 4 AlbumFill17

  39. [39]

    In: CVPR (2024) 4

    Song, Y., Zhang, Z., Lin, Z., Cohen, S., Price, B., Zhang, J., Kim, S.Y., Zhang, H., Xiong, W., Aliaga, D.: Imprint: Generative object compositing by learning identity-preserving representation. In: CVPR (2024) 4

  40. [40]

    arXiv preprint arXiv:2312.08924 (2023) 4

    Sun, S., Ye, F., Gong, S.: Training-free zero-shot composed image retrieval with local concept reranking. arXiv preprint arXiv:2312.08924 (2023) 4

  41. [41]

    In: CVPR (2024) 3, 4

    Suo, Y., Ma, F., Zhu, L., Yang, Y.: Knowledge-enhanced dual-stream zero-shot composed image retrieval. In: CVPR (2024) 3, 4

  42. [42]

    In: WACV (2022) 2

    Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V.: Resolution-robust large mask inpainting with fourier convolutions. In: WACV (2022) 2

  43. [43]

    In: AAAI (2024) 3, 4

    Tang, Y., Yu, J., Gai, K., Zhuang, J., Xiong, G., Hu, Y., Wu, Q.: Context-i2w: Mapping images to context-dependent words for accurate zero-shot composed im- age retrieval. In: AAAI (2024) 3, 4

  44. [44]

    In: CVPR (2025) 4

    Tang, Y., Zhang, J., Qin, X., Yu, J., Gou, G., Xiong, G., Lin, Q., Rajmohan, S., Zhang, D., Wu, Q.: Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval. In: CVPR (2025) 4

  45. [45]

    Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context

    Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vin- cent,D.,Pan,Z.,Wang,S.,etal.:Gemini1.5:Unlockingmultimodalunderstanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024) 4

  46. [46]

    Gemma 3 Technical Report

    Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al.: Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025) 2, 3

  47. [47]

    Communications of the ACM (2016) 1, 5

    Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: Yfcc100m: The new data in multimedia research. Communications of the ACM (2016) 1, 5

  48. [48]

    In: ACM MM (2025) 5

    Tian, X., Li, W., Xu, B., Yuan, Y., Wang, Y., Shen, H.: Mige: Mutually enhanced multimodal instruction-based image generation and editing. In: ACM MM (2025) 5

  49. [49]

    In: ICCV (2025) 1, 2, 4, 9, 11, 12, 13

    Tsai, Y.J., Price, B., Liu, Q., Figueroa, L., Pakhomov, D., Ding, Z., Cohen, S., Yang, M.H.: Completeme: Reference-based human image completion. In: ICCV (2025) 1, 2, 4, 9, 11, 12, 13

  50. [50]

    Qwen2-VL: Enhancing Vision-Language Model's Perception of the World at Any Resolution

    Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-vl: Enhancing vision-language model’s perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024) 2, 3

  51. [51]

    InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

    Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025) 2, 3, 9, 10

  52. [52]

    In: CVPR (2016) 4, 5

    Wang, Y., Lin, Z., Shen, X., Mech, R., Miller, G., Cottrell, G.W.: Event-specific image importance. In: CVPR (2016) 4, 5

  53. [53]

    TIP (2004) 9

    Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. TIP (2004) 9

  54. [54]

    In: NeurIPS (2022) 3

    Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: NeurIPS (2022) 3

  55. [55]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025) 9, 10

  56. [56]

    In: CVPR (2023) 2, 4 18 Y.-J

    Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D., Wen, F.: Paint by example: Exemplar-based image editing with diffusion models. In: CVPR (2023) 2, 4 18 Y.-J. Tsai et al

  57. [57]

    MM-REACT: Prompting ChatGPT for Multimodal Reasoning and Action

    Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., Wang, L.: Mm-react: Prompting chatgpt for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023) 3

  58. [58]

    GLM-4.5: Agentic, Reasoning, and Coding (ARC) Foundation Models

    Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., et al.: Glm-4.5: Agentic, reasoning, and coding (arc) foundation models. arXiv preprint arXiv:2508.06471 (2025) 2, 3

  59. [59]

    In: ACM MM (2025) 5

    Zhang, J., Li, M., Tang, J., Deng, J., Tian, S., Liu, X., Zhang, M., Ye, G., Jiang, Y.G.: Editmaster: Bridging text instruction and visual example for multimodal guided image editing. In: ACM MM (2025) 5

  60. [60]

    In: CVPR (2018) 9

    Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018) 9

  61. [61]

    TMLR (2024) 3

    Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain- of-thought reasoning in language models. TMLR (2024) 3

  62. [62]

    In: CVPR (2021) 2, 4

    Zhou, Y., Barnes, C., Shechtman, E., Amirghodsi, S.: Transfill: Reference-guided image inpainting by merging multiple color and spatial transformations. In: CVPR (2021) 2, 4