AlbumFill: Album-Guided Reasoning and Retrieval for Personalized Image Completion
Pith reviewed 2026-05-08 18:09 UTC · model grok-4.3
The pith
AlbumFill uses a vision-language model to infer the missing semantic cues in an occluded personal photo, retrieves identity-consistent references from the user's album, and completes the photo without any training.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a vision-language model can infer missing semantic cues from an occluded image to retrieve identity-consistent references from a personal album, enabling reference-based models to perform personalized image completion without any training or explicit reference provision.
What carries the argument
The AlbumFill pipeline: a VLM infers missing semantic cues from the occluded image, those cues drive composed retrieval of matching references from the personal album, and a reference-based inpainting model completes the occluded region.
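A minimal sketch of how such a training-free chain could be wired together; the callables vlm_infer_cues, embed_image, embed_text, and inpaint_with_reference stand in for off-the-shelf models and are illustrative names, not the authors' code.

```python
# Hedged sketch of an AlbumFill-style pipeline: infer cues with a VLM,
# retrieve identity-consistent references from the album, then inpaint.
# All callables passed in are placeholders for off-the-shelf components.
import numpy as np

def complete_with_album(occluded_img, mask, album_imgs,
                        vlm_infer_cues, embed_image, embed_text,
                        inpaint_with_reference, top_k=3):
    """Training-free completion guided by references retrieved from an album.

    vlm_infer_cues(img, mask) -> str            # e.g. "red jacket, short hair"
    embed_image(img) -> np.ndarray              # shared image/text embedding space
    embed_text(txt) -> np.ndarray
    inpaint_with_reference(img, mask, refs) -> completed image
    """
    # 1) Ask the VLM to describe what is likely hidden behind the mask.
    cues = vlm_infer_cues(occluded_img, mask)

    # 2) Composed retrieval: combine the visible image content with the
    #    inferred textual cues and score every album image against it.
    #    (A simple additive query is one common zero-shot choice.)
    query = embed_image(occluded_img) + embed_text(cues)
    query = query / np.linalg.norm(query)
    scores = []
    for img in album_imgs:
        emb = embed_image(img)
        scores.append(float(query @ (emb / np.linalg.norm(emb))))

    # 3) Keep the top-k most similar album images as references.
    ranked = np.argsort(scores)[::-1][:top_k]
    references = [album_imgs[i] for i in ranked]

    # 4) Reference-based inpainting fills the masked region using the references.
    return inpaint_with_reference(occluded_img, mask, references)
```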
If this is right
- Suitable references for completion can be located automatically within personal albums instead of requiring explicit user provision.
- Identity consistency in the completed images improves because the retrieved references depict the same individual.
- The new 54K-sample dataset enables standardized testing of album-guided personalized completion methods.
- Generic inpainting baselines struggle with identity preservation, underscoring the value of targeted reference retrieval.
Where Pith is reading between the lines
- The approach could extend to completing sequences in personal videos by treating album frames as references.
- Advances in vision-language model accuracy would directly boost the reliability of cue inference and reference selection.
- Personal photo collections function as user-specific data sources for consistency without needing model retraining.
Load-bearing premise
A vision-language model can accurately infer the missing semantic cues from the occluded image to guide effective retrieval of suitable references from the album.
What would settle it
An evaluation on the 54K dataset in which the VLM-inferred cues are replaced with incorrect ones: if the resulting completions show no improvement in identity-preservation metrics over generic inpainting, accurate cue inference is confirmed as the load-bearing step.
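A minimal sketch of that control, assuming simple wrappers run_albumfill (the pipeline with an overridable cue string), run_generic_inpaint, and identity_score; all names are hypothetical, not from the paper.

```python
# Cue-corruption control: if completions driven by shuffled (wrong) cues match
# generic inpainting on identity metrics, cue inference is doing the real work.
import random

def cue_corruption_ablation(samples, run_albumfill, run_generic_inpaint, identity_score):
    """samples: list of dicts with 'occluded', 'mask', 'album', 'cues', 'gt'."""
    random.seed(0)
    shuffled_cues = [s["cues"] for s in samples]
    random.shuffle(shuffled_cues)  # deliberately mismatched cues across samples

    results = {"true_cues": [], "wrong_cues": [], "generic": []}
    for s, wrong in zip(samples, shuffled_cues):
        results["true_cues"].append(
            identity_score(run_albumfill(s["occluded"], s["mask"], s["album"], s["cues"]), s["gt"]))
        results["wrong_cues"].append(
            identity_score(run_albumfill(s["occluded"], s["mask"], s["album"], wrong), s["gt"]))
        results["generic"].append(
            identity_score(run_generic_inpaint(s["occluded"], s["mask"]), s["gt"]))

    # Mean identity-preservation score per condition.
    return {k: sum(v) / len(v) for k, v in results.items()}
```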
original abstract
Personalized image completion aims to restore occluded regions in personal photos while preserving identity and appearance. Existing methods either rely on generic inpainting models that often fail to maintain identity consistency, or assume that suitable reference images are explicitly provided. In practice, suitable references are often not explicitly provided, requiring the system to search for identity-consistent images within personal photo collections. We present AlbumFill, a training-free framework that retrieves identity-consistent references from personal albums for personalized completion. Given an occluded image and a personal album, a vision-language model infers missing semantic cues to guide composed image retrieval, and the retrieved references are used by reference-based completion models. To facilitate this task, we introduce a dataset containing 54K human-centric samples with associated album images. Experiments across multiple baselines demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval. Project Page: https://liagm.github.io/AlbumFill/
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents AlbumFill, a training-free framework for personalized image completion. Given an occluded personal photo and a user's album, it employs a vision-language model to infer missing semantic cues from the occluded image, performs composed image retrieval to select identity-consistent reference images from the album, and feeds those references into reference-based completion models that restore the occluded regions while preserving identity and appearance. The authors introduce a new dataset of 54K human-centric samples with associated album images and report experiments across baselines that illustrate the task's difficulty and the importance of identity-consistent retrieval.
Significance. If the core retrieval step proves reliable, the work addresses a practical limitation in existing inpainting and reference-based completion methods by automating the discovery of suitable personal references without requiring explicit user-provided images or task-specific training. The training-free design that chains off-the-shelf VLMs and completion models is a notable strength, and the introduced dataset could become a useful benchmark for future personalized completion research.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the claim that experiments 'demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval' is not supported by any reported quantitative metrics, baselines, or ablation numbers in the provided abstract or high-level description; without these, the central empirical claim cannot be evaluated.
- [§3] §3 (Method): the framework's training-free claim rests on the untested assumption that a VLM, given only the occluded image, produces semantic cues accurate enough for effective composed retrieval of identity-consistent album images. No direct metrics (cue precision/recall vs. human annotations, oracle-cue ablations, or failure-case analysis) isolate this inference step, which is load-bearing for downstream identity preservation.
- [§4] §4 (Experiments): the manuscript does not report end-to-end quantitative results (e.g., identity similarity scores, perceptual metrics, or comparisons against generic inpainting and non-album baselines) that would show whether the retrieved references actually improve completion quality over alternatives.
minor comments (2)
- [Dataset] The dataset description should include more detail on how album images were paired with occluded queries and any filtering criteria applied to the 54K samples.
- [Figures] Ensure all figures clearly label the occluded input, inferred cues, retrieved references, and final output for each example.
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and outline targeted revisions to strengthen the empirical grounding of the claims.
point-by-point responses
Referee: [Abstract and §4] Abstract and §4 (Experiments): the claim that experiments 'demonstrate the difficulty of personalized completion and highlight the importance of identity-consistent reference retrieval' is not supported by any reported quantitative metrics, baselines, or ablation numbers in the provided abstract or high-level description; without these, the central empirical claim cannot be evaluated.
Authors: We agree that the abstract and the high-level summary of §4 do not include specific quantitative numbers or ablation details. Although the full experiments section reports comparisons across multiple baselines, we will revise the abstract to incorporate key quantitative results (e.g., identity similarity improvements) and expand §4 with explicit ablation tables and baseline numbers to directly support the claims regarding task difficulty and the value of identity-consistent retrieval. revision: yes
Referee: [§3] §3 (Method): the framework's training-free claim rests on the untested assumption that a VLM, given only the occluded image, produces semantic cues accurate enough for effective composed retrieval of identity-consistent album images. No direct metrics (cue precision/recall vs. human annotations, oracle-cue ablations, or failure-case analysis) isolate this inference step, which is load-bearing for downstream identity preservation.
Authors: The referee is correct that we have not isolated the VLM cue inference step with direct metrics. We will add an oracle-cue ablation study in the revised §4, comparing end-to-end retrieval and completion performance when using VLM-inferred cues versus ground-truth cues. We will also include qualitative failure-case analysis for the cue inference process. Direct precision/recall against human annotations is not currently available and would require new labeling effort; we will note this as a limitation and potential future direction. revision: partial
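A sketch of what such an oracle-cue ablation could measure at the retrieval stage, assuming a hypothetical retrieve callable and annotations of which album images are identity-consistent; the names are illustrative, not the authors' protocol.

```python
# Oracle-cue ablation at the retrieval step: how often is a correct album
# reference in the top-k when queries use VLM-inferred vs. ground-truth cues?
def retrieval_recall_at_k(samples, retrieve, k=3):
    """samples: dicts with 'occluded', 'album', 'inferred_cues', 'oracle_cues',
    and 'relevant_ids' (indices of identity-consistent album images).
    retrieve(occluded, album, cues, k) -> list of album indices."""
    hits = {"inferred": 0, "oracle": 0}
    for s in samples:
        for mode, cues in (("inferred", s["inferred_cues"]),
                           ("oracle", s["oracle_cues"])):
            topk = retrieve(s["occluded"], s["album"], cues, k)
            if any(i in s["relevant_ids"] for i in topk):
                hits[mode] += 1
    n = len(samples)
    return {mode: count / n for mode, count in hits.items()}
```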
Referee: [§4] §4 (Experiments): the manuscript does not report end-to-end quantitative results (e.g., identity similarity scores, perceptual metrics, or comparisons against generic inpainting and non-album baselines) that would show whether the retrieved references actually improve completion quality over alternatives.
Authors: We thank the referee for highlighting this gap. While baseline comparisons are present, we will substantially expand §4 to report full end-to-end quantitative results. These will include identity similarity scores (using standard face recognition embeddings), perceptual metrics (e.g., LPIPS and FID), and direct comparisons against generic inpainting models as well as non-album reference baselines. This will clearly quantify the benefit of the album-guided, identity-consistent retrieval. revision: yes
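A minimal sketch of the identity-similarity metric mentioned here, assuming a placeholder embed_face extractor (e.g., an ArcFace-style face-recognition embedding); this illustrates the metric, not the authors' exact evaluation code.

```python
# Identity similarity as cosine similarity between face-recognition embeddings
# of the completed image and the ground truth (embed_face is a placeholder).
import numpy as np

def identity_similarity(completed_img, ground_truth_img, embed_face):
    """embed_face(img) -> 1-D face embedding, or None if no face is detected."""
    e1, e2 = embed_face(completed_img), embed_face(ground_truth_img)
    if e1 is None or e2 is None:
        return None  # skip samples where a face cannot be detected
    e1 = e1 / np.linalg.norm(e1)
    e2 = e2 / np.linalg.norm(e2)
    return float(np.dot(e1, e2))  # cosine similarity in [-1, 1]
```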
Circularity Check
No circularity: framework chains external VLMs and completion models without self-referential reductions
full rationale
The paper describes a training-free pipeline: VLM infers semantic cues from occluded images, guides composed retrieval from the album, then applies reference-based completion. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs. The new dataset (54K samples) and baseline experiments serve as external validation rather than tautological definitions. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The method relies on off-the-shelf components whose behavior is independent of the present work, satisfying the criteria for a self-contained non-circular contribution.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Vision-language models can infer missing semantic cues from occluded personal images to guide retrieval.
Reference graph
Works this paper leans on
- [1] Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F.L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al.: GPT-4 technical report. arXiv preprint arXiv:2303.08774 (2023)
- [2] Agnolucci, L., Baldrati, A., Del Bimbo, A., Bertini, M.: iSEARLE: Improving textual inversion for zero-shot composed image retrieval. TPAMI (2025)
- [3] Bai, S., Chen, K., Liu, X., Wang, J., Ge, W., Song, S., Dang, K., Wang, P., Wang, S., Tang, J., et al.: Qwen2.5-VL technical report. arXiv preprint arXiv:2502.13923 (2025)
- [4] Baldrati, A., Agnolucci, L., Bertini, M., Del Bimbo, A.: Zero-shot composed image retrieval with textual inversion. In: ICCV (2023)
- [5] Cao, C., Cai, Y., Dong, Q., Wang, Y., Fu, Y.: LeftRefill: Filling right canvas based on left reference through generalized text-to-image diffusion model. In: CVPR (2024)
- [6] Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: ICCV (2021)
- [7] Chen, X., Feng, Y., Chen, M., Wang, Y., Zhang, S., Liu, Y., Shen, Y., Zhao, H.: Zero-shot image editing with reference imitation. In: NeurIPS (2024)
- [8] Chen, X., Huang, L., Liu, Y., Shen, Y., Zhao, D., Zhao, H.: AnyDoor: Zero-shot object-level image customization. In: CVPR (2024)
- [9] Chen, X., Zhang, Z., Zhang, H., Zhou, Y., Kim, S.Y., Liu, Q., Li, Y., Zhang, J., Zhao, N., Wang, Y., et al.: UniReal: Universal image generation and editing via learning real-world dynamics. In: CVPR (2025)
- [10] Chen, Z., Wu, J., Wang, W., Su, W., Chen, G., Xing, S., Zhong, M., Zhang, Q., Zhu, X., Lu, L., et al.: InternVL: Scaling up vision foundation models and aligning for generic visual-linguistic tasks. In: CVPR (2024)
- [11] Comanici, G., Bieber, E., Schaekermann, M., Pasupat, I., Sachdeva, N., Dhillon, I., Blistein, M., Ram, O., Zhang, D., Rosen, E., et al.: Gemini 2.5: Pushing the frontier with advanced reasoning, multimodality, long context, and next generation agentic capabilities. arXiv preprint arXiv:2507.06261 (2025)
- [12] Dai, W., Li, J., Li, D., Tiong, A., Zhao, J., Wang, W., Li, B., Fung, P.N., Hoi, S.: InstructBLIP: Towards general-purpose vision-language models with instruction tuning. In: NeurIPS (2023)
- [13] DeepMind, G.: Gemini 3 Pro Image model card (November 2025), https://storage.googleapis.com/deepmind-media/Model-Cards/Gemini-3-Pro-Image-Model-Card.pdf
- [14] Deng, C., Zhu, D., Li, K., Gou, C., Li, F., Wang, Z., Zhong, S., Yu, W., Nie, X., Song, Z., et al.: Emerging properties in unified multimodal pretraining. arXiv preprint arXiv:2505.14683 (2025)
- [15] Dosovitskiy, A.: An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929 (2020)
- [16] Fu, S., Tamir, N., Sundaram, S., Chai, L., Zhang, R., Dekel, T., Isola, P.: DreamSim: Learning new dimensions of human visual similarity using synthetic data. In: NeurIPS (2024)
- [17] Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al.: The Llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)
- [18] Gu, G., Chun, S., Kim, W., Kang, Y., Yun, S.: Language-only training of zero-shot composed image retrieval. In: CVPR (2024)
- [19] Gu, J., Zhao, N., Xiong, W., Liu, Q., Zhang, Z., Zhang, H., Zhang, J., Jung, H., Wang, Y., Wang, X.E.: SwapAnything: Enabling arbitrary object swapping in personalized image editing. In: ECCV (2024)
- [20] Guo, J., Deng, J., et al.: InsightFace: State-of-the-art 2D and 3D face analysis project. https://github.com/deepinsight/insightface (2025)
- [21] Hore, A., Ziou, D.: Image quality metrics: PSNR vs. SSIM. In: ICPR (2010)
- [22] Jang, Y.K., Huynh, D., Shah, A., Chen, W.K., Lim, S.N.: Spherical linear interpolation and text-anchoring for zero-shot composed image retrieval. In: ECCV (2024)
- [23] Jocher, G., Chaurasia, A., Qiu, J.: Ultralytics YOLOv8 (2023)
- [24] Joon Oh, S., Benenson, R., Fritz, M., Schiele, B.: Person recognition in personal photo collections. In: ICCV (2015)
- [25] Ju, X., Liu, X., Wang, X., Bian, Y., Shan, Y., Xu, Q.: BrushNet: A plug-and-play image inpainting model with decomposed dual-branch diffusion. In: ECCV (2024)
- [26] Karthik, S., Roth, K., Mancini, M., Akata, Z.: Vision-by-language for training-free compositional image retrieval. In: ICLR (2024)
- [27] Li, J., Li, D., Savarese, S., Hoi, S.: BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In: ICML (2023)
- [28] Lin, H., Wen, H., Song, X., Liu, M., Hu, Y., Nie, L.: Fine-grained textual inversion network for zero-shot composed image retrieval. In: ACM SIGIR (2024)
- [29] Liu, H., Li, C., Li, Y., Lee, Y.J.: Improved baselines with visual instruction tuning. In: CVPR (2024)
- [30] Liu, H., Li, C., Wu, Q., Lee, Y.J.: Visual instruction tuning. In: NeurIPS (2023)
- [31] Lu, H., Liu, W., Zhang, B., Wang, B., Dong, K., Liu, B., Sun, J., Ren, T., Li, Z., Yang, H., et al.: DeepSeek-VL: Towards real-world vision-language understanding. arXiv preprint arXiv:2403.05525 (2024)
- [32] Lugmayr, A., Danelljan, M., Romero, A., Yu, F., Timofte, R., Van Gool, L.: RePaint: Inpainting using denoising diffusion probabilistic models. In: CVPR (2022)
- [33] Mitra, C., Huang, B., Darrell, T., Herzig, R.: Compositional chain-of-thought prompting for large multimodal models. In: CVPR (2024)
- [34] Peng, Z., Wang, W., Dong, L., Hao, Y., Huang, S., Ma, S., Wei, F.: Kosmos-2: Grounding multimodal large language models to the world. arXiv preprint arXiv:2306.14824 (2023)
- [35] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: ICML (2021)
- [36] Saito, K., Sohn, K., Zhang, X., Li, C.L., Lee, C.Y., Saenko, K., Pfister, T.: Pic2Word: Mapping pictures to words for zero-shot composed image retrieval. In: CVPR (2023)
- [37] Shao, H., Qian, S., Xiao, H., Song, G., Zong, Z., Wang, L., Liu, Y., Li, H.: Visual CoT: Advancing multi-modal language models with a comprehensive dataset and benchmark for chain-of-thought reasoning. In: NeurIPS (2024)
- [38] Song, Y., Zhang, Z., Lin, Z., Cohen, S., Price, B., Zhang, J., Kim, S.Y., Aliaga, D.: ObjectStitch: Object compositing with diffusion model. In: CVPR (2023)
- [39] Song, Y., Zhang, Z., Lin, Z., Cohen, S., Price, B., Zhang, J., Kim, S.Y., Zhang, H., Xiong, W., Aliaga, D.: IMPRINT: Generative object compositing by learning identity-preserving representation. In: CVPR (2024)
- [40] Sun, S., Ye, F., Gong, S.: Training-free zero-shot composed image retrieval with local concept reranking. arXiv preprint arXiv:2312.08924 (2023)
- [41] Suo, Y., Ma, F., Zhu, L., Yang, Y.: Knowledge-enhanced dual-stream zero-shot composed image retrieval. In: CVPR (2024)
- [42] Suvorov, R., Logacheva, E., Mashikhin, A., Remizova, A., Ashukha, A., Silvestrov, A., Kong, N., Goka, H., Park, K., Lempitsky, V.: Resolution-robust large mask inpainting with Fourier convolutions. In: WACV (2022)
- [43] Tang, Y., Yu, J., Gai, K., Zhuang, J., Xiong, G., Hu, Y., Wu, Q.: Context-I2W: Mapping images to context-dependent words for accurate zero-shot composed image retrieval. In: AAAI (2024)
- [44] Tang, Y., Zhang, J., Qin, X., Yu, J., Gou, G., Xiong, G., Lin, Q., Rajmohan, S., Zhang, D., Wu, Q.: Reason-before-retrieve: One-stage reflective chain-of-thoughts for training-free zero-shot composed image retrieval. In: CVPR (2025)
- [45] Team, G., Georgiev, P., Lei, V.I., Burnell, R., Bai, L., Gulati, A., Tanzer, G., Vincent, D., Pan, Z., Wang, S., et al.: Gemini 1.5: Unlocking multimodal understanding across millions of tokens of context. arXiv preprint arXiv:2403.05530 (2024)
- [46] Team, G., Kamath, A., Ferret, J., Pathak, S., Vieillard, N., Merhej, R., Perrin, S., Matejovicova, T., Ramé, A., Rivière, M., et al.: Gemma 3 technical report. arXiv preprint arXiv:2503.19786 (2025)
- [47] Thomee, B., Shamma, D.A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., Li, L.J.: YFCC100M: The new data in multimedia research. Communications of the ACM (2016)
- [48] Tian, X., Li, W., Xu, B., Yuan, Y., Wang, Y., Shen, H.: MIGE: Mutually enhanced multimodal instruction-based image generation and editing. In: ACM MM (2025)
- [49] Tsai, Y.J., Price, B., Liu, Q., Figueroa, L., Pakhomov, D., Ding, Z., Cohen, S., Yang, M.H.: CompleteMe: Reference-based human image completion. In: ICCV (2025)
- [50] Wang, P., Bai, S., Tan, S., Wang, S., Fan, Z., Bai, J., Chen, K., Liu, X., Wang, J., Ge, W., et al.: Qwen2-VL: Enhancing vision-language model's perception of the world at any resolution. arXiv preprint arXiv:2409.12191 (2024)
- [51] Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: InternVL3.5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
- [52] Wang, Y., Lin, Z., Shen, X., Mech, R., Miller, G., Cottrell, G.W.: Event-specific image importance. In: CVPR (2016)
- [53] Wang, Z., Bovik, A.C., Sheikh, H.R., Simoncelli, E.P.: Image quality assessment: from error visibility to structural similarity. TIP (2004)
- [54] Wei, J., Wang, X., Schuurmans, D., Bosma, M., Xia, F., Chi, E., Le, Q.V., Zhou, D., et al.: Chain-of-thought prompting elicits reasoning in large language models. In: NeurIPS (2022)
- [55] Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al.: Qwen3 technical report. arXiv preprint arXiv:2505.09388 (2025)
- [56] Yang, B., Gu, S., Zhang, B., Zhang, T., Chen, X., Sun, X., Chen, D., Wen, F.: Paint by Example: Exemplar-based image editing with diffusion models. In: CVPR (2023)
- [57] Yang, Z., Li, L., Wang, J., Lin, K., Azarnasab, E., Ahmed, F., Liu, Z., Liu, C., Zeng, M., Wang, L.: MM-REACT: Prompting ChatGPT for multimodal reasoning and action. arXiv preprint arXiv:2303.11381 (2023)
- [58] Zeng, A., Lv, X., Zheng, Q., Hou, Z., Chen, B., Xie, C., Wang, C., Yin, D., Zeng, H., Zhang, J., et al.: GLM-4.5: Agentic, Reasoning, and Coding (ARC) foundation models. arXiv preprint arXiv:2508.06471 (2025)
- [59] Zhang, J., Li, M., Tang, J., Deng, J., Tian, S., Liu, X., Zhang, M., Ye, G., Jiang, Y.G.: EditMaster: Bridging text instruction and visual example for multimodal guided image editing. In: ACM MM (2025)
- [60] Zhang, R., Isola, P., Efros, A.A., Shechtman, E., Wang, O.: The unreasonable effectiveness of deep features as a perceptual metric. In: CVPR (2018)
- [61] Zhang, Z., Zhang, A., Li, M., Zhao, H., Karypis, G., Smola, A.: Multimodal chain-of-thought reasoning in language models. TMLR (2024)
- [62] Zhou, Y., Barnes, C., Shechtman, E., Amirghodsi, S.: TransFill: Reference-guided image inpainting by merging multiple color and spatial transformations. In: CVPR (2021)