Recognition: unknown
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3
The pith
Object-centric vision supplies a framework that extends LMMs to precise object-level understanding, segmentation, editing, and generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that object-centric vision supplies a principled framework for addressing LMM limitations in instance identification, identity preservation, and precise localization by promoting explicit representations and operations over visual entities. It organizes the literature into object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation; summarizes the key modeling paradigms, learning strategies, and evaluation protocols supporting these capabilities; and outlines open challenges including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift.
What carries the argument
The four-theme organization of object-centric visual understanding, referring segmentation, visual editing, and visual generation that structures the surveyed advances.
If this is right
- LMMs gain the ability to identify and track specific object instances across image sequences and edits.
- Systems can modify only designated regions while preserving the identity and appearance of untouched objects.
- Evaluation protocols shift from global scene metrics to instance-level precision and consistency measures.
- Development efforts converge on shared modeling paradigms that support all four tasks under a single architecture.
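The shift from global scene metrics to instance-level measures can be made concrete with a small sketch. This is an illustrative toy, not a protocol from the paper: the `instance_iou` helper and the 2x2 masks are invented here, assuming one matched ground-truth mask per predicted instance.

```python
import numpy as np

def instance_iou(pred_masks, gt_masks):
    """Per-instance IoU: each predicted mask is scored against its
    matched ground-truth mask, instead of pooling all pixels globally."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union else 1.0)
    return ious

# Two instances: one segmented perfectly, one missed entirely.
a_pred = np.array([[1, 1], [0, 0]], dtype=bool)
a_gt   = np.array([[1, 1], [0, 0]], dtype=bool)
b_pred = np.array([[0, 0], [0, 0]], dtype=bool)
b_gt   = np.array([[0, 0], [1, 1]], dtype=bool)

per_instance = instance_iou([a_pred, b_pred], [a_gt, b_gt])  # -> [1.0, 0.0]
```

Pooling all pixels before computing IoU would score this example 0.5 and hide that the second instance was missed entirely; the per-instance view surfaces exactly the failure mode the survey highlights.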
Where Pith is reading between the lines
- The review's structure could serve as a template for new benchmarks that test cross-task consistency rather than isolated capabilities.
- Future work might test whether the same object-centric priors transfer to video or 3D domains without additional supervision.
- If the four themes prove incomplete, the field would need a fifth category for object-centric reasoning over temporal or causal relations.
Load-bearing premise
That the reviewed papers sufficiently represent the full intersection of LMMs and object-centric vision and that this intersection indeed forms a coherent, extensible framework.
What would settle it
A systematic audit that finds a large fraction of high-impact LMM papers on object-level tasks either omit explicit object representations or achieve comparable gains without them.
read the original abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision--language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a comprehensive review of recent advances at the convergence of Large Multimodal Models (LMMs) and object-centric vision. It organizes the existing literature into four major themes—object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation—while summarizing key modeling paradigms, learning strategies, and evaluation protocols. The review concludes by discussing open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift.
Significance. If the curation and summaries are accurate and reasonably complete, the survey would provide a useful structured perspective for researchers working on extending LMMs from global scene understanding to precise object-level capabilities. Its primary contribution is organizational synthesis rather than new derivations or experiments, which is appropriate for a review paper in a fast-moving area; explicit credit is due for framing the four themes as a coherent lens and for highlighting actionable future directions without introducing unverified claims.
minor comments (2)
- [Abstract] Abstract and introduction: the phrasing that object-centric vision 'provides a principled framework' is presented as established motivation; a brief paragraph contrasting it with alternative (e.g., pixel- or region-based) approaches would clarify why the four-theme organization follows naturally rather than appearing as one possible taxonomy.
- The four-theme structure is clear, but boundary papers that span multiple themes (e.g., a method that performs both referring segmentation and editing) should be explicitly noted so readers understand how overlaps are handled.
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of our survey, as well as for recommending minor revision. The referee's assessment correctly captures the paper's organizational structure around the four themes, its synthesis of paradigms and challenges, and its focus on future directions without overclaiming novelty. Since no specific major comments were raised in the report, we have no points requiring rebuttal or clarification at this time. We will incorporate any minor suggestions during the revision process to further strengthen the manuscript.
Circularity Check
No significant circularity; survey paper with no derivations or fitted quantities
full rationale
This paper is a literature review that organizes external work into four themes (object-centric understanding, referring segmentation, editing, and generation) and summarizes paradigms, strategies, and protocols. No equations, predictions, parameters, or derivation chains appear in the abstract or described structure. The claim that object-centric vision supplies a 'principled framework' is motivational framing rather than a testable proposition whose validity depends on internal reductions. All cited results are external to the present manuscript, so no self-citation load-bearing, self-definitional, or fitted-input patterns exist. The contribution is curation and perspective, which remains valid independently of any single assumption or result.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
GPT-4V(ision) system card. 2023
2023
-
[2]
Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes
Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. 2020
2020
-
[3]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022
2022
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Fan Yang, Wenbin Ge, Han Yu, Fei Huang, Binyuan Hui, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023
2023
-
[5]
Real-time 3d-aware portrait editing from a single image
Qingyan Bai, Zifan Shi, Yinghao Xu, Hao Ouyang, Qiuyu Wang, Ceyuan Yang, Xuan Wang, Gordon Wetzstein, Yujun Shen, and Qifeng Chen. Real-time 3d-aware portrait editing from a single image. In European Conference on Computer Vision, pages 344–362. Springer, 2024
2024
-
[6]
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset. arXiv preprint arXiv:2510.15742, 2025
-
[7]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025
2025
-
[8]
Qwen2.5-VL Technical Report
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. 2025
2025
-
[9]
One token to seg them all: Language instructed reasoning segmentation in videos
Zechen Bai, Joya Chen, Ziteng Gao, Tong He, Lei Liu, Haiyang Mei, Mike Zheng Shou, Pichao Wang, and Zheng Zhang. One token to seg them all: Language instructed reasoning segmentation in videos. Advances in Neural Information Processing Systems 37, 2024
2024
-
[10]
Flux.1-dev, 2024
Black Forest Labs. Flux.1-dev, 2024
2024
-
[11]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023
2023
-
[12]
Ledits++: Limitless image editing using text-to-image models
Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, and Apolinário Passos. Ledits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[13]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
2023
-
[14]
Pixel-level reasoning segmentation via multi-turn conversations
Dunbo Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Feng Shi, Yifei Zhang, and Soujanya Poria. Pixel-level reasoning segmentation via multi-turn conversations. 2025
2025
-
[15]
Vip-llava: Making large multimodal models understand arbitrary visual prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Vip-llava: Making large multimodal models understand arbitrary visual prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[16]
Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing
Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
2023
-
[17]
Shixiang Cao, L. Gui, and Yu-Xiong Wang. Emergent visual grounding in large multimodal models without grounding supervision. arXiv preprint arXiv:2410.08209, 2024
-
[18]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025
2025
-
[19]
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2022
-
[20]
Scanrefer: 3d object localization in rgb-d scans using natural language
Dave Zhenyu Chen, Anne Lynn S. Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. 2020
2020
-
[21]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023
2023
-
[22]
Revisiting referring expression comprehension evaluation in the era of large multimodal models
Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025
2025
-
[23]
Text2shape: Generating shapes from natural language by learning joint embeddings
Kai Chen, Christopher Choy, Manolis Savva, Anne Lynn S. Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. 2019
2019
-
[24]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023
2023
-
[25]
3d-dres: Detailed 3d referring expression segmentation
Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, and Liujuan Cao. 3d-dres: Detailed 3d referring expression segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026
2026
-
[26]
Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Language conditioned spatial relation reasoning for 3d object grounding. arXiv preprint arXiv:2211.09646, 2022
-
[27]
Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, et al. Edival-agent: An object-centric framework for automated, fine-grained evaluation of multi-turn editing. arXiv preprint arXiv:2509.13399, 2025
-
[28]
Panda-70m: Captioning 70m videos with multiple cross-modality teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-Wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Feng Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[29]
Gaussianeditor: Swift and controllable 3d editing with gaussian splatting
Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[30]
Contextflow: Training-free video object editing via adaptive context enrichment
Yiyang Chen, Xuanhua He, Xiujun Ma, and Jack Ma. Contextflow: Training-free video object editing via adaptive context enrichment. 2026
2026
-
[31]
How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024
2024
-
[32]
Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems 37, 2024
An-Chieh Cheng, Yang Fu, Qiushan Guo, Jan Kautz, Sifei Liu, Xiaolong Wang, Ruihan Yang, and Hongxu Yin. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems 37, 2024
2024
-
[33]
3d aware region prompted vision language model
An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, and Sifei Liu. 3d aware region prompted vision language model. arXiv preprint arXiv:2509.13317, 2025
-
[34]
Masked-attention mask transformer for universal image segmentation
Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022
2022
-
[35]
Instructblip: Towards general-purpose vision-language models with instruction tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023
2023
-
[36]
RynnBrain: Open Embodied Foundation Models. arXiv preprint arXiv:2602.14979, 2026
Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, et al. Rynnbrain: Open embodied foundation models. arXiv preprint arXiv:2602.14979, 2026
- [37]
-
[38]
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023
-
[39]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
2023
-
[40]
Mevis: A large-scale benchmark for video segmentation with motion expressions
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2694–2703, 2023
2023
-
[41]
Multimodal referring segmentation: A survey
Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, and Yu-Gang Jiang. Multimodal referring segmentation: A survey. 2025
2025
-
[42]
Z3d: Zero-shot 3d visual grounding from images. arXiv preprint arXiv:2602.03361, 2026
Nikita Drozdov, Andrey Lemeshko, Nikita Gavrilov, Anton Konushin, Danila Rukhovich, and Maksim Kolodiazhnyi. Z3d: Zero-shot 3d visual grounding from images. arXiv preprint arXiv:2602.03361, 2026
-
[43]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024
2024
-
[44]
Videoagent: A memory-augmented multimodal agent for video understanding
Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. 2024
2024
-
[45]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022
2022
-
[46]
Mani-gs: Gaussian splatting manipulation with triangular mesh
Xiangjun Gao, Xiaoyu Li, Yiyu Zhuang, Qi Zhang, Wenbo Hu, Chaopeng Zhang, Yao Yao, Ying Shan, and Long Quan. Mani-gs: Gaussian splatting manipulation with triangular mesh. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[47]
Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance
Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2(1):32, 2024
2024
-
[48]
Tokenflow: Consistent diffusion features for consistent video editing
Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023
-
[49]
The devil is in temporal token: High quality video reasoning segmentation
Shaogang Gong, Yunzhi Zhuge, Pengfei Zhang, Zongxin Yang, Pingping Zhang, and Huchuan Lu. The devil is in temporal token: High quality video reasoning segmentation. 2025
2025
-
[50]
Demystifying flux architecture. arXiv preprint arXiv:2507.09595, 2025
Or Greenberg. Demystifying flux architecture. arXiv preprint arXiv:2507.09595, 2025
-
[51]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025
2025
-
[52]
Regiongpt: Towards region understanding vision language model
Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. Regiongpt: Towards region understanding vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[53]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023
2023
-
[54]
Lvis: A dataset for large vocabulary instance segmentation
Agrim Gupta, Piotr Dollár, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019
2019
-
[55]
Merlin: Multimodal embedding refinement via llm-based iterative navigation for text-video retrieval-rerank pipeline
Donghoon Han, Eunhwan Park, Gisang Lee, Adam Lee, and Nojun Kwak. Merlin: Multimodal embedding refinement via llm-based iterative navigation for text-video retrieval-rerank pipeline. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2024
2024
-
[56]
Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, and Seon Joo Kim. Decomposed attention fusion in mllms for training-free video reasoning segmentation. arXiv preprint arXiv:2510.19592, 2025
2025
-
[57]
Multi-modal instruction tuned llms with fine-grained visual perception
Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, and Xuansong Xie. Multi-modal instruction tuned llms with fine-grained visual perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[58]
Refmask3d: Language-guided transformer for 3d referring segmentation
Shuting He and Henghui Ding. Refmask3d: Language-guided transformer for 3d referring segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024
2024
-
[59]
Omni-rgpt: Unifying image and video region-level understanding via token marks
Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, and Ryo Hachiuma. Omni-rgpt: Unifying image and video region-level understanding via token marks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[60]
Prompt-to-Prompt Image Editing with Cross Attention Control
Amir Hertz, Ron Mokady, Jay M. Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022
2022
-
[61]
Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020
2020
-
[62]
Vision-language-action models for autonomous driving: Past, present, and future
Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025
-
[63]
Finecaption: Compositional image captioning focusing on wherever you want at any granularity
Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. Finecaption: Compositional image captioning focusing on wherever you want at any granularity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[64]
Viewsrd: 3d visual grounding via structured multi-view decomposition
Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, and Shengfeng He. Viewsrd: 3d visual grounding via structured multi-view decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025
2025
-
[65]
Dive: Taming dino for subject-driven video editing
Yi Huang, Wei Xiong, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, and Shifeng Chen. Dive: Taming dino for subject-driven video editing. 2024
2024
-
[66]
Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, and Zhi Wang. Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world. arXiv preprint arXiv:2603.12746, 2026
-
[67]
Identity decoupling for multi-subject personalization of text-to-image models. Advances in Neural Information Processing Systems 37, 2024
Sung Ju Hwang, Sangwon Jang, Jaehyeong Jo, and Kimin Lee. Identity decoupling for multi-subject personalization of text-to-image models. Advances in Neural Information Processing Systems 37, 2024
2024
-
[68]
Pixelman: Consistent object editing with diffusion models via pixel manipulation and generation
L. Jiang, Negar Hassanpour, Mohammad Salameh, Mohammadreza Samadi, Jiao He, Fengyu Sun, and Di Tao Niu. Pixelman: Consistent object editing with diffusion models via pixel manipulation and generation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
2025
-
[69]
Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in Neural Information Processing Systems 37, 2024
Yi Jiang, Bingyue Peng, Keyu Tian, Liwei Wang, and Zehuan Yuan. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in Neural Information Processing Systems 37, 2024
2024
-
[70]
Dang Jisheng, Wu Xudong, Wang Bimei, Lv Ning, Chen Jiayu, Jingwen Zhao, Jizhao Liu, Juncheng Li, Teng Wang, et al. Decoupled seg tokens make stronger reasoning video segmenter and grounder. arXiv preprint arXiv:2506.22880, 2025
-
[71]
Hao Kang, Stathi Fotiadis, Liming Jiang, Yan Qing, Yiwei Jia, Zichuan Liu, Min Jin Chong, and Xin Lu. Flux already knows – activating subject-driven image generation without training. arXiv preprint arXiv:2504.11478, 2025
-
[72]
PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
Mi Jeong Kim, Gunhee Kim, Jin Sun Choi, Wonjae Roh, and Bohyung Han. Physgaia: A physics-aware benchmark with multi-body interactions for dynamic novel view synthesis. arXiv preprint arXiv:2506.02794, 2025
2025
-
[73]
Visual genome: Connecting language and vision using crowdsourced dense image annotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017
2017
-
[74]
Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468, 2024
-
[75]
Multi-concept customization of text-to-image diffusion
Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
2023
-
[76]
Hmar: Efficient hierarchical masked auto-regressive image generation
Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Mingyu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher Ré, and David W. Romero. Hmar: Efficient hierarchical masked auto-regressive image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[77]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[78]
Text4seg: Reimagining image segmentation as text generation. arXiv preprint arXiv:2410.09855, 2024
Mengcheng Lan, Zhaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wei Zhang. Text4seg: Reimagining image segmentation as text generation. arXiv preprint arXiv:2410.09855, 2024
-
[79]
Mengcheng Lan, Zhaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, and Song Bai. Text4seg++: Advancing image segmentation via generative language modeling. arXiv preprint arXiv:2509.06321, 2025
-
[80]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
2024