Recognition: unknown
LMMs Meet Object-Centric Vision: Understanding, Segmentation, Editing and Generation
Pith reviewed 2026-05-10 15:31 UTC · model grok-4.3
The pith
Object-centric vision supplies a framework that extends LMMs to precise object-level understanding, segmentation, editing, and generation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that object-centric vision supplies a principled framework for addressing LMM limitations in instance identification, identity preservation, and precise localization by promoting explicit representations and operations over visual entities. It organizes the literature into object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation; summarizes the key modeling paradigms, learning strategies, and evaluation protocols supporting these capabilities; and outlines open challenges including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift.
What carries the argument
The four-theme organization of object-centric visual understanding, referring segmentation, visual editing, and visual generation that structures the surveyed advances.
If this is right
- LMMs gain the ability to identify and track specific object instances across image sequences and edits.
- Systems can modify only designated regions while preserving the identity and appearance of untouched objects.
- Evaluation protocols shift from global scene metrics to instance-level precision and consistency measures.
- Development efforts converge on shared modeling paradigms that support all four tasks under a single architecture.
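The shift from global scene metrics to instance-level measures can be made concrete with a small sketch. This is an illustrative toy, not a protocol from the paper: the `instance_iou` helper and the 2x2 masks are invented here, assuming one matched ground-truth mask per predicted instance.

```python
import numpy as np

def instance_iou(pred_masks, gt_masks):
    """Per-instance IoU: each predicted mask is scored against its
    matched ground-truth mask, instead of pooling all pixels globally."""
    ious = []
    for pred, gt in zip(pred_masks, gt_masks):
        inter = np.logical_and(pred, gt).sum()
        union = np.logical_or(pred, gt).sum()
        ious.append(inter / union if union else 1.0)
    return ious

# Two instances: one segmented perfectly, one missed entirely.
a_pred = np.array([[1, 1], [0, 0]], dtype=bool)
a_gt   = np.array([[1, 1], [0, 0]], dtype=bool)
b_pred = np.array([[0, 0], [0, 0]], dtype=bool)
b_gt   = np.array([[0, 0], [1, 1]], dtype=bool)

per_instance = instance_iou([a_pred, b_pred], [a_gt, b_gt])  # -> [1.0, 0.0]
```

Pooling all pixels before computing IoU would score this example 0.5 and hide that the second instance was missed entirely; the per-instance view surfaces exactly the failure mode the survey highlights.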
Where Pith is reading between the lines
- The review's structure could serve as a template for new benchmarks that test cross-task consistency rather than isolated capabilities.
- Future work might test whether the same object-centric priors transfer to video or 3D domains without additional supervision.
- If the four themes prove incomplete, the field would need a fifth category for object-centric reasoning over temporal or causal relations.
Load-bearing premise
That the reviewed papers sufficiently represent the full intersection of LMMs and object-centric vision and that this intersection indeed forms a coherent, extensible framework.
What would settle it
A systematic audit that finds a large fraction of high-impact LMM papers on object-level tasks either omit explicit object representations or achieve comparable gains without them.
read the original abstract
Large Multimodal Models (LMMs) have achieved remarkable progress in general-purpose vision--language understanding, yet they remain limited in tasks requiring precise object-level grounding, fine-grained spatial reasoning, and controllable visual manipulation. In particular, existing systems often struggle to identify the correct instance, preserve object identity across interactions, and localize or modify designated regions with high precision. Object-centric vision provides a principled framework for addressing these challenges by promoting explicit representations and operations over visual entities, thereby extending multimodal systems from global scene understanding to object-level understanding, segmentation, editing, and generation. This paper presents a comprehensive review of recent advances at the convergence of LMMs and object-centric vision. We organize the literature into four major themes: object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation. We further summarize the key modeling paradigms, learning strategies, and evaluation protocols that support these capabilities. Finally, we discuss open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift. We hope this paper provides a structured perspective on the development of scalable, precise, and trustworthy object-centric multimodal systems.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper presents a comprehensive review of recent advances at the convergence of Large Multimodal Models (LMMs) and object-centric vision. It organizes the existing literature into four major themes—object-centric visual understanding, object-centric referring segmentation, object-centric visual editing, and object-centric visual generation—while summarizing key modeling paradigms, learning strategies, and evaluation protocols. The review concludes by discussing open challenges and future directions, including robust instance permanence, fine-grained spatial control, consistent multi-step interaction, unified cross-task modeling, and reliable benchmarking under distribution shift.
Significance. If the curation and summaries are accurate and reasonably complete, the survey would provide a useful structured perspective for researchers working on extending LMMs from global scene understanding to precise object-level capabilities. Its primary contribution is organizational synthesis rather than new derivations or experiments, which is appropriate for a review paper in a fast-moving area; explicit credit is due for framing the four themes as a coherent lens and for highlighting actionable future directions without introducing unverified claims.
minor comments (2)
- [Abstract] Abstract and introduction: the phrasing that object-centric vision 'provides a principled framework' is presented as established motivation; a brief paragraph contrasting it with alternative (e.g., pixel- or region-based) approaches would clarify why the four-theme organization follows naturally rather than appearing as one possible taxonomy.
- The four-theme structure is clear, but boundary papers that span multiple themes (e.g., a method that performs both referring segmentation and editing) should be explicitly noted so readers understand how overlaps are handled.
Simulated Author's Rebuttal
We thank the referee for their positive and accurate summary of our survey, as well as for recommending minor revision. The referee's assessment correctly captures the paper's organizational structure around the four themes, its synthesis of paradigms and challenges, and its focus on future directions without overclaiming novelty. Since no specific major comments were raised in the report, we have no points requiring rebuttal or clarification at this time. We will incorporate any minor suggestions during the revision process to further strengthen the manuscript.
Circularity Check
No significant circularity; survey paper with no derivations or fitted quantities
full rationale
This paper is a literature review that organizes external work into four themes (object-centric understanding, referring segmentation, editing, and generation) and summarizes paradigms, strategies, and protocols. No equations, predictions, parameters, or derivation chains appear in the abstract or described structure. The claim that object-centric vision supplies a 'principled framework' is motivational framing rather than a testable proposition whose validity depends on internal reductions. All cited results are external to the present manuscript, so no self-citation load-bearing, self-definitional, or fitted-input patterns exist. The contribution is curation and perspective, which remains valid independently of any single assumption or result.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
GPT-4V(ision) system card. 2023
2023
-
[2]
Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes
Panos Achlioptas, Ahmed Abdelreheem, Fei Xia, Mohamed Elhoseiny, and Leonidas Guibas. Referit3d: Neural listeners for fine-grained 3d object identification in real-world scenes. 2020
2020
-
[3]
Flamingo: a visual language model for few-shot learning
Jean-Baptiste Alayrac, Jeff Donahue, Pauline Luc, Antoine Miech, Iain Barr, Yana Hasson, Karel Lenc, Arthur Mensch, Katherine Millican, Malcolm Reynolds, et al. Flamingo: a visual language model for few-shot learning. Advances in Neural Information Processing Systems, 35:23716–23736, 2022
2022
-
[4]
Jinze Bai, Shuai Bai, Yunfei Chu, Zeyu Cui, Kai Dang, Xiaodong Deng, Fan Yang, Wenbin Ge, Han Yu, Fei Huang, Binyuan Hui, et al. Qwen technical report. arXiv preprint arXiv:2309.16609, 2023
2023
-
[5]
Real-time 3d-aware portrait editing from a single image
Qingyan Bai, Zifan Shi, Yinghao Xu, Hao Ouyang, Qiuyu Wang, Ceyuan Yang, Xuan Wang, Gordon Wetzstein, Yujun Shen, and Qifeng Chen. Real-time 3d-aware portrait editing from a single image. In European Conference on Computer Vision, pages 344–362. Springer, 2024
2024
-
[6]
Qingyan Bai, Qiuyu Wang, Hao Ouyang, Yue Yu, Hanlin Wang, Wen Wang, Ka Leong Cheng, Shuailei Ma, Yanhong Zeng, Zichen Liu, et al. Scaling instruction-based video editing with a high-quality synthetic dataset. arXiv preprint arXiv:2510.15742, 2025
-
[7]
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-vl technical report. arXiv preprint arXiv:2511.21631, 2025
2025
-
[8]
Qwen2.5-VL Technical Report
Shuai Bai, Keqin Chen, Xuejing Liu, Jialin Wang, Wenbin Ge, Sibo Song, Kai Dang, Peng Wang, Shijie Wang, Jun Tang, et al. Qwen2.5-vl technical report. 2025
2025
-
[9]
One token to seg them all: Language instructed reasoning segmentation in videos
Zechen Bai, Joya Chen, Ziteng Gao, Tong He, Lei Liu, Haiyang Mei, Mike Zheng Shou, Pichao Wang, and Zheng Zhang. One token to seg them all: Language instructed reasoning segmentation in videos. Advances in Neural Information Processing Systems 37, 2024
2024
-
[10]
Flux.1-dev, 2024
Black Forest Labs. Flux.1-dev, 2024
2024
-
[11]
Stable Video Diffusion: Scaling Latent Video Diffusion Models to Large Datasets
Andreas Blattmann. Stable video diffusion: Scaling latent video diffusion models to large datasets. arXiv preprint arXiv:2311.15127, 2023
2023
-
[12]
Ledits++: Limitless image editing using text-to-image models
Manuel Brack, Felix Friedrich, Katharina Kornmeier, Linoy Tsaban, Patrick Schramowski, Kristian Kersting, and Apolinário Passos. Ledits++: Limitless image editing using text-to-image models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[13]
Tim Brooks, Aleksander Holynski, and Alexei A. Efros. Instructpix2pix: Learning to follow image editing instructions. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
2023
-
[14]
Pixel-level reasoning segmentation via multi-turn conversations
Dunbo Cai, Xiaocui Yang, Yongkang Liu, Daling Wang, Feng Shi, Yifei Zhang, and Soujanya Poria. Pixel-level reasoning segmentation via multi-turn conversations. 2025
2025
-
[15]
Vip-llava: Making large multimodal models understand arbitrary visual prompts
Mu Cai, Haotian Liu, Siva Karthik Mustikovela, Gregory P. Meyer, Yuning Chai, Dennis Park, and Yong Jae Lee. Vip-llava: Making large multimodal models understand arbitrary visual prompts. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[16]
Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing
Mingdeng Cao, Xintao Wang, Zhongang Qi, Ying Shan, Xiaohu Qie, and Yinqiang Zheng. Masactrl: Tuning-free mutual self-attention control for consistent image synthesis and editing. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023
2023
-
[17]
Shixiang Cao, L. Gui, and Yu-Xiong Wang. Emergent visual grounding in large multimodal models without grounding supervision. arXiv preprint arXiv:2410.08209, 2024
-
[18]
SAM 3: Segment Anything with Concepts
Nicolas Carion, Laura Gustafson, Yuan-Ting Hu, Shoubhik Debnath, Ronghang Hu, Didac Suris, Chaitanya Ryali, Kalyan Vasudev Alwala, Haitham Khedr, Andrew Huang, et al. Sam 3: Segment anything with concepts. arXiv preprint arXiv:2511.16719, 2025
2025
-
[19]
Huiwen Chang, Han Zhang, Lu Jiang, Ce Liu, and William T. Freeman. Maskgit: Masked generative image transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022
2022
-
[20]
Scanrefer: 3d object localization in rgb-d scans using natural language
Dave Zhenyu Chen, Anne Lynn S. Chang, and Matthias Nießner. Scanrefer: 3d object localization in rgb-d scans using natural language. 2020
2020
-
[21]
VideoCrafter1: Open Diffusion Models for High-Quality Video Generation
Haoxin Chen, Menghan Xia, Yingqing He, Yong Zhang, Xiaodong Cun, Shaoshu Yang, Jinbo Xing, Yaofang Liu, Qifeng Chen, Xintao Wang, Chao Weng, and Ying Shan. Videocrafter1: Open diffusion models for high-quality video generation. arXiv preprint arXiv:2310.19512, 2023
2023
-
[22]
Revisiting referring expression comprehension evaluation in the era of large multimodal models
Jierun Chen, Fangyun Wei, Jinjing Zhao, Sizhe Song, Bohuai Wu, Zhuoxuan Peng, S.-H. Gary Chan, and Hongyang Zhang. Revisiting referring expression comprehension evaluation in the era of large multimodal models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2025
2025
-
[23]
Text2shape: Generating shapes from natural language by learning joint embeddings
Kai Chen, Christopher Choy, Manolis Savva, Anne Lynn S. Chang, Thomas Funkhouser, and Silvio Savarese. Text2shape: Generating shapes from natural language by learning joint embeddings. 2019
2019
-
[24]
Shikra: Unleashing Multimodal LLM's Referential Dialogue Magic
Keqin Chen, Zhao Zhang, Weili Zeng, Richong Zhang, Feng Zhu, and Rui Zhao. Shikra: Unleashing multimodal llm's referential dialogue magic. arXiv preprint arXiv:2306.15195, 2023
2023
-
[25]
3d-dres: Detailed 3d referring expression segmentation
Qi Chen, Changli Wu, Jiayi Ji, Yiwei Ma, and Liujuan Cao. 3d-dres: Detailed 3d referring expression segmentation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2026
2026
-
[26]
Shizhe Chen, Pierre-Louis Guhur, Makarand Tapaswi, Cordelia Schmid, and Ivan Laptev. Language conditioned spatial relation reasoning for 3d object grounding. arXiv preprint arXiv:2211.09646, 2022
-
[27]
Tianyu Chen, Yasi Zhang, Zhi Zhang, Peiyu Yu, Shu Wang, Zhendong Wang, Kevin Lin, Xiaofei Wang, Zhengyuan Yang, Linjie Li, et al. Edival-agent: An object-centric framework for automated, fine-grained evaluation of multi-turn editing. arXiv preprint arXiv:2509.13399, 2025
-
[28]
Panda-70m: Captioning 70m videos with multiple cross-modality teachers
Tsai-Shien Chen, Aliaksandr Siarohin, Willi Menapace, Ekaterina Deyneka, Hsiang-Wei Chao, Byung Eun Jeon, Yuwei Fang, Hsin-Ying Lee, Jian Feng Ren, Ming-Hsuan Yang, and Sergey Tulyakov. Panda-70m: Captioning 70m videos with multiple cross-modality teachers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[29]
Gaussianeditor: Swift and controllable 3d editing with gaussian splatting
Yiwen Chen, Zilong Chen, Chi Zhang, Feng Wang, Xiaofeng Yang, Yikai Wang, Zhongang Cai, Lei Yang, Huaping Liu, and Guosheng Lin. Gaussianeditor: Swift and controllable 3d editing with gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[30]
Contextflow: Training-free video object editing via adaptive context enrichment
Yiyang Chen, Xuanhua He, Xiujun Ma, and Jack Ma. Contextflow: Training-free video object editing via adaptive context enrichment. 2026
2026
-
[31]
How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites
Zhe Chen, Weiyun Wang, Hao Tian, Shenglong Ye, Zhangwei Gao, Erfei Cui, Wenwen Tong, Kongzhi Hu, Jiapeng Luo, Zheng Ma, et al. How far are we to gpt-4v? closing the gap to commercial multimodal models with open-source suites. Science China Information Sciences, 67(12):220101, 2024
2024
-
[32]
Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems 37, 2024
An-Chieh Cheng, Yang Fu, Qiushan Guo, Jan Kautz, Sifei Liu, Xiaolong Wang, Ruihan Yang, and Hongxu Yin. Spatialrgpt: Grounded spatial reasoning in vision-language models. Advances in Neural Information Processing Systems 37, 2024
2024
-
[33]
3d aware region prompted vision language model
An-Chieh Cheng, Yang Fu, Yukang Chen, Zhijian Liu, Xiaolong Li, Subhashree Radhakrishnan, Song Han, Yao Lu, Jan Kautz, Pavlo Molchanov, Hongxu Yin, Xiaolong Wang, and Sifei Liu. 3d aware region prompted vision language model. arXiv preprint arXiv:2509.13317, 2025
-
[34]
Masked-attention mask transformer for universal image segmentation
Bowen Cheng, Ishan Misra, Alexander G Schwing, Alexander Kirillov, and Rohit Girdhar. Masked-attention mask transformer for universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1290–1299, 2022
2022
-
[35]
Instructblip: Towards general-purpose vision-language models with instruction tuning
Wenliang Dai, Junnan Li, Dongxu Li, Anthony Tiong, Junqi Zhao, Weisheng Wang, Boyang Li, Pascale N Fung, and Steven Hoi. Instructblip: Towards general-purpose vision-language models with instruction tuning. Advances in Neural Information Processing Systems, 36:49250–49267, 2023
2023
-
[36]
RynnBrain: Open Embodied Foundation Models. arXiv preprint arXiv:2602.14979, 2026
Ronghao Dang, Jiayan Guo, Bohan Hou, Sicong Leng, Kehan Li, Xin Li, Jiangpin Liu, Yunxuan Mao, Zhikai Wang, Yuqian Yuan, et al. Rynnbrain: Open embodied foundation models. arXiv preprint arXiv:2602.14979, 2026
- [37]
-
[38]
Matt Deitke, Ruoshi Liu, Matthew Wallingford, Huong Ngo, Oscar Michel, Aditya Kusupati, Alan Fan, Christian Laforte, Vikram Voleti, Samir Yitzhak Gadre, Eli VanderBilt, Aniruddha Kembhavi, Carl Vondrick, Georgia Gkioxari, Kiana Ehsani, Ludwig Schmidt, and Ali Farhadi. Objaverse-xl: A universe of 10m+ 3d objects. arXiv preprint arXiv:2307.05663, 2023
-
[39]
Objaverse: A universe of annotated 3d objects
Matt Deitke, Dustin Schwenk, Jordi Salvador, Luca Weihs, Oscar Michel, Eli VanderBilt, Ludwig Schmidt, Kiana Ehsani, Aniruddha Kembhavi, and Ali Farhadi. Objaverse: A universe of annotated 3d objects. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
2023
-
[40]
Mevis: A large-scale benchmark for video segmentation with motion expressions
Henghui Ding, Chang Liu, Shuting He, Xudong Jiang, and Chen Change Loy. Mevis: A large-scale benchmark for video segmentation with motion expressions. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 2694–2703, 2023
2023
-
[41]
Multimodal referring segmentation: A survey
Henghui Ding, Song Tang, Shuting He, Chang Liu, Zuxuan Wu, and Yu-Gang Jiang. Multimodal referring segmentation: A survey. 2025
2025
-
[42]
Z3d: Zero-shot 3d visual grounding from images. arXiv preprint arXiv:2602.03361, 2026
Nikita Drozdov, Andrey Lemeshko, Nikita Gavrilov, Anton Konushin, Danila Rukhovich, and Maksim Kolodiazhnyi. Z3d: Zero-shot 3d visual grounding from images. arXiv preprint arXiv:2602.03361, 2026
-
[43]
Scaling rectified flow transformers for high-resolution image synthesis
Patrick Esser, Sumith Kulal, Andreas Blattmann, Rahim Entezari, Jonas Müller, Harry Saini, Yam Levi, Dominik Lorenz, Axel Sauer, Frederic Boesel, et al. Scaling rectified flow transformers for high-resolution image synthesis. In Forty-first International Conference on Machine Learning, 2024
2024
-
[44]
Videoagent: A memory-augmented multimodal agent for video understanding
Yue Fan, Xiaojian Ma, Rujie Wu, Yuntao Du, Jiaqi Li, Zhi Gao, and Qing Li. Videoagent: A memory-augmented multimodal agent for video understanding. 2024
2024
-
[45]
An Image is Worth One Word: Personalizing Text-to-Image Generation using Textual Inversion
Rinon Gal, Yuval Alaluf, Yuval Atzmon, Or Patashnik, Amit H. Bermano, Gal Chechik, and Daniel Cohen-Or. An image is worth one word: Personalizing text-to-image generation using textual inversion. arXiv preprint arXiv:2208.01618, 2022
2022
-
[46]
Mani-gs: Gaussian splatting manipulation with triangular mesh
Xiangjun Gao, Xiaoyu Li, Yiyu Zhuang, Qi Zhang, Wenbo Hu, Chaopeng Zhang, Yao Yao, Ying Shan, and Long Quan. Mani-gs: Gaussian splatting manipulation with triangular mesh. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[47]
Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance
Zhangwei Gao, Zhe Chen, Erfei Cui, Yiming Ren, Weiyun Wang, Jinguo Zhu, Hao Tian, Shenglong Ye, Junjun He, Xizhou Zhu, et al. Mini-internvl: a flexible-transfer pocket multi-modal model with 5% parameters and 90% performance. Visual Intelligence, 2(1):32, 2024
2024
-
[48]
Tokenflow: Consistent diffusion features for consistent video editing
Michal Geyer, Omer Bar-Tal, Shai Bagon, and Tali Dekel. Tokenflow: Consistent diffusion features for consistent video editing. arXiv preprint arXiv:2307.10373, 2023
-
[49]
The devil is in temporal token: High quality video reasoning segmentation
Shaogang Gong, Yunzhi Zhuge, Pengfei Zhang, Zongxin Yang, Pingping Zhang, and Huchuan Lu. The devil is in temporal token: High quality video reasoning segmentation. 2025
2025
-
[50]
Demystifying flux architecture. arXiv preprint arXiv:2507.09595, 2025
Or Greenberg. Demystifying flux architecture. arXiv preprint arXiv:2507.09595, 2025
-
[51]
Dong Guo, Faming Wu, Feida Zhu, Fuxing Leng, Guang Shi, Haobin Chen, Haoqi Fan, Jian Wang, Jianyu Jiang, Jiawei Wang, et al. Seed1.5-VL technical report. arXiv preprint arXiv:2505.07062, 2025
2025
-
[52]
Regiongpt: Towards region understanding vision language model
Qiushan Guo, Shalini De Mello, Hongxu Yin, Wonmin Byeon, Ka Chun Cheung, Yizhou Yu, Ping Luo, and Sifei Liu. Regiongpt: Towards region understanding vision language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[53]
AnimateDiff: Animate Your Personalized Text-to-Image Diffusion Models without Specific Tuning
Yuwei Guo, Ceyuan Yang, Anyi Rao, Zhengyang Liang, Yaohui Wang, Yu Qiao, Maneesh Agrawala, Dahua Lin, and Bo Dai. Animatediff: Animate your personalized text-to-image diffusion models without specific tuning. arXiv preprint arXiv:2307.04725, 2023
2023
-
[54]
Lvis: A dataset for large vocabulary instance segmentation
Agrim Gupta, Piotr Dollár, and Ross Girshick. Lvis: A dataset for large vocabulary instance segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2019
2019
-
[55]
Merlin: Multimodal embedding refinement via llm-based iterative navigation for text-video retrieval-rerank pipeline
Donghoon Han, Eunhwan Park, Gisang Lee, Adam Lee, and Nojun Kwak. Merlin: Multimodal embedding refinement via llm-based iterative navigation for text-video retrieval-rerank pipeline. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing: Industry Track, 2024
2024
-
[56]
Decomposed Attention Fusion in MLLMs for Training-Free Video Reasoning Segmentation
Su Ho Han, Jeongseok Hyun, Pilhyeon Lee, Minho Shim, Dongyoon Wee, and Seon Joo Kim. Decomposed attention fusion in mllms for training-free video reasoning segmentation. arXiv preprint arXiv:2510.19592, 2025
2025
-
[57]
Multi-modal instruction tuned llms with fine-grained visual perception
Junwen He, Yifan Wang, Lijun Wang, Huchuan Lu, Jun-Yan He, Jin-Peng Lan, Bin Luo, and Xuansong Xie. Multi-modal instruction tuned llms with fine-grained visual perception. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[58]
Refmask3d: Language-guided transformer for 3d referring segmentation
Shuting He and Henghui Ding. Refmask3d: Language-guided transformer for 3d referring segmentation. In Proceedings of the 32nd ACM International Conference on Multimedia, 2024
2024
-
[59]
Omni-rgpt: Unifying image and video region-level understanding via token marks
Miran Heo, Min-Hung Chen, De-An Huang, Sifei Liu, Subhashree Radhakrishnan, Seon Joo Kim, Yu-Chiang Frank Wang, and Ryo Hachiuma. Omni-rgpt: Unifying image and video region-level understanding via token marks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[60]
Prompt-to-Prompt Image Editing with Cross Attention Control
Amir Hertz, Ron Mokady, Jay M. Tenenbaum, Kfir Aberman, Yael Pritch, and Daniel Cohen-Or. Prompt-to-prompt image editing with cross attention control. arXiv preprint arXiv:2208.01626, 2022
2022
-
[61]
Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020
Jonathan Ho, Ajay Jain, and Pieter Abbeel. Denoising diffusion probabilistic models. Advances in Neural Information Processing Systems, 33:6840–6851, 2020
2020
-
[62]
Vision-language-action models for autonomous driving: Past, present, and future
Tianshuai Hu, Xiaolu Liu, Song Wang, Yiyao Zhu, Ao Liang, Lingdong Kong, Guoyang Zhao, Zeying Gong, Jun Cen, Zhiyu Huang, et al. Vision-language-action models for autonomous driving: Past, present, and future. arXiv preprint arXiv:2512.16760, 2025
-
[63]
Finecaption: Compositional image captioning focusing on wherever you want at any granularity
Hang Hua, Qing Liu, Lingzhi Zhang, Jing Shi, Soo Ye Kim, Zhifei Zhang, Yilin Wang, Jianming Zhang, Zhe Lin, and Jiebo Luo. Finecaption: Compositional image captioning focusing on wherever you want at any granularity. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[64]
Viewsrd: 3d visual grounding via structured multi-view decomposition
Ronggang Huang, Haoxin Yang, Yan Cai, Xuemiao Xu, Huaidong Zhang, and Shengfeng He. Viewsrd: 3d visual grounding via structured multi-view decomposition. In Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025
2025
-
[65]
Dive: Taming dino for subject-driven video editing
Yi Huang, Wei Xiong, He Zhang, Chaoqi Chen, Jianzhuang Liu, Mingfu Yan, and Shifeng Chen. Dive: Taming dino for subject-driven video editing. 2024
2024
-
[66]
Yuzhi Huang, Kairun Wen, Rongxin Gao, Dongxuan Liu, Yibin Lou, Jie Wu, Jing Xu, Jian Zhang, Zheng Yang, Yunlong Lin, Chenxin Li, Panwang Pan, Junbin Lu, Jingyan Jiang, Xinghao Ding, Yue Huang, and Zhi Wang. Thinking in dynamics: How multimodal large language models perceive, track, and reason dynamics in physical 4d world. arXiv preprint arXiv:2603.12746, 2026
-
[67]
Identity decoupling for multi-subject personalization of text-to-image models. Advances in Neural Information Processing Systems 37, 2024
Sung Ju Hwang, Sangwon Jang, Jaehyeong Jo, and Kimin Lee. Identity decoupling for multi-subject personalization of text-to-image models. Advances in Neural Information Processing Systems 37, 2024
2024
-
[68]
Pixelman: Consistent object editing with diffusion models via pixel manipulation and generation
L. Jiang, Negar Hassanpour, Mohammad Salameh, Mohammadreza Samadi, Jiao He, Fengyu Sun, and Di Tao Niu. Pixelman: Consistent object editing with diffusion models via pixel manipulation and generation. In Proceedings of the AAAI Conference on Artificial Intelligence, 2025
2025
-
[69]
Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in Neural Information Processing Systems 37, 2024
Yi Jiang, Bingyue Peng, Keyu Tian, Liwei Wang, and Zehuan Yuan. Visual autoregressive modeling: Scalable image generation via next-scale prediction. Advances in Neural Information Processing Systems 37, 2024
2024
-
[70]
Dang Jisheng, Wu Xudong, Wang Bimei, Lv Ning, Chen Jiayu, Jingwen Zhao, Jizhao Liu, Juncheng Li, Teng Wang, et al. Decoupled seg tokens make stronger reasoning video segmenter and grounder. arXiv preprint arXiv:2506.22880, 2025
-
[71]
Hao Kang, Stathi Fotiadis, Liming Jiang, Yan Qing, Yiwei Jia, Zichuan Liu, Min Jin Chong, and Xin Lu. Flux already knows – activating subject-driven image generation without training. arXiv preprint arXiv:2504.11478, 2025
-
[72]
PhysGaia: A Physics-Aware Benchmark with Multi-Body Interactions for Dynamic Novel View Synthesis
Mi Jeong Kim, Gunhee Kim, Jin Sun Choi, Wonjae Roh, and Bohyung Han. Physgaia: A physics-aware benchmark with multi-body interactions for dynamic novel view synthesis. arXiv preprint arXiv:2506.02794, 2025
2025
-
[73]
Visual genome: Connecting language and vision using crowdsourced dense image annotations
Ranjay Krishna, Yuke Zhu, Oliver Groth, Justin Johnson, Kenji Hata, Joshua Kravitz, Stephanie Chen, Yannis Kalantidis, Li-Jia Li, David A. Shamma, Michael S. Bernstein, and Li Fei-Fei. Visual genome: Connecting language and vision using crowdsourced dense image annotations. International Journal of Computer Vision, 2017
2017
-
[74]
Max Ku, Cong Wei, Weiming Ren, Huan Yang, and Wenhu Chen. Anyv2v: A tuning-free framework for any video-to-video editing tasks. arXiv preprint arXiv:2403.14468, 2024
-
[75]
Multi-concept customization of text-to-image diffusion
Nupur Kumari, Bingliang Zhang, Richard Zhang, Eli Shechtman, and Jun-Yan Zhu. Multi-concept customization of text-to-image diffusion. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023
2023
-
[76]
Hmar: Efficient hierarchical masked auto-regressive image generation
Hermann Kumbong, Xian Liu, Tsung-Yi Lin, Mingyu Liu, Xihui Liu, Ziwei Liu, Daniel Y. Fu, Christopher Ré, and David W. Romero. Hmar: Efficient hierarchical masked auto-regressive image generation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025
2025
-
[77]
Lisa: Reasoning segmentation via large language model
Xin Lai, Zhuotao Tian, Yukang Chen, Yanwei Li, Yuhui Yuan, Shu Liu, and Jiaya Jia. Lisa: Reasoning segmentation via large language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024
2024
-
[78]
Text4seg: Reimagining image segmentation as text generation. arXiv preprint arXiv:2410.09855, 2024
Mengcheng Lan, Zhaofeng Chen, Yue Zhou, Jiaxing Xu, Yiping Ke, Xinjiang Wang, Litong Feng, and Wei Zhang. Text4seg: Reimagining image segmentation as text generation. arXiv preprint arXiv:2410.09855, 2024
-
[79]
Mengcheng Lan, Zhaofeng Chen, Jiaxing Xu, Zongrui Li, Yiping Ke, Xudong Jiang, Yingchen Yu, Yunqing Zhao, and Song Bai. Text4seg++: Advancing image segmentation via generative language modeling. arXiv preprint arXiv:2509.06321, 2025
-
[80]
LLaVA-OneVision: Easy Visual Task Transfer
Bo Li, Yuanhan Zhang, Dong Guo, Renrui Zhang, Feng Li, Hao Zhang, Kaichen Zhang, Peiyuan Zhang, Yanwei Li, Ziwei Liu, et al. Llava-onevision: Easy visual task transfer. arXiv preprint arXiv:2408.03326, 2024
2024