pith. machine review for the scientific record.

arxiv: 2602.06663 · v2 · submitted 2026-02-06 · 💻 cs.CV

Recognition: no theorem link

PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks


Pith reviewed 2026-05-16 06:57 UTC · model grok-4.3

classification 💻 cs.CV
keywords: PlanViz, benchmark, image generation, image editing, planning, unified multimodal models, computer-use tasks, route planning

The pith

PlanViz is a benchmark that tests unified multimodal models on generating and editing images for everyday planning tasks such as routes, diagrams, and interfaces.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PlanViz as a benchmark to assess unified multimodal models on generating and editing images that support planning in everyday computer tasks. It targets three specific areas: planning routes, creating work diagrams, and displaying web interfaces or UIs. By using human-annotated questions and reference images with quality controls, it provides a structured way to test spatial reasoning and procedural understanding. The evaluation uses a custom PlanScore that looks at correctness, visual quality, and efficiency. Experiments reveal that current models struggle with these tasks, pointing to needed improvements in planning capabilities.

Core claim

PlanViz is a benchmark consisting of three sub-tasks—route planning, work diagramming, and web&UI displaying—designed to evaluate image generation and editing in unified multimodal models for computer-use scenarios, using human-annotated data with quality control and the PlanScore metric for assessing correctness, visual quality, and efficiency.

What carries the argument

The PlanViz benchmark, which defines three representative sub-tasks and applies the PlanScore metric to measure generated images for correctness, visual quality, and efficiency.

If this is right

  • Unified multimodal models currently lack strong capabilities in spatial reasoning and procedural understanding for planning-oriented image tasks.
  • The benchmark provides a way to measure specific weaknesses in image generation and editing for practical computer-use applications.
  • Future model development can focus on improving performance in route planning, diagramming, and UI display to better support daily tasks.
  • Quality-controlled human annotations offer a reliable standard for tracking progress in these planning capabilities.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Improved results on PlanViz could enable more reliable AI assistance for real-world navigation apps and interface prototyping.
  • The sub-task structure might be extended to additional planning domains such as scheduling visuals or assembly instructions.
  • Automated scoring extensions could allow faster iteration when testing larger numbers of models on similar planning tasks.

Load-bearing premise

The three sub-tasks and human-annotated data with quality control sufficiently capture the planning capabilities needed for computer-use scenarios.

What would settle it

A model producing high PlanScore results across all three sub-tasks on the human-annotated test set without targeted fine-tuning would show that current limitations in spatial and procedural image generation have already been overcome.

Figures

Figures reproduced from arXiv: 2602.06663 by Fan Li, Jiaqi Xu, Junxian Li, Kai Liu, Leyang Chen, Linghe Kong, Renjing Pei, Weida Wang, Yulun Zhang, Zhixin Wang.

Figure 1: Examples of generation (left) and editing (right).
Figure 2: The overview of PlanViz. Our evaluation includes image generation and editing, with three proposed sub-tasks: route …
Figure 3: The distribution (left) and the word cloud (right) of …
Figure 4: Pipeline of data construction. It consists of four stages: high-quality data collecting and cleaning, human annotation, …
Figure 5: Difference between MLLM-as-judges in previous …
Figure 6: Score distribution across different models. We …
Figure 7: Case studies of different models on different sub-tasks. We choose both open-source and closed-source models.

Abstract

Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential in supporting computer-use planning tasks, which are closely related to our lives, remains underexplored. Image generation and editing in computer-use tasks require capabilities like spatial reasoning and procedural understanding, and it is still unknown whether UMMs have the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To achieve the goal of our evaluation, we focus on sub-tasks that are frequently involved in daily life and require planning. Specifically, three representative sub-tasks are designed: route planning, work diagramming, and web&UI displaying. We address challenges in ensuring data quality by curating human-annotated questions and reference images and by applying a quality control process. For detailed and exact evaluation, a task-adaptive score, PlanScore, is proposed. The score helps in understanding the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces PlanViz, a benchmark for evaluating unified multimodal models (UMMs) on planning-oriented image generation and editing tasks for computer-use scenarios. It defines three representative sub-tasks (route planning, work diagramming, and web&UI displaying), curates human-annotated questions and reference images with a quality control process, and proposes the task-adaptive PlanScore metric to assess correctness, visual quality, and efficiency. Experiments are described at a high level as highlighting key limitations of current UMMs and opportunities for future work.

Significance. If validated, PlanViz could fill a gap in multimodal evaluation by targeting spatial reasoning and procedural planning capabilities relevant to real-world computer-use applications. The human-annotation pipeline and adaptive scoring approach offer potential for more nuanced assessment than generic image metrics, which could inform model development in this domain.

major comments (3)
  1. [Abstract / §3] The abstract and benchmark description provide no quantitative details on validation of the PlanScore metric (e.g., correlation with human judgments or inter-annotator agreement), which is load-bearing for claims that it enables 'detailed and exact evaluation' of correctness, quality, and efficiency.
  2. [§4] §4 (Experiments): The section reports only high-level findings on model limitations without quantitative results, tables of scores across models/sub-tasks, or details on the experimental setup (e.g., model selection, generation parameters, or statistical significance), preventing verification of the highlighted opportunities.
  3. [§3.1] §3.1: The claim that the three sub-tasks 'sufficiently capture the planning capabilities' for computer-use scenarios rests on representativeness without supporting evidence such as coverage analysis or comparison to broader task taxonomies.
minor comments (2)
  1. [§3.3] Notation for PlanScore components is introduced without a clear equation or pseudocode definition, which would improve reproducibility.
  2. [Figures 2-4] Figure captions for example generations lack explicit labels for sub-task type and reference vs. generated images.
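
Minor comment 1 asks for an explicit definition of PlanScore. This page does not reproduce the paper's formula, so the sketch below is purely illustrative: it assumes the three named components are each scored in [0, 1] and combined by a weighted mean. The class name, the function name, and the weights are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScores:
    """Per-image component scores, each assumed to lie in [0, 1]."""
    correctness: float
    visual_quality: float
    efficiency: float


# Hypothetical weights chosen only for illustration; the paper may weight
# or combine the components differently.
ASSUMED_WEIGHTS = {"correctness": 0.5, "visual_quality": 0.25, "efficiency": 0.25}


def aggregate_score(scores, weights=ASSUMED_WEIGHTS):
    """Weighted mean of the component scores (an assumed aggregation rule)."""
    for name in weights:
        value = getattr(scores, name)
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    total_weight = sum(weights.values())
    return sum(w * getattr(scores, name) for name, w in weights.items()) / total_weight
```

Writing the aggregation out this way exposes the choices a reader would need in order to reproduce the scores: the range of each component, the weights, and the combination rule.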

Simulated Authors' Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which highlights important areas for improvement in our manuscript. We agree that strengthening the validation of PlanScore, expanding the experimental section with quantitative details, and providing more evidence for sub-task selection will enhance the paper's rigor and clarity. We will incorporate these changes in the revised version.

Point-by-point responses
  1. Referee: [Abstract / §3] The abstract and benchmark description provide no quantitative details on validation of the PlanScore metric (e.g., correlation with human judgments or inter-annotator agreement), which is load-bearing for claims that it enables 'detailed and exact evaluation' of correctness, quality, and efficiency.

    Authors: We agree that quantitative validation is necessary to support the claims about PlanScore. In the revised manuscript, we will add specific details on inter-annotator agreement for the human annotations and correlation analysis between PlanScore and human judgments, to be included in §3 and referenced in the abstract.
    Revision: yes

  2. Referee: [§4] §4 (Experiments): The section reports only high-level findings on model limitations without quantitative results, tables of scores across models/sub-tasks, or details on the experimental setup (e.g., model selection, generation parameters, or statistical significance), preventing verification of the highlighted opportunities.

    Authors: We acknowledge that the current §4 lacks sufficient quantitative detail. We will revise this section to include full tables of scores across models and sub-tasks, complete experimental setup information (model selection, generation parameters), and any relevant statistical analyses to allow verification of the findings and opportunities identified.
    Revision: yes

  3. Referee: [§3.1] §3.1: The claim that the three sub-tasks 'sufficiently capture the planning capabilities' for computer-use scenarios rests on representativeness without supporting evidence such as coverage analysis or comparison to broader task taxonomies.

    Authors: We will strengthen §3.1 by adding supporting evidence for the sub-task selection, including references to existing computer-use task taxonomies and a brief coverage analysis demonstrating how the chosen sub-tasks (route planning, work diagramming, and web&UI displaying) align with core planning requirements in daily computer-use scenarios.
    Revision: yes
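
The score tables and statistical analyses promised in response 2 could take a form like the following sketch. It assumes per-image scores are available for each model and sub-task and uses a percentile bootstrap for the confidence interval. The data layout, the function names, and the choice of bootstrap are assumptions made for illustration, not the authors' stated method.

```python
import random
from statistics import mean


def bootstrap_ci(scores, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `scores`."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores)
    means = sorted(mean(rng.choices(scores, k=n)) for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


def summarize(results):
    """Flatten {model: {sub_task: [scores]}} into (model, sub_task, mean, lo, hi) rows."""
    rows = []
    for model, by_task in sorted(results.items()):
        for task, scores in sorted(by_task.items()):
            lo, hi = bootstrap_ci(scores)
            rows.append((model, task, mean(scores), lo, hi))
    return rows
```

Reporting an interval alongside each mean lets a reader judge whether a gap between two models on a sub-task is larger than the sampling noise in the benchmark items.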

Circularity Check

0 steps flagged

No significant circularity in benchmark construction


The paper proposes PlanViz as a new benchmark with three sub-tasks (route planning, work diagramming, web&UI displaying), human-annotated data under quality control, and the PlanScore metric. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. The central claims rest on independent data curation and metric definition rather than self-referential definitions, self-citation chains, or renamed known results. This is standard benchmark design with no load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The benchmark rests on the domain assumption that the three selected sub-tasks adequately represent planning needs in computer-use scenarios and that human-annotated references with quality control produce reliable evaluation data; no free parameters or invented entities are introduced.

axioms (1)
  • Domain assumption: The three sub-tasks (route planning, work diagramming, web&UI displaying) represent key planning capabilities required in computer-use tasks.
    Stated explicitly when defining the benchmark focus in the abstract.

pith-pipeline@v0.9.0 · 5522 in / 1390 out tokens · 87376 ms · 2026-05-16T06:57:59.202111+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models

    cs.CV 2026-05 unverdicted novelty 7.0

    G²TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.
