PlanViz: Evaluating Planning-Oriented Image Generation and Editing for Computer-Use Tasks
Pith reviewed 2026-05-16 06:57 UTC · model grok-4.3
The pith
PlanViz is a benchmark that tests unified multimodal models on generating and editing images for everyday planning tasks such as routes, diagrams, and interfaces.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PlanViz is a benchmark consisting of three sub-tasks—route planning, work diagramming, and web&UI displaying—designed to evaluate image generation and editing in unified multimodal models for computer-use scenarios, using human-annotated data with quality control and the PlanScore metric for assessing correctness, visual quality, and efficiency.
What carries the argument
The PlanViz benchmark, which defines three representative sub-tasks and applies the PlanScore metric to measure generated images for correctness, visual quality, and efficiency.
If this is right
- Unified multimodal models currently lack strong capabilities in spatial reasoning and procedural understanding for planning-oriented image tasks.
- The benchmark provides a way to measure specific weaknesses in image generation and editing for practical computer-use applications.
- Future model development can focus on improving performance in route planning, diagramming, and UI display to better support daily tasks.
- Quality-controlled human annotations offer a reliable standard for tracking progress in these planning capabilities.
Where Pith is reading between the lines
- Improved results on PlanViz could enable more reliable AI assistance for real-world navigation apps and interface prototyping.
- The sub-task structure might be extended to additional planning domains such as scheduling visuals or assembly instructions.
- Automated scoring extensions could allow faster iteration when testing larger numbers of models on similar planning tasks.
Load-bearing premise
The three sub-tasks and human-annotated data with quality control sufficiently capture the planning capabilities needed for computer-use scenarios.
What would settle it
A model producing high PlanScore results across all three sub-tasks on the human-annotated test set without targeted fine-tuning would show that current limitations in spatial and procedural image generation have already been overcome.
Original abstract
Unified multimodal models (UMMs) have shown impressive capabilities in generating natural images and supporting multimodal reasoning. However, their potential for supporting computer-use planning tasks, which are closely tied to everyday life, remains underexplored. Image generation and editing in computer-use tasks require capabilities such as spatial reasoning and procedural understanding, and it is still unknown whether UMMs have the capabilities needed to complete these tasks. Therefore, we propose PlanViz, a new benchmark designed to evaluate image generation and editing for computer-use tasks. To this end, we focus on sub-tasks that occur frequently in daily life and require planning. Specifically, three representative sub-tasks are designed: route planning, work diagramming, and web&UI displaying. We address the challenge of ensuring data quality by curating human-annotated questions and reference images and by applying a quality control process. For detailed and exact evaluation, a task-adaptive score, PlanScore, is proposed. The score helps in understanding the correctness, visual quality, and efficiency of generated images. Through experiments, we highlight key limitations and opportunities for future research on this topic.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces PlanViz, a benchmark for evaluating unified multimodal models (UMMs) on planning-oriented image generation and editing tasks for computer-use scenarios. It defines three representative sub-tasks (route planning, work diagramming, and web&UI displaying), curates human-annotated questions and reference images with a quality control process, and proposes the task-adaptive PlanScore metric to assess correctness, visual quality, and efficiency. Experiments are described at a high level as highlighting key limitations of current UMMs and opportunities for future work.
Significance. If validated, PlanViz could fill a gap in multimodal evaluation by targeting spatial reasoning and procedural planning capabilities relevant to real-world computer-use applications. The human-annotation pipeline and adaptive scoring approach offer potential for more nuanced assessment than generic image metrics, which could inform model development in this domain.
major comments (3)
- [Abstract / §3] The abstract and benchmark description provide no quantitative details on validation of the PlanScore metric (e.g., correlation with human judgments or inter-annotator agreement), which is load-bearing for claims that it enables 'detailed and exact evaluation' of correctness, quality, and efficiency.
- [§4] §4 (Experiments): The section reports only high-level findings on model limitations without quantitative results, tables of scores across models/sub-tasks, or details on the experimental setup (e.g., model selection, generation parameters, or statistical significance), preventing verification of the highlighted opportunities.
- [§3.1] §3.1: The claim that the three sub-tasks 'sufficiently capture the planning capabilities' for computer-use scenarios rests on representativeness without supporting evidence such as coverage analysis or comparison to broader task taxonomies.
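The validation asked for in the first major comment could be computed along the following lines. This is a minimal sketch, not the authors' procedure: it assumes that each image has one metric score and one human rating, and that two annotators assign categorical labels to the same items. The function names are hypothetical, and the choice of Spearman correlation and Cohen's kappa is an assumption about which statistics would be reported.

```python
from collections import Counter


def average_ranks(values):
    """Return 1-based ranks, giving tied values the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of items tied with the item at position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) for constant input."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    var_x = sum((a - mx) ** 2 for a in x)
    var_y = sum((b - my) ** 2 for b in y)
    return cov / (var_x * var_y) ** 0.5


def spearman(metric_scores, human_ratings):
    """Spearman rank correlation: Pearson correlation of the tie-averaged ranks."""
    return pearson(average_ranks(metric_scores), average_ranks(human_ratings))


def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators on the same items."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    # Undefined when expected agreement is 1 (both annotators use one label).
    return (observed - expected) / (1 - expected)
```

A high rank correlation would indicate that the metric orders images the way human raters do, while kappa indicates how far the annotators agree beyond chance; neither statistic says anything about the absolute scale of the scores.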
minor comments (2)
- [§3.3] Notation for PlanScore components is introduced without a clear equation or pseudocode definition, which would improve reproducibility.
- [Figures 2-4] Figure captions for example generations lack explicit labels for sub-task type and reference vs. generated images.
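To make concrete what a pseudocode definition of this kind might look like, here is a purely illustrative sketch. The source does not give PlanScore's formula, so everything below is an assumption: the weighted-mean form, the [0, 1] range of each component, the default weights, and the names `ComponentScores` and `composite_score` are hypothetical and should not be read as the paper's definition.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ComponentScores:
    """Hypothetical per-image component scores, each assumed to lie in [0, 1]."""
    correctness: float
    visual_quality: float
    efficiency: float


def composite_score(scores, weights=(0.5, 0.25, 0.25)):
    """Weighted mean of the three components (an assumed aggregation rule).

    The weights are placeholders; a task-adaptive metric could plausibly
    use a different weight triple for each sub-task.
    """
    values = (scores.correctness, scores.visual_quality, scores.efficiency)
    if any(not 0.0 <= v <= 1.0 for v in values):
        raise ValueError("each component must lie in [0, 1]")
    if any(w < 0 for w in weights) or sum(weights) <= 0:
        raise ValueError("weights must be non-negative and not all zero")
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

Stating the aggregation rule this explicitly, whatever form it actually takes, is what would let other groups reproduce reported scores.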
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback, which highlights important areas for improvement in our manuscript. We agree that strengthening the validation of PlanScore, expanding the experimental section with quantitative details, and providing more evidence for sub-task selection will enhance the paper's rigor and clarity. We will incorporate these changes in the revised version.
- Referee: [Abstract / §3] The abstract and benchmark description provide no quantitative details on validation of the PlanScore metric (e.g., correlation with human judgments or inter-annotator agreement), which is load-bearing for claims that it enables 'detailed and exact evaluation' of correctness, quality, and efficiency.
  Authors: We agree that quantitative validation is necessary to support the claims about PlanScore. In the revised manuscript, we will add specific details on inter-annotator agreement for the human annotations and correlation analysis between PlanScore and human judgments, to be included in §3 and referenced in the abstract.
  Revision: yes
- Referee: [§4] §4 (Experiments): The section reports only high-level findings on model limitations without quantitative results, tables of scores across models/sub-tasks, or details on the experimental setup (e.g., model selection, generation parameters, or statistical significance), preventing verification of the highlighted opportunities.
  Authors: We acknowledge that the current §4 lacks sufficient quantitative detail. We will revise this section to include full tables of scores across models and sub-tasks, complete experimental setup information (model selection, generation parameters), and any relevant statistical analyses to allow verification of the findings and opportunities identified.
  Revision: yes
- Referee: [§3.1] §3.1: The claim that the three sub-tasks 'sufficiently capture the planning capabilities' for computer-use scenarios rests on representativeness without supporting evidence such as coverage analysis or comparison to broader task taxonomies.
  Authors: We will strengthen §3.1 by adding supporting evidence for the sub-task selection, including references to existing computer-use task taxonomies and a brief coverage analysis demonstrating how the chosen sub-tasks (route planning, work diagramming, and web&UI displaying) align with core planning requirements in daily computer-use scenarios.
  Revision: yes
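The per-model, per-sub-task tables promised in the response on §4 could be assembled with a small aggregation step like the one below. This is a sketch under stated assumptions: the record layout `(model, sub_task, score)`, the function names, and the choice of mean with sample standard deviation are all hypothetical, and no values from the paper are used.

```python
from collections import defaultdict
from statistics import mean, stdev


def summarize(records):
    """Group (model, sub_task, score) records into mean, std, and count per cell."""
    cells = defaultdict(list)
    for model, sub_task, score in records:
        cells[(model, sub_task)].append(score)
    summary = {}
    for key, scores in cells.items():
        # Sample standard deviation needs at least two scores; report 0.0 otherwise.
        spread = stdev(scores) if len(scores) > 1 else 0.0
        summary[key] = (mean(scores), spread, len(scores))
    return summary


def format_table(summary):
    """Render the summary as plain-text rows sorted by model and sub-task."""
    lines = ["model | sub_task | mean | std | n"]
    for (model, sub_task), (avg, spread, n) in sorted(summary.items()):
        lines.append(f"{model} | {sub_task} | {avg:.3f} | {spread:.3f} | {n}")
    return "\n".join(lines)
```

Reporting the spread and the sample count alongside each mean is one way to let readers judge whether differences between models are larger than run-to-run variation.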
Circularity Check
No significant circularity in benchmark construction
The paper proposes PlanViz as a new benchmark with three sub-tasks (route planning, work diagramming, web&UI displaying), human-annotated data under quality control, and the PlanScore metric. No equations, fitted parameters, or derivations are present that reduce to inputs by construction. The central claims rest on independent data curation and metric definition rather than self-referential definitions, self-citation chains, or renamed known results. This is standard benchmark design with no load-bearing circular steps.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the three sub-tasks (route planning, work diagramming, web&UI displaying) represent key planning capabilities required in computer-use tasks.
Forward citations
Cited by 1 Pith paper
- G$^2$TR: Generation-Guided Visual Token Reduction for Separate-Encoder Unified Multimodal Models
  G$^2$TR reduces visual tokens and prefill computation by 1.94x in separate-encoder UMMs via generation-guided importance from VAE latent consistency while preserving reasoning accuracy and editing quality.